HIGHLIGHTED ARTICLE
GENETICS | INVESTIGATION

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda,*,1 Mark N. Kvale,* Thomas J. Hoffmann,*,† Stephanie E. Hesselson,* Dilrini Ranatunga,‡ Hua Tang,§ Chiara Sabatti,** Lisa A. Croen,‡ Brad P. Dispensa,* Mary Henderson,‡ Carlos Iribarren,‡ Eric Jorgenson,‡ Lawrence H. Kushi,‡ Dana Ludwig,‡ Diane Olberg,‡ Charles P. Quesenberry Jr.,‡ Sarah Rowell,‡ Marianne Sadler,‡ Lori C. Sakoda,‡ Stanley Sciortino,‡ Ling Shen,‡ David Smethurst,‡ Carol P. Somkin,‡ Stephen K. Van Den Eeden,‡ Lawrence Walter,‡ Rachel A. Whitmer,‡ Pui-Yan Kwok,* Catherine Schaefer,‡,1,2 and Neil Risch*,†,‡,1,2

*Institute for Human Genetics, University of California, San Francisco, California 94143-0794, †Department of Epidemiology and Biostatistics, University of California, San Francisco, California 94158-2549, ‡Kaiser Permanente Northern California Division of Research, Oakland, California 94612-2304, §Department of Genetics, Stanford University, Stanford, California 94305-5120, and **Department of Health Research and Policy, Stanford University, Stanford, California 94305-5405

ABSTRACT Using genome-wide genotypes, we characterized the genetic structure of 103,006 participants in the Kaiser Permanente Northern California multi-ethnic Genetic Epidemiology Research on Adult Health and Aging Cohort and analyzed the relationship to self-reported race/ethnicity. Participants endorsed any of 23 race/ethnicity/nationality categories, which were collapsed into seven major race/ethnicity groups. By self-report the cohort is 80.8% white and 19.2% minority; 93.8% endorsed a single race/ethnicity group, while 6.2% endorsed two or more. Principal component (PC) and admixture analyses were generally consistent with prior studies.
Approximately 17% of subjects had genetic ancestry from more than one continent, and 12% were genetically admixed, considering only nonadjacent geographical origins. Self-reported whites were spread on a continuum along the first two PCs, indicating extensive mixing among European nationalities. Self-identified East Asian nationalities correlated with genetic clustering, consistent with extensive endogamy. Individuals of mixed East Asian–European genetic ancestry were easily identified; we also observed a modest amount of European genetic ancestry in individuals self-identified as Filipinos. Self-reported African Americans and Latinos showed extensive European and African genetic ancestry, and Native American genetic ancestry for the latter. Among 3741 genetically identified parent–child pairs, 93% were concordant for self-reported race/ethnicity; among 2018 genetically identified full-sib pairs, 96% were concordant; the lower rate for parent–child pairs was largely due to intermarriage. The parent–child pairs revealed a trend toward increasing exogamy over time; the presence in the cohort of individuals endorsing multiple race/ethnicity categories creates interesting challenges and future opportunities for genetic epidemiologic studies.

KEYWORDS RPGEH GERA; population structure; principal components; admixture; race/ethnicity

Copyright © 2015 by the Genetics Society of America
doi: 10.1534/genetics.115.178616
Manuscript received January 16, 2015; accepted for publication June 2, 2015; published Early Online June 19, 2015.
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1.
The accession number for the Research Program on Genes, Environment, and Health Parent Level/Study on the database of Genotypes and Phenotypes (dbGaP) is phs00078. The title of the dbGaP Parent study is "The Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH)." GO Project IRB: CN-09CScha-06-H.
1 Corresponding authors: University of California San Francisco, Institute for Human Genetics, 513 Parnassus Ave., San Francisco, CA 94143-0794. E-mail: banday@humgen.ucsf.edu; University of California San Francisco, Institute for Human Genetics, 513 Parnassus Ave., San Francisco, CA 94143-0794. E-mail: rischn@humgen.ucsf.edu; Kaiser Permanente Northern California, Division of Research, 2000 Broadway, Oakland, CA 94612-2304. E-mail: cathy.schaefer@kp.org
2 These authors contributed equally to this work.
Genetics, Vol. 200, 1285–1295 August 2015

POPULATION genetic structure analyses have recently increased in number due to improvements in capabilities to perform large-scale genomic investigations. Technological developments have improved our ability to address questions associated with phenotypic variation (Wellcome Trust Case Consortium 2007), human genetic variation (Jakobsson et al. 2008; Li et al. 2008), and evolution (Lohmueller et al. 2008). These studies play an important role in a variety of applied settings, including genome-wide association studies (GWAS), admixture analyses, and dissection of traits associated with ancestry. For example, in association studies, error rates due
to confounding by ancestry can be reduced when population structure is taken into account (Tian et al. 2008a). At the same time, the relationship between self-identified race/ethnicity/nationality and genetic ancestry based on genetic marker data has become a topic of great interest (Risch et al. 2002; Burchard et al. 2003; Cooper et al. 2003).

Studies of human evolution have typically focused on indigenous population samples broadly distributed geographically across the globe. One such resource that has been highly exploited for this purpose is the Human Genome Diversity Project panel of 55 indigenous populations (Jakobsson et al. 2008; Li et al. 2008). On the other hand, GWAS utilizing U.S.-based samples often include more heterogeneous populations in terms of ancestry, although the number of ethnic groups included is typically limited.

In the present study, we utilize the large, ethnically diverse Kaiser Permanente (KP) Research Program on Genes, Environment, and Health (RPGEH) Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort to examine the question of genetic ancestry in a representative northern California population and how it relates to racial/ethnic self-identification. The cohort consists of 103,006 adult members of Kaiser Permanente Northern California (KPNC), ranging in age from 18 to 100 years at enrollment. The cohort was created to enable studies of genetic and environmental influences on many different health conditions and traits by linking high-density genome-wide SNP data with comprehensive longitudinal clinical information from electronic health records (EHR), as well as self-reported data on demographic factors and health behaviors from a structured survey. The GERA cohort is one of the first very large, multi-ethnic cohorts created for GWAS for a wide variety of health conditions. The cohort was genotyped using custom ancestry-specific SNP arrays to better capture rare variants specific to different ethnic groups and provide better genome-wide coverage, thus permitting investigation of potential associations that may differ between groups. Understanding and characterizing the genetic diversity within a sample is essential to GWAS, since population structure both within and between groups can lead to artifactual associations. The multi-ethnic GERA cohort thus provides an unprecedented opportunity to understand human genetic diversity in a U.S. population sample. This article presents the results of analyses of population genetic structure, confirming previous observations but also adding further understanding of mixed genetic ancestry, including the extent of distant vs. recent admixture. We also provide estimates of principal components needed for adjustment of population structure in GWAS and examine the self-reported race/ethnicity distribution of first-degree relative (parent–child and full sib) and monozygotic (MZ) twin pairs. Finally, we examine how the identified genetic structure correlates with participants' self-identification in terms of race/ethnicity/nationality.

Materials and Methods

Participants

Individuals composing the GERA cohort are participants in the KPNC RPGEH. KPNC is an integrated health care delivery system with 3 million members in northern California. The membership is representative of the general population with respect to race/ethnicity and socioeconomic status, although extremes of income are under-represented (Krieger et al. 1993). The RPGEH was established as a resource for research on genetic and environmental influences on health and disease. The development of the RPGEH and the GERA cohort is described elsewhere (dbGaP phs000674.v1.p1). Briefly, adult members of KPNC were asked to complete a mailed survey; survey respondents then completed a broad written consent and provided a saliva sample for extraction of DNA. Participants self-reported their race, ethnicity, and nationality on the survey by endorsing as many of 23 race, ethnicity, and nationality categories as applied (Table 1 provides a list of the choices). Participants were asked their religion, and this question, in conjunction with a race/ethnicity/nationality question, was used to identify Ashkenazi individuals (those who responded "Ashkenazi Jewish" to the nationality question or "Jewish" to the religion question).

To maximize the diversity of the sample, the GERA cohort was formed by including all racial and ethnic minority participants with saliva samples (19% of the total); the remaining participants were drawn randomly from white non-Hispanic participants (81% of the total). Among cohort members, the average length of KPNC health plan membership was 23 years, providing extensive longitudinal data on diagnoses and procedures, laboratory test results, pharmaceutical prescriptions, radiological findings, and other clinical information from EHR for use in GWAS of health conditions and traits.

The Human Genome Diversity Project (HGDP) (Cavalli-Sforza 2005; Li et al. 2008) subjects were used to facilitate geographic interpretation of the GERA principal components.

Self-reported race/ethnicity

Self-reported race/ethnicity for each individual was derived from responses to the survey question on race/ethnicity/nationality (Table 1). Nationalities within a single race/ethnicity group were collapsed. Specifically, all East Asian nationalities (codes 10–15) were collapsed into a single East Asian group; all Pacific Islander nationalities (codes 16–18) were collapsed into a single Pacific Islander group; all Latino nationalities (codes 4–8) were collapsed into a single Latino category; all African descent populations (codes 1–3) were collapsed into a single group; all white-European ethnicities (codes 20–22) were collapsed into a single category; the single categories of South Asians and Native Americans remained as such. A small number of individuals (1%) had implausible race/ethnicity responses from the survey (e.g., checked off every category) or specified "other." For these individuals, we used KPNC administrative databases to assign race/ethnicity. For other individuals, a discrepancy was observed between their original and scanned survey responses. These subjects were also adjudicated to their original form results, as described in SI Methods (File S1).
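The collapsing rule above can be written as a small lookup. This is a minimal sketch, not the authors' code: the code ranges follow Table 1 as described in the text, while the group labels and the function name `collapse` are illustrative assumptions.

```python
# Survey codes follow Table 1; group labels are illustrative names,
# not identifiers taken from the study.
CODE_RANGES = {
    "African": range(1, 4),             # codes 1-3
    "Latino": range(4, 9),              # codes 4-8
    "South Asian": range(9, 10),        # code 9
    "East Asian": range(10, 16),        # codes 10-15
    "Pacific Islander": range(16, 19),  # codes 16-18
    "Native American": range(19, 20),   # code 19
    "White-European": range(20, 23),    # codes 20-22
}

CODE_TO_GROUP = {
    code: group for group, codes in CODE_RANGES.items() for code in codes
}


def collapse(codes):
    """Map endorsed survey codes to the set of major race/ethnicity groups.

    Code 23 ("Other ethnicity") maps to no group in this sketch; the text
    says such cases were resolved from administrative databases instead.
    """
    return {CODE_TO_GROUP[c] for c in codes if c in CODE_TO_GROUP}
```

For example, `collapse({10, 13})` yields a single East Asian group, while `collapse({4, 20})` yields both Latino and White-European, which is how a multi-category respondent would be carried forward.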

Genotyping and array assignment

To maximize genome-wide coverage of common and less common variants, four custom Affymetrix Axiom arrays (Hoffmann et al. 2011a,b) were designed for individuals of non-Hispanic white (EUR), East Asian (EAS), African American (AFR), and Latino (LAT) race/ethnicity. The number of SNPs varied among arrays, ranging from 674,518 on the EUR array to 893,631 on the AFR array (Hoffmann et al. 2011b). A total of 254,438 SNPs were common to all four arrays. Genotyping was performed at the University of California, San Francisco and is described elsewhere (Kvale et al. 2015).

The assignment of subjects to arrays was based on the race/ethnicity categories formed as described above (Table S2 and Table S3). Assignments were hierarchical to accommodate individuals reporting multiple racial/ethnic categories. Specifically, individuals reporting any Latino or Native American race/ethnicity/nationality (possibly in combination with other races/ethnicities/nationalities) were assigned to the LAT array, with the exception of individuals who reported African/African American race/ethnicity and Native American race/ethnicity, who were assigned to the AFR array, and individuals reporting East Asian race/ethnicity and Native American race/ethnicity, who were assigned to the EAS array. All other individuals reporting any African, African American, or Afro-Caribbean race/ethnicity but no Latino race/ethnicity were assigned to the AFR array. All those reporting any East Asian but not African, African American, Afro-Caribbean, or Latino race/ethnicity were assigned to the EAS array. Subjects reporting white-European American, South Asian, Middle Eastern, or Ashkenazi race/ethnicity but none of the previously mentioned races/ethnicities were assigned to the EUR array. Therefore, for example, individuals with European and East Asian race/ethnicity were assigned to the EAS array; individuals with African American and East Asian race/ethnicity/nationality were analyzed on the AFR array. The various arrays were designed to allow for the relevant admixture (Hoffmann et al. 2011b).
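The hierarchy described above can be sketched as an ordered sequence of checks. This is one possible reading of the prose, not the authors' implementation: the group labels and the function name `assign_array` are hypothetical, the handling of someone reporting Latino, African, and Native American together is an assumption (the text does not spell it out), and Pacific Islander-only respondents are left unassigned because the assignment rule does not mention them.

```python
def assign_array(groups):
    """Return 'LAT', 'AFR', 'EAS', 'EUR', or None for a set of group labels."""
    g = set(groups)
    if "Latino" in g:
        # Assumption: any Latino endorsement takes precedence.
        return "LAT"
    if "Native American" in g:
        # Stated exceptions: Native American combined with African or
        # East Asian goes to the AFR or EAS array, respectively.
        if "African" in g:
            return "AFR"
        if "East Asian" in g:
            return "EAS"
        return "LAT"
    if "African" in g:
        return "AFR"
    if "East Asian" in g:
        return "EAS"
    # White-European here includes the Middle Eastern and Ashkenazi codes.
    if g & {"White-European", "South Asian"}:
        return "EUR"
    return None  # not covered by the rules as described in the text
```

The two worked examples in the text are reproduced by this ordering: European plus East Asian maps to EAS, and African American plus East Asian maps to AFR.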

Quality control

High-quality genotype data for the GERA cohort were obtained by systematic examination and removal of SNP genotypes according to a specific protocol, as described in detail elsewhere (Kvale et al. 2015). For the genetic structure analyses, only SNPs that were common across all four arrays and that met a call rate threshold of 99.5% were considered. This set also excluded SNPs that showed extreme deviation from Hardy–Weinberg equilibrium (P < 10⁻⁵). This resulted in a set of 144,799 high-performing SNPs used in further analyses of population structure and admixture.
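A per-SNP filter of this kind might look like the sketch below. Several details are assumptions rather than facts from the text: the Hardy–Weinberg test shown is a 1-d.f. chi-square goodness-of-fit test (the study may have used an exact test), the call-rate comparison is taken as "at least 99.5%", and the function names are illustrative.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-d.f. chi-square test for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 d.f. equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_missing, min_call_rate=0.995, hwe_p=1e-5):
    """Keep a SNP if its call rate and HWE P-value clear both thresholds."""
    called = n_aa + n_ab + n_bb
    call_rate = called / (called + n_missing)
    return call_rate >= min_call_rate and hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_p
```

A SNP with genotype counts 250/500/250 and no missing calls passes, whereas one with no heterozygotes at all (500/0/500) fails the Hardy–Weinberg check.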

Principal components analysis

Filtering: Principal components analysis (PCA) was performed using the smartpca program, which is part of the EIGENSOFT 4.2 software package (Patterson et al. 2006). The initial PCA runs were performed separately for individuals genotyped on different arrays. The initial set of 144,799 high-performing SNPs (described above) that were common across all four array types was used in the preliminary analyses. When the HGDP samples were included in subsequent runs and projected onto the GERA principal components (PCs) to facilitate geographic interpretation, 43,988 high-performing SNPs were used. Initial analyses revealed that a number of individuals appeared to be discordant between their genetic ancestry and the array to which they were assigned, and the PCA was re-run after reclassifying these individuals (see SI Methods, File S1).

Table 1 Distribution of responses to survey question on race/ethnicity/nationality, along with proportion female and average ages

Code  Category                              No.     Proportion female  Mean age (SE)
1     African American                      3117    0.57               60.66 (0.24)
2     African                               129     0.43               52.90 (1.43)
3     Afro-Caribbean                        119     0.68               56.24 (1.30)
4     Mexican                               4613    0.56               56.67 (0.22)
5     Central-South American                1034    0.70               55.34 (0.46)
6     Puerto Rican                          322     0.69               56.68 (0.83)
7     Cuban                                 106     0.71               55.41 (1.42)
8     Other Latino/Hispanic                 1545    0.70               57.41 (0.38)
9     South Asian–Indian/Pakistani          575     0.42               54.58 (0.60)
10    Chinese                               3433    0.58               56.75 (0.25)
11    Japanese                              1739    0.61               61.56 (0.34)
12    Korean                                234     0.66               53.83 (1.04)
13    Filipino                              1708    0.59               55.59 (0.37)
14    Vietnamese                            317     0.50               53.23 (0.82)
15    Other Southeast Asia                  176     0.64               51.85 (1.10)
16    Native Hawaiian                       144     0.65               58.41 (1.23)
17    Samoan                                14      0.64               59.36 (3.44)
18    Other Pacific Islander                132     0.57               53.88 (1.35)
19    Native American Indian/Alaska Native  3884    0.66               61.20 (0.22)
20    White, European American              80,079  0.59               63.27 (0.05)
21    Middle Easterner                      914     0.43               62.18 (0.48)
22    Ashkenazi Jewish                      2399    0.66               62.49 (0.28)
23    Other ethnicity                       75      0.73               56.53 (1.64)

PC projection approach: PCA requires the inversion of a data matrix, which for very large data sets may be computationally challenging. For the East Asian, African American, and Latino subgroups in the GERA data set, the sample sizes were small enough so that all subjects within each subgroup were run together. For example, all 7520 East Asian subjects were run together in one PCA. The white-European American sample, however, is very large and required inverting an 80,000 by 80,000 (6.4 billion elements) matrix. Furthermore, the version of the smartpca program used at the time of analyses was not able to analyze the entire European ancestry sample of 83,000 individuals. Therefore, our approach was to select a large but manageable number of subjects on which to perform an initial PCA and then use the resulting SNP loadings to project the remaining subjects.

Because we planned to select a random subset of 20,000 individuals for the initial PCA on which the remaining subjects would be projected, we examined the effect of using different subsets by calculating the correlations of the SNP loadings for three different random subsets (Supporting Information, Table S1). The numbers of subjects in the three subsets were the following: 18,677 for set 1, 20,121 for set 2, and 17,691 for set 3. For the first six PCs, there was very good correlation of the SNP loadings for all three pairs of subsets, also suggesting that most of the signal regarding genetic structure is derived from the first six PCs. Given these results, we selected a random set of 20,000 European ancestry subjects and projected the remaining subjects onto the PCs obtained.
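The projection idea can be illustrated on toy data: estimate SNP loadings from a reference subset, then score held-out samples with those loadings. The sketch below is not smartpca. It uses plain power iteration for the first PC only, it centers genotypes by the reference means without the per-SNP variance normalization that a production tool would typically apply, and all names and genotype values are made up for illustration.

```python
import math


def column_means(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def center(rows, means):
    return [[g - m for g, m in zip(row, means)] for row in rows]


def leading_loadings(X, iters=100):
    """Unit-length SNP loadings of the first PC, by power iteration on X^T X."""
    n_snps = len(X[0])
    v = [1.0 / math.sqrt(n_snps)] * n_snps
    for _ in range(iters):
        scores = [sum(x * w for x, w in zip(row, v)) for row in X]  # X v
        v = [sum(s * row[j] for s, row in zip(scores, X)) for j in range(n_snps)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


def project(rows, means, loadings):
    """Score samples using means and loadings learned from the reference set."""
    return [sum(x * w for x, w in zip(row, loadings)) for row in center(rows, means)]


# Toy genotypes (0/1/2 allele counts): rows are samples, columns are SNPs.
reference = [[2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1],
             [0, 0, 2, 2], [0, 1, 2, 2], [1, 0, 2, 1]]
held_out = [[2, 2, 0, 0], [0, 0, 2, 2]]

means = column_means(reference)
loadings = leading_loadings(center(reference, means))
ref_scores = project(reference, means, loadings)
new_scores = project(held_out, means, loadings)
```

The held-out samples are never used to estimate the loadings, yet they fall on opposite sides of the first PC, alongside the reference samples they resemble. The sign of a PC is arbitrary, so only relative positions are meaningful.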

Since the SNPs used for the PCA and admixture estimation were common among all four genotyping arrays, it was possible to produce "global" PCA scores for the GERA subjects. Subsets of individuals from the EUR (15,500), AFR (3100), EAS (5600), and LAT (3000) arrays were used for the initial PCA, and the remaining subjects were projected onto these PCs to obtain PC scores for each individual. Table S4 shows the number of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs (File S1).

Genetic ancestry/admixture estimation

To determine individual ancestral admixture proportions in admixed subjects such as African Americans and Latinos (and others), the full maximum-likelihood software package frappe (Tang et al. 2005) was used. In this analysis, individual ancestry proportions are estimated by calculating the probability of a set of genome-wide genotypes in an individual as a weighted average of allele frequencies of putative ancestors, where the weights represent the admixture proportions. In general, the same HGDP population samples described above were used to derive allele frequencies for the ancestral groups.

Relationship determination

Relationships were determined using the software KING v1.4 (Manichaikul et al. 2010) with the robust version that allows for population substructure. KING provides standard thresholds for characterizing monozygotic twin, parent–child, and sibling relationships, which we followed. In our data, these relationships were clearly separated into distinct clusters. All subjects were included, irrespective of the array type used for their analysis. This analysis was based on the 144,799 high-performing SNPs common across the four arrays described above.
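As a rough illustration of threshold-based relationship calling, the sketch below classifies a pair from an estimated kinship coefficient and the proportion of SNPs with zero alleles shared identical by state (IBS0). The kinship cutoffs 0.354 and 0.177 are the inference ranges given in the KING documentation for duplicates/MZ twins and first-degree relatives. The IBS0 cutoff used to split parent–child from full-sib pairs is a hypothetical placeholder, as is the function name; the text does not report the exact values used in the study.

```python
def classify_pair(kinship, ibs0, ibs0_po_max=0.005):
    """Classify a pair of samples from KING-style summary statistics.

    kinship: estimated kinship coefficient for the pair.
    ibs0: proportion of SNPs at which the pair shares no alleles.
    ibs0_po_max: hypothetical cutoff; parent-child pairs share at least
        one allele at every SNP, so their IBS0 is close to zero.
    """
    if kinship > 0.354:
        return "MZ twin/duplicate"
    if kinship >= 0.177:
        return "parent-child" if ibs0 <= ibs0_po_max else "full sib"
    return "other"
```

For instance, a pair with kinship 0.25 and IBS0 of zero would be called parent–child, while the same kinship with IBS0 of 0.03 would be called full sib.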

Results

Distribution of race/ethnicity/nationality categories reported

This multi-ethnic cohort includes representation from a broad distribution of races/ethnicities/nationalities (Table 1). For individuals who reported more than one category, all categories are included; hence the numbers in Table 1 sum to greater than 103,006, the total cohort size. All of the major continents are represented, and many nationalities/ethnicities. Collapsing the selections into race/ethnicity categories (see Materials and Methods), of the 106,733 total selections, 3365 (3.2%) include an African/African American race/ethnicity, 7620 (7.1%) include a Latino race/ethnicity, 575 (0.5%) include South Asian race/ethnicity, 7607 (7.1%) include an East Asian race/ethnicity, 290 include a Pacific Islander race/ethnicity (0.3%), 3884 (3.6%) include Native American race/ethnicity, and 83,392 (78.1%) include a white-European race/ethnicity. The majority of those endorsing a Latino race/ethnicity are Mexican and Central American, while the largest groups endorsing an East Asian race/ethnicity are Chinese, Japanese, and Filipino. We also examined the sex and age distributions across the different categories (Table 1). Compared to those reporting white-European race/ethnicity, those endorsing African/Afro-Caribbean/Latino, East Asian, and Pacific Islander race/ethnicity are younger; with the exception of those reporting Mexican nationality, the Latino groups tend to have a higher proportion of females, as do those reporting Ashkenazi Jewish ethnicity; those reporting South Asian and Middle Eastern nationalities have a lower proportion of females.
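The percentages quoted above can be checked directly from the counts. The short sketch below uses only numbers given in the text; the dictionary keys are illustrative labels.

```python
TOTAL_SELECTIONS = 106_733

# Collapsed selection counts as reported in the text.
counts = {
    "African/African American": 3365,
    "Latino": 7620,
    "South Asian": 575,
    "East Asian": 7607,
    "Pacific Islander": 290,
    "Native American": 3884,
    "White-European": 83_392,
}

# Share of all selections, in percent, rounded to one decimal place.
shares = {group: round(100 * n / TOTAL_SELECTIONS, 1) for group, n in counts.items()}
```

The seven counts add up to the stated 106,733 selections, and each rounded share matches the percentage reported for its group.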

Structure of individuals run on the EUR array

Individuals who self-reported Ashkenazi, Middle Eastern, and non-Hispanic white or European race/ethnicity but no other ethnicities were run on the EUR array and analyzed together. The initial analysis showed, as expected, a clear Ashkenazi cluster and a larger cluster depicting the northwest–southeast European cline (Price et al. 2008; Tian et al. 2008c). Figure S1A shows those who self-reported a single ethnicity/nationality, while Figure S1B shows individuals who self-reported more than one. It is evident that endorsement of more than one ethnicity can imply mixed genetic ancestry, but not automatically. Comparing Figure S1, A and B, we observe a higher proportion of individuals with mixed genetic ancestry among those who endorsed both Ashkenazi and European or Middle Eastern ethnicity; however, we still observe a large proportion of nonadmixed individuals, suggesting that endorsement of Ashkenazi and European may reflect a joint perception of ethnicity and continent of origin. By contrast, in Figure S1A, we observe a substantial number of individuals who appear to have Ashkenazi and European admixture but self-reported a single category only (most often European).

A similar observation can be made about those endorsing Middle Eastern ethnicity, where those endorsing that as a sole response appear to have more Middle Eastern genetic ancestry, while those endorsing Middle Eastern and European ethnicity show more evidence of European genetic ancestry. However, in Figure S1A, we also observe substantial numbers of individuals reporting only European ethnicity whose genetic ancestry appears to be Middle Eastern, and vice versa. Again, these reports may reflect recent geographic origin as well as nationality/ethnicity.

We also repeated the PC analysis after removing the Ashkenazi and part-Ashkenazi subjects. The PC scores for the Ashkenazi subjects were then derived by projecting their genotypes onto the resulting PCs. Individuals reporting a single ethnicity/nationality are depicted in Figure S2A, while those endorsing more than one are displayed in Figure S2B. The first PC corresponds to a northwest–southeast cline through Europe and the Middle East, and the second PC corresponds to a southwest–northeast cline within Europe, as has been observed in numerous previous studies (Menozzi et al. 1978; Sokal et al. 1991; Cavalli-Sforza et al. 1993; Cavalli-Sforza et al. 1996; Barbujani and Bertorelle 2001; Belle et al. 2006; Seldin et al. 2006; Bauchet et al. 2007; Novembre et al. 2008; Price et al. 2008; Tian et al. 2008c). The first and second PCs account for 31.9% and 13.4% of the total variance of the first 10 PCs, respectively.
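The variance figures above are expressed relative to the first 10 PCs rather than to all PCs. Assuming this means each eigenvalue divided by the sum of the 10 largest eigenvalues, the calculation could be sketched as follows; the function name and any eigenvalues passed to it are illustrative, not values from the study.

```python
def variance_shares(eigenvalues, top=10):
    """Fraction of variance per PC, relative to the `top` largest eigenvalues."""
    leading = sorted(eigenvalues, reverse=True)[:top]
    total = sum(leading)
    return [ev / total for ev in leading]
```

Because the denominator excludes all later PCs, these shares are larger than the corresponding fractions of the total genome-wide variance would be.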

Subjects who self-identified as South Asian were also run on the EUR array and subjected to a separate PCA. For these subjects, to characterize the observed PCs and the relationship to geographic ancestry, we employed onomastics. In particular, we analyzed surnames to characterize individuals based on surname geographic region of origin. These subjects are mainly of Indian origin, and the clusters formed in the PCA depict subgroups from different regions of India (Figure S3). The first PC accounts for 19.1% of the total variance of the first 10 PCs, and the second PC accounts for 10.0%. The analysis also shows that northern Indians are genetically closer to Europeans (Reich et al. 2009) and eastern Indians are genetically more similar to East Asian populations. As expected, those reporting European as well as South Asian ethnicity are positioned closer in the diagram to the HGDP Europeans.

Structure of individuals run on the EAS array

Individuals run on the EAS array included subjects self-reporting European and East Asian race/ethnicity and those reporting solely East Asian race/ethnicity. The first PC for these individuals (Figure S4, A and B) is responsible for clustering of individuals with different East Asian–European ancestry proportions (mostly 50% or 75% European). Those with genetic ancestry that is both East Asian and European are most clearly observed in Figure S4B among those self-reporting both races/ethnicities, and there are very few GERA individuals in this figure who do not have mixed genetic ancestry. Among individuals reporting only an East Asian nationality (Figure S4A), the large majority have only East Asian genetic ancestry; however, there are also individuals who appear to have mixed East Asian–European genetic ancestry who self-reported only their East Asian nationality. Of particular interest is the continuous nature of a modest amount of European genetic ancestry in self-identified Filipinos, consistent with older European admixture. The second PC corresponds to the north-to-south cline in East Asia (Su et al. 1999; Tian et al. 2008b; HUGO Pan-Asian SNP Consortium 2009), and the distinct clusters observed that represent different East Asian nationalities are consistent with extensive endogamy in these groups. The first and second PCs account for 59.71% and 20.39% of the total variance of the first 10 PCs, respectively.

Individuals endorsing a Pacific Islander ethnicity are displayed in Figure S5. Those also reporting an East Asian ethnicity appear to cluster more closely to the HGDP East Asians, while those also reporting European ethnicity appear to cluster more closely to the HGDP Europeans. While those reporting Hawaiian and Samoan ethnicity are reasonably well separated from both the HGDP Europeans and East Asians, some individuals who identified as "other Pacific Islander" appear to overlap quite closely with the HGDP East Asians. Also of interest, another subgroup of "other Pacific Islanders" appears to form its own cluster at the bottom of Figure S5. We note that a number of these individuals self-reported both Pacific Islander and South Asian ethnicity. Based on onomastics, these individuals have Indian surnames and are likely to be Indo-Fijians. Approximately 37.5% of the population of Fiji is of Indian origin according to the 2007 census (http://www.statsfiji.gov.fj). The observation that some Pacific Islanders cluster near to the East Asians is also an indication that clear separation of genetic ancestry for these groups is likely to be challenging.

Structure of individuals run on the AFR array

Subjects run on the AFR array revealed, as expected, extensive African and European genetic ancestry (Figure S6, A and B) (Parra et al. 1998; Fernandez et al. 2003; Tang et al. 2006; Tishkoff et al. 2009; Zakharia et al. 2009). The first PC, which accounts for 63.8% of the total variance of the first 10 PCs, reflects African vs. European genetic ancestry, while the second PC denotes East Asian and/or Native American genetic ancestry. This is consistent with the array assignments, whereby individuals reporting both African/African American race/ethnicity and East Asian or Native American race/ethnicity were assigned to the AFR array. Individuals who self-reported African ancestry only were also subject to onomastics to determine likely countries of origin. We were able to identify subjects of Ethiopian, Eritrean, and Kenyan nationality. For the Kenyans, Figure S6A indicates a location consistent with 100% African genetic ancestry. By contrast, the Ethiopian/Eritrean subjects occupy an intermediate position on the PC1 axis, suggesting proximity to European/Middle Eastern populations. Also of note is the modest variation in their PC1 scores. This is likely due to ancient admixture with Middle Eastern populations (Hodgson et al. 2014). These results confirm that Ethiopians have a unique genetic structure among African populations.

Individuals self-reporting mixed African and East Asian race/ethnicity generally reflect that admixture from the genetic perspective as well (Figure S6B); however, a number of individuals who reported only African American ethnicity also appear to have similar levels of East Asian admixture (Figure S6A). Those reporting both African American and European ethnicity generally occupy a position on the PC1 axis closer to Europeans than those who do not (Figure S6B).

The mean African ancestry proportion in this sample is 73.6% ± 17.4%. There is a reasonably high level of variation in the African genetic ancestry proportion, ranging from 10.6% to 100%.

Structure of individuals run on the LAT array

Latinos may have ancestry deriving from multiple continents, including Europe, Africa, Asia, and the Americas (Bonilla et al. 2004; Tang et al. 2006, 2007). Figure S7A provides the PCA results for all those who endorsed Latino or Native American as their sole race/ethnicity. PC1 represents the European vs. Native American axis of genetic variation, and PC2 represents the African axis of genetic variation. PC1 and PC2 account for 70.95% and 11.57% of the total variance of the first 10 PCs, respectively. Nearly all Latinos show evidence of European/West Asian genetic ancestry, and a substantial subset also show evidence of African genetic ancestry. Similarly, all individuals self-reporting Native American race/ethnicity show some degree of European/West Asian genetic ancestry. Latinos of different nationalities exhibit varying proportions of European, African, and Native American ancestries (Figure S7B). Those reporting Mexican and Central-South American nationality have genetic ancestry that is primarily European and Native American, with slight but varying amounts of African ancestry. Those reporting Cuban nationality have primarily European genetic ancestry, with a small number of individuals having primarily African genetic ancestry. Those reporting Puerto Rican nationality show some Native American genetic ancestry but are primarily admixed between European and African genetic ancestry. Individual ancestral admixture proportions were determined for these subjects and are provided in Table S5.

The LAT array also included a variety of individuals who self-reported more than one race/ethnicity. These individuals are represented in Figure S7C. Individuals who reported European as well as Latino race/ethnicity tend to have slightly more European genetic ancestry than those who did not; similarly, a number of individuals who reported African/African American race/ethnicity in addition to Latino race/ethnicity have substantial African genetic ancestry; however, many such individuals also appear to have the same modest degree of African genetic ancestry as those who reported only a Latino race/ethnicity. Those who reported Native American race/ethnicity in addition to Latino race/ethnicity also appear to have slightly increased Native American genetic ancestry. Those who reported European and Native American race/ethnicity appear to be similar to those who solely reported Native American race/ethnicity: all have European/West Asian genetic ancestry, and while some show evidence of Native American genetic ancestry, European/West Asian is the sole or primary genetic ancestry for the majority. For those with 100% European genetic ancestry and who self-reported only European and Native American race/ethnicity (n = 2155), we also calculated European PCs. Finally, those who reported East Asian in addition to Latino race/ethnicity generally have evidence of East Asian genetic ancestry (as observed in Figure S7C by proximity to the HGDP East Asians), ranging from 25% to 50% and 100%.

Global PCA for GERA subjects

Figure S8 shows that the first PC mainly separates Europeans from East Asians (and Native Americans), and PC2 separates Africans from all the other groups. PC3 seems to separate Native Americans from the other groups, and PC4 also separates Native Americans from the other groups but also shows some separation among the Europeans. PC5 separates the different East Asian groups (mainly north vs. south) and also East Asians from Oceania, and PC6 separates Central–South Asians from the other groups. PC7 again separates the various East Asian regions, and PC8 separates the European groups (mainly north to south). PC9 and PC10 separate East Asians from Oceania, but also the Russians (not labeled) are separated from the other European groups.

Relationship between self-reported race/ethnicity and genetic ancestry

Table S6 displays the full relationship of self-reported race/ethnicity to genetic ancestry for the six continental genetic ancestries of Europe/West Asia, Africa, East Asia, Pacific Islands, South Asia, and the Americas. A genetic continental ancestry was assigned to an individual if her/his estimate for that ancestry was at least 5%. A total of 91,502 individuals (93.9%) reported a single race/ethnicity, 5475 individuals reported two races/ethnicities (5.9%), and 512 individuals (0.5%) reported three races/ethnicities (Table 2). As expected, all individuals who self-identified as European/West Asian had evidence of European/West Asian genetic ancestry. The next largest genetic ancestry component in this group was South Asian (4.3%), primarily attributable to individuals of West Asian ethnicity. Because there is a continuum of genetic ancestry from Europe to West Asia, Central-South Asia to East Asia, genetic overlap exists for individuals whose national origins are geographically between these divisions (Li et al. 2008). Nearly 1% of this group also had evidence of Native American genetic ancestry, while a smaller fraction had evidence of African or East Asian genetic ancestry (0.3% and 0.4%, respectively). Nearly all individuals (99.7%) self-reporting African/African American race/ethnicity had evidence of African genetic ancestry; 91% also had evidence of European genetic ancestry, consistent with broad European admixture among African Americans. Native American and East Asian genetic ancestry occurred in this group at a similar low level as observed in the Europeans/West Asians (1.3% and 0.5%, respectively). Among self-reported East Asians, all had evidence of East Asian genetic ancestry; a sizable proportion (21.7%) also had evidence of Pacific Islander genetic ancestry, but this likely represents difficulty in differentiating East Asian and Pacific Islander genetic ancestry. A modest subgroup (3.4%) had evidence of European/West Asian genetic ancestry (majority are self-reported Filipinos), while small proportions had evidence of African or Native American genetic ancestry (0.1% and 0.5%, respectively). Among the Latinos, nearly all had evidence of European/West Asian genetic ancestry; a similar high proportion (94.2%) had evidence of Native American genetic ancestry, and an additional 27.7% had evidence of African ancestry. A substantial number of self-reported Pacific Islanders had evidence of East Asian genetic ancestry (91.3%) in addition to Pacific Islander genetic ancestry (66.3%); these results are again likely due to close genetic similarity between East Asians and Pacific Islanders. There is also evidence of substantial European/West Asian and South Asian genetic ancestry in this group (57.6% and 26.1%, respectively). The former reflects a high rate of European admixture among some self-reported Pacific Islander groups, while the latter likely reflects Fijians of Indian origin. Most self-reported South Asians have evidence of South Asian genetic ancestry; a substantial proportion also has evidence of European or East Asian genetic ancestry, likely due to inability to cleanly separate South Asian genetic ancestry from West Asian or East Asian (Li et al. 2008). Among those reporting Native American race/ethnicity, 14.4% have evidence of Native American genetic ancestry and all have evidence of European/West Asian genetic ancestry.
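The 5% assignment rule described above can be sketched as follows. The function names and the four example individuals are hypothetical; only the threshold and the six ancestry labels come from the text and Table 2.

```python
ANCESTRIES = ("EW", "AF", "EA", "NA", "PI", "SA")
THRESHOLD = 0.05  # "at least 5%", as stated in the text


def assigned_ancestries(proportions, threshold=THRESHOLD):
    """Return the ancestries whose estimated proportion meets the threshold."""
    return [name for name in ANCESTRIES if proportions.get(name, 0.0) >= threshold]


def fraction_with(individuals, ancestry, threshold=THRESHOLD):
    """Fraction of a self-report group assigned a given ancestry (one table cell)."""
    hits = sum(ancestry in assigned_ancestries(p, threshold) for p in individuals)
    return hits / len(individuals)


# Hypothetical admixture estimates for four people (not GERA data).
group = [
    {"EW": 0.22, "AF": 0.78},
    {"EW": 0.04, "AF": 0.96},
    {"EW": 0.50, "AF": 0.45, "NA": 0.05},
    {"AF": 1.00},
]

for ancestry in ("EW", "AF", "NA"):
    print(ancestry, fraction_with(group, ancestry))
```

Each entry of a table built this way is the fraction of a self-report group whose estimate for one ancestry reaches the threshold, so the entries in a row need not sum to 1.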

For those with missing or mis-scanned self-reported race/ethnicity and whose race/ethnicity was derived from KP administrative databases (Table 3 and Table S7), results align closely with those in Table 2. For individuals self-reporting two or three races/ethnicities, the correspondence between the self-report and genetic ancestry is generally quite high (Table 2).

We also observed a decrease in average age and increasing proportion of females with the number of different race/ethnicity/ancestry groups reported (Table 2). While the different minority groups, and in particular the self-reported East Asians and Latinos, are younger on average, those reporting mixed race/ethnicity are even younger. These patterns likely reflect increasing exogamy over time. As expected, these patterns are also reflected in the genetic PC scores, where, for example, the proportion of mixed East Asian/European genetic ancestry increases with decreasing age. The excess of females among those reporting mixed race/ethnicity appears to reflect a reporting preference, as there was no significant difference in the proportion of individuals with mixed genetic ancestry by sex.

A more in-depth examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups is provided in Table S8.

Relatives

We were able to clearly identify first-degree relative (parent–child and full sib) and MZ twin pairs and categorized them based on self-reported race/ethnicity (Figure S9 and Table S9). We also observed thousands of likely second- and third-degree relatives (Figure S9); however, the figure also indicates substantial overlap between these groups based on kinship estimates.
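For illustration, pairs can be binned with the kinship-coefficient cutoffs conventionally published for KING (Manichaikul et al. 2010), using the proportion of zero-IBS SNPs to split first-degree pairs. This is a sketch under stated assumptions: the text does not say which cutoffs were applied here, the function name is hypothetical, and the `ibs0_cutoff` value is a placeholder that would need tuning to the genotyping error rate.

```python
# Conventional KING bin edges: 2^-1.5, 2^-2.5, 2^-3.5, 2^-4.5
# (approximately 0.354, 0.177, 0.0884, 0.0442).
MZ_CUTOFF = 2 ** -1.5
FIRST_DEGREE_CUTOFF = 2 ** -2.5
SECOND_DEGREE_CUTOFF = 2 ** -3.5
THIRD_DEGREE_CUTOFF = 2 ** -4.5


def classify_pair(kinship, ibs0, ibs0_cutoff=0.001):
    """Classify a pair from a kinship coefficient and its zero-IBS proportion.

    ibs0_cutoff is an assumed placeholder, not a value taken from the paper.
    """
    if kinship > MZ_CUTOFF:
        return "MZ twin / duplicate"
    if kinship > FIRST_DEGREE_CUTOFF:
        # A parent and child share an allele at every locus, so apart from
        # genotyping error they show essentially no zero-IBS SNPs.
        return "parent-child" if ibs0 < ibs0_cutoff else "full sib"
    if kinship > SECOND_DEGREE_CUTOFF:
        return "second degree"
    if kinship > THIRD_DEGREE_CUTOFF:
        return "third degree"
    return "unrelated"


print(classify_pair(0.25, 0.0))
print(classify_pair(0.25, 0.02))
```

Hard cutoffs like these will not cleanly separate second- from third-degree pairs when their kinship estimates overlap, as noted above for Figure S9.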

The 34 MZ pairs, who are perfectly concordant for genetic ancestry, are also perfectly concordant for self-reported race/ethnicity. Sib pairs are also (virtually) identical for genetic ancestry. We identified a total of 2018 sib pairs, 1936 (96%) of whom are concordant for self-reported race/ethnicity. Among the 82 discordant pairs, the majority (n = 66) involve pairs where one self-reports Native American or Latino race/ethnicity (solely or in combination with European/West Asian race/ethnicity), while the other reports only European/West Asian race/ethnicity (Table S10); in most of these cases the genetic ancestry is solely European/West Asian, although in some there is also evidence of Native American genetic ancestry. A modest number of pairs are also discordant in their reports of East Asian race/ethnicity, and again, for most of these the genetic ancestry is solely European/West Asian. Similarly, a few pairs with mixed genetic ancestry including African are discordant in terms of self-reporting of African American race/ethnicity.

We identified 3741 parent–child pairs, of which 3478 (93%) were concordant for self-identified race/ethnicity. The lower rate of concordance compared to the sib pairs is not surprising, as parent and child reports may differ if the child’s parents are of different race/ethnicity. In 116 of 263 discordant pairs (Table S11), the child has genetic ancestry that her/his parent does not (Native American in 69 cases, East Asian in 41 cases, and African in 11 cases), and this difference is reflected in the self-report, where the child is self-reporting a race/ethnicity that the parent is not. By contrast, in only 9 cases did the parent have a genetic ancestry that the child did not, and in 8 of these 9 cases the parent has a low level of Native American ancestry (but ≥5%), whereas the child is below our 5% threshold. Interestingly, in 5 of these cases the parent self-reports as Latino race/ethnicity but the child does not, whereas the opposite is true in 3 of the 8 cases. In an additional 114 cases, the genetic information for parent and child matches but the self-reports for race/ethnicity are different. The largest subgroup (49) of these cases reflects differences in the reporting of Native American or Latino race/ethnicity, and in 47 of these there is no evidence of Native American genetic ancestry in the parent or child; it is approximately equally split as to whether the parent or child reports the Native American race/ethnicity. Among 53 cases where parent and child are discordant for self-report of Latino race/ethnicity, in 23 it is the child who self-reports Latino race/ethnicity, whereas the parent does not. There are 11 cases of discordance for self-report of East Asian race/ethnicity, and in nearly all of them there is no evidence of continental East Asian genetic ancestry. In slightly more than half of these cases, it is the parent who self-reports East Asian race/ethnicity.
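The concordance rates and discordant-pair counts quoted for sib and parent–child pairs follow directly from the reported totals, as this short check shows. The helper function is illustrative; the four counts are the ones given in the text.

```python
def concordance(concordant, total):
    """Return (concordance rate, number of discordant pairs)."""
    return concordant / total, total - concordant


sib_rate, sib_discordant = concordance(1936, 2018)
pc_rate, pc_discordant = concordance(3478, 3741)

print(f"full sibs:    {sib_rate:.1%} concordant, {sib_discordant} discordant")
print(f"parent-child: {pc_rate:.1%} concordant, {pc_discordant} discordant")
```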

Discussion

The RPGEH GERA cohort provides an excellent opportunity to characterize a large, representative northern California population from the perspectives of self-reported race/ethnicity/nationality and genetic ancestry. Overall, the cohort is 80.8% non-Hispanic white and 19.2% minority and includes a broad spectrum of races/ethnicities/nationalities. The results of our PC analyses to characterize genetic structure within each of the major race/ethnicity groups are largely consistent with prior reports.

For the non-Hispanic white individuals, we see a broad spectrum of genetic ancestry ranging from northern Europe to southern Europe and the Middle East. Within that large group, with the exception of Ashkenazi Jews, we see little evidence of distinct clusters. This is consistent with considerable exogamy within this group. By comparison, we do see structure in the East Asian population, correlated with nationality, reflecting continuing endogamy for these nationalities and also recent immigration. On the other hand, we did observe a substantial number of individuals who are admixed between East Asian and European ancestry, reflecting 10% of all those reporting East Asian race/ethnicity. The majority of these reflected individuals with one East Asian and one European parent or one East Asian and three European grandparents. In addition, we noted that for self-reported Filipinos, a substantial proportion have modest levels of European genetic ancestry, reflecting older admixture.

Table 2 Proportion of individuals with genetic ancestry from each of six ancestral populations by self-reported race/ethnicity

                                  Genetic ancestry
Race/ethnicity      n     EW     AF     EA     NA     PI     SA  Female  Mean age (SE)
One group      91,502                                              0.59  62.92 (0.04)
  EW           76,401  1.000  0.003  0.004  0.009  0.000  0.043    0.59  63.71 (0.05)
  AA             2679  0.910  0.997  0.005  0.013  0.000  0.021    0.57  61.28 (0.25)
  EA             6389  0.034  0.001  1.000  0.005  0.217  0.008    0.58  58.51 (0.18)
  NA              674  0.999  0.022  0.022  0.144  0.000  0.037    0.55  64.34 (0.51)
  LT             4807  0.999  0.277  0.008  0.942  0.000  0.024    0.58  57.92 (0.21)
  PI               92  0.576  0.000  0.913  0.000  0.663  0.261    0.48  56.89 (1.49)
  SA              460  0.307  0.007  0.109  0.004  0.050  0.961    0.39  54.29 (0.67)
Two groups       5476                                              0.67  57.37 (0.19)
  EW/AA           123  1.000  0.976  0.024  0.033  0.000  0.081    0.67  52.76 (1.50)
  EW/EA           572  0.960  0.005  0.942  0.014  0.063  0.080    0.68  49.13 (0.65)
  EW/NA          2548  1.000  0.008  0.007  0.096  0.000  0.024    0.68  61.63 (0.26)
  EW/LT          1564  1.000  0.071  0.010  0.710  0.000  0.068    0.68  54.05 (0.38)
  EW/PI            48  1.000  0.000  0.813  0.042  0.625  0.021    0.79  59.64 (2.00)
  EW/SA            44  0.955  0.000  0.068  0.045  0.000  0.682    0.66  53.55 (2.26)
  AA/EA            29  0.655  0.931  0.828  0.034  0.000  0.069    0.56  50.06 (2.46)
  AA/NA            99  1.000   0.99  0.000  0.051  0.000  0.030    0.68  59.67 (1.30)
  AA/LT           114  0.991  0.596  0.018  0.754  0.000  0.026    0.34  55.09 (1.42)
  AA/SA            13  0.167  0.167  0.167  0.083  0.250  0.833    0.17  54.33 (4.23)
  EA/LT            95  0.789  0.042  0.926  0.642  0.063  0.000    0.67  56.07 (1.44)
  EA/PI            40  0.275  0.025  1.000  0.000  0.475  0.025    0.60  56.93 (2.37)
  EA/SA            17  0.059  0.000  0.765  0.000  0.059  0.235    0.47  62.06 (2.88)
  NA/LT           129  1.000  0.140  0.031  0.953  0.000  0.047    0.68  58.22 (1.19)
  LT/PI            12  1.000  0.417  0.250  0.917  0.000  0.167    0.64  53.93 (3.95)
  LT/SA            10  0.600  0.000  0.400  0.600  0.200  0.500    0.63  61.50 (4.56)
Three groups      512                                              0.70  53.52 (0.75)
  EW/AA/NA        115  0.991  0.991  0.000  0.043  0.000  0.017    0.74  59.71 (1.58)
  EW/AA/LT         23  0.957  0.696  0.043  0.522  0.000  0.087    0.52  50.09 (4.11)
  EW/EA/NA         32  0.969  0.000  0.875  0.250  0.000  0.125    0.69  46.06 (3.11)
  EW/EA/LT         48  1.000  0.041  0.857  0.490  0.000  0.061    0.72  45.98 (2.49)
  EW/EA/PI         35  0.943  0.000  1.000  0.029  0.486  0.000    0.67  51.92 (3.02)
  EW/NA/LT        198  1.000  0.066  0.000  0.803  0.000  0.086    0.70  53.83 (0.99)

Only those with at most three self-reported race/ethnicities and three genetic ancestries are included; race/ethnicity categories with at least 10 members are shown. For individuals self-reporting two or three races/ethnicities, the correspondence between self-report and genetic ancestry is generally quite high. For example, for those reporting European/West Asian and East Asian race/ethnicity, 96% and 94% have evidence of European/West Asian and East Asian genetic ancestry, respectively; for those reporting African/African American and East Asian race/ethnicity, 93.1% and 82.8% have evidence of African and East Asian genetic ancestry, while 65.5% have evidence of European/West Asian genetic ancestry. Among those reporting European/West Asian and Native American race/ethnicity, 9.6% have evidence of Native American genetic ancestry; for those reporting African/African American and Native American race/ethnicity, 5.1% have evidence of Native American genetic ancestry. EW, European/West Asian; AA, African/African American/Afro-Caribbean; EA, East Asian; NA, Native American/Alaska Native; LT, Latino; PI, Pacific Islander; SA, South Asian. Genetic ancestry abbreviations are the same except for AF, which represents sub-Saharan African ancestry. Female is the proportion of females in each group.

As expected, most self-reported African Americans show some degree of European genetic ancestry, with an overall average of 26%. Among individuals self-reporting as African American and East Asian, all showed evidence of genetic ancestry from three continents: Africa, Europe/West Asia, and East Asia.

Latinos are the most complex from a genetic perspective, as they can possess genetic ancestry from essentially any of the major continents. Most of the Latinos in our study derive from Mexico and Central/South America, with smaller proportions from Puerto Rico and Cuba. These individuals have varying proportions of Native American, European, and African genetic ancestry. We also found evidence of East Asian genetic ancestry in some individuals, but these were primarily individuals who self-reported both East Asian and Latino nationalities.

Of note, 17% of the cohort had evidence of genetic ancestry from more than one continent. However, this does not mean that all or even most of these individuals represent recent continental admixture. As has been true in other analyses (Li et al. 2008), genetic similarity between West Asians and South Asians (and to some degree South Asians and East Asians) did not allow for a clear distinction among these genetic ancestries. As such, while some individuals were estimated to have South Asian genetic ancestry, this more likely reflects the difficulty in demarking West Asian vs. South Asian genetic ancestry. A similar situation holds for Pacific Islanders and East Asians, where we and others have shown strong genetic similarity for some Pacific Islander groups with East Asians. Also, some individuals may have reported more than a single race/ethnicity that may reflect recent country of origin in addition to, or rather than, more distant ancestry, with Indo-Fijians as one example.

If we include only individuals with genetic admixture from nonadjacent continents, the proportion with continental admixture is 12%. However, we also note that this fraction depends on our cutoff of 5% for defining genetic admixture, as well as some imprecision in the admixture estimation. Of course, a lower threshold would increase the proportion of the cohort that is considered to be genetically admixed, while a higher threshold would do the opposite.
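The dependence of the admixed fraction on the cutoff can be explored with a simple sweep. In this sketch the set of "adjacent" ancestry pairs is an assumption based on the similarities discussed above (West Asian/South Asian, South Asian/East Asian, East Asian/Pacific Islander), an individual counts as admixed across nonadjacent continents if any pair of assigned ancestries is nonadjacent, and the five example individuals are invented rather than drawn from GERA.

```python
from itertools import combinations

# Assumed "adjacent" ancestry pairs; not a definition given in the paper.
ADJACENT = {frozenset(pair) for pair in (("EW", "SA"), ("SA", "EA"), ("EA", "PI"))}


def is_admixed(proportions, cutoff, nonadjacent_only=False):
    """True if at least two ancestries reach the cutoff.

    With nonadjacent_only=True, at least one qualifying pair must be nonadjacent.
    """
    present = [name for name, value in proportions.items() if value >= cutoff]
    pairs = [frozenset(pair) for pair in combinations(present, 2)]
    if nonadjacent_only:
        pairs = [pair for pair in pairs if pair not in ADJACENT]
    return bool(pairs)


def admixed_fraction(cohort, cutoff, nonadjacent_only=False):
    """Fraction of the cohort classified as admixed at a given cutoff."""
    return sum(is_admixed(p, cutoff, nonadjacent_only) for p in cohort) / len(cohort)


# Hypothetical admixture estimates for five people.
cohort = [
    {"EW": 1.00},
    {"EW": 0.93, "SA": 0.07},
    {"EW": 0.97, "NA": 0.03},
    {"EW": 0.50, "EA": 0.50},
    {"AF": 0.80, "EW": 0.20},
]

for cutoff in (0.01, 0.05, 0.10):
    print(cutoff, admixed_fraction(cohort, cutoff), admixed_fraction(cohort, cutoff, True))
```

In this toy cohort, raising the cutoff from 1% to 10% lowers the admixed fraction, and restricting to nonadjacent pairs never raises it, which matches the direction of the effects described above.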

As expected in a large cohort such as this, we were easily able to identify a substantial number of close relatives, specifically 34 identical twin pairs, 2018 full-sib pairs, and 3741 parent–child pairs. We also had clear evidence of a large number of likely second- and third-degree relatives, but these kinship groups did not separate clearly from each other. More refined methods may be able to provide more precise kinship estimates.

A major goal was to examine the relationship between self-reported race/ethnicity and genetic ancestry. By and large, there was very high correspondence between the two, allowing for the broad range of genetic ancestry that exists among African Americans and Latinos. We were also able to compare the self-report data of identical twins, parent–child pairs, and sib pairs. All MZ twin pairs were concordant, as were most of the sib pairs. However, we did note that for some sib pairs the self-report data differed. For the majority of these, the discordance related to reporting of Native American or Latino race/ethnicity.

Table 3 Proportion of individuals with genetic ancestry from each of six ancestral populations by race/ethnicity as determined by KP administrative databases

                                     Genetic ancestry
Race/ethnicity         n     EW     AF     EA     NA     PI     SA
White               4575      1  0.007  0.009  0.017  0.001  0.030
African American     102  0.941  0.990  0.000  0.020  0.000  0.020
Asian                311  0.106  0.003  0.952  0.006  0.167  0.074
Latino               255  0.988  0.192  0.043  0.816  0.000  0.035
Other/uncertain       84  0.929  0.131  0.357  0.167  0.071  0.083

Abbreviations are the same as in Table 2.

The results obtained here are important for the study of complex genetic disease in this large population-based cohort through association studies, admixture analysis, and admixture mapping, and in particular for investigating observed ethnic variation in diseases and traits. As described previously (Risch et al. 2002; Tang et al. 2006), the strong correspondence also observed here between the social categories of race/ethnicity and genetic ancestry makes dissection of racial/ethnic differences challenging. The patterns that we observed reflect historical and recent mating practices and their impact on genetic variation. On a global level, geography continues to create strong local endogamy, which is also reflected among the recent U.S. migrant populations. However, the increasing frequency of interracial individuals that we observed in this cohort, a reflection of increasing exogamy, will enhance both the complexity of such analyses and the opportunities to investigate the genetic and environmental contributors to racial/ethnic differences. While the advent of myriad genetic markers can provide accurate estimates of individuals’ genetic ancestry, the social aspects of race/ethnicity may be more challenging to characterize. For example, in our study, considering the various combinations of 7 race/ethnicity categories that an individual could endorse, we observed 50 different combinations, and this does not include individuals who endorsed >3 (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.
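For context, the number of possible combinations of one to three of the 7 categories can be counted directly. This sketch assumes the 50 observed combinations refer to endorsements of at most three categories, which is how the sentence above reads; the variable names are illustrative.

```python
from math import comb

CATEGORIES = 7  # major race/ethnicity groups

# Number of distinct ways to endorse exactly k of the 7 categories.
possible = {k: comb(CATEGORIES, k) for k in (1, 2, 3)}
total_possible = sum(possible.values())

print(possible)
print(total_possible)
```

Under that assumption, the 50 observed combinations would cover most of the 63 that are possible with one to three endorsements.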

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment and Health and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans’ genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch, N., E. Burchard, E. Ziv, and H. Tang, 2002 Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 3: comment2007.

Seldin, M. F., R. Shigeta, P. Villoslada, C. Selmi, J. Tuomilehto et al., 2006 European population substructure: clustering of northern and southern populations. PLoS Genet. 2: e143.

Sokal, R. R., N. L. Oden, and C. Wilson, 1991 Genetic evidence for the spread of agriculture in Europe by demic diffusion. Nature 351: 143–145.

Su, B., J. Xiao, P. Underhill, R. Deka, W. Zhang et al., 1999 Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65: 1718–1724.

Tang, H., J. Peng, P. Wang, and N. J. Risch, 2005 Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28: 289–301.


Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray


GENETICS
Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest-southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A–F) Individuals distributed according to continental differentiation. Admixed individuals also observed.

Figure S9 Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (Identity by State), e.g., TT and CC.

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry and a smaller number (123) estimated to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that, for some individuals, the self-reported race/ethnicity information on the form was not consistent with the computerized information. Further investigation revealed that an artifact had occurred when the forms were originally scanned: a black mark on the glass scanning plate overlaid the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact, according to the original form, they did not. Because these individuals were assumed to have some African ancestry, our algorithm for array assignment placed them on the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented as necessary by other data on race/ethnicity in the KP administrative databases, and principal component scores were recalculated. This error affected only individuals with African American race/ethnicity originally recorded from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. On review of the race/ethnicity self-report of these individuals, the large majority self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those whose self-report race/ethnicity data were not missing or adjudicated due to scanning errors. Because of the time required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals had been assigned to arrays based on raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or reassignments were based on genetic information. The genetic information was used only to detect the scanning problems for some individuals, by prompting a comparison of their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
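The pairwise pruning rule described above can be illustrated with a short sketch. It is not the authors' pipeline: the function names `r_squared` and `ld_prune` are hypothetical, the greedy choice of keeping the SNP encountered first is an assumption (the text says only that one member of each pair was removed), and genotypes are assumed to be coded as 0/1/2 allele counts on a single chromosome.

```python
from typing import List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic SNP: correlation is undefined, treat as no LD.
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_prune(positions: Sequence[int],
             genotypes: Sequence[Sequence[float]],
             r2_max: float = 0.5,
             window_bp: int = 5_000_000) -> List[int]:
    """Return indices of SNPs kept after greedy pairwise pruning.

    A SNP is dropped if it lies within window_bp of an already kept SNP
    and its r^2 with that SNP exceeds r2_max (assumed tie-breaking rule:
    the SNP with the smaller position is the one retained).
    """
    order = sorted(range(len(positions)), key=lambda k: positions[k])
    kept: List[int] = []
    for i in order:
        in_ld = any(
            abs(positions[i] - positions[j]) <= window_bp
            and r_squared(genotypes[i], genotypes[j]) > r2_max
            for j in kept
        )
        if not in_ld:
            kept.append(i)
    return kept
```

With the thresholds in the text (r² of 0.5, 5 Mb), two perfectly correlated SNPs that sit 100 bp apart collapse to one, while an equally correlated SNP 10 Mb away is retained because it falls outside the window.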

For the initial individual admixture analyses, a set of 43,988 SNPs that were common between the HGDP data set and our set of 144,799 high-quality SNPs was used. A set of 38,301 SNPs (used for the global PCA run) remaining after LD pruning and removal of SNPs located in regions with structural variations was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority, the second continent was South Asia. Hence this likely does not reflect recent admixture but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans, which has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent: a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects the genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for nearly all of those with genetic ancestry from more than one continent, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.
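The continent counts discussed above depend on a presence threshold; the caption of Table S6 states that an ancestry was assigned when at least 5% of an individual's ancestry was estimated from that group. A minimal sketch of that rule follows. The function name `ancestries_present` and the example proportions are hypothetical and serve only as illustration.

```python
from typing import Dict, List


def ancestries_present(proportions: Dict[str, float],
                       threshold: float = 0.05) -> List[str]:
    """List the ancestry groups contributing at least `threshold`
    of an individual's estimated ancestry (5% per Table S6)."""
    return sorted(group for group, p in proportions.items() if p >= threshold)


# Hypothetical individual: 80% EW, 17% NA, 3% AF.
example = {"EW": 0.80, "NA": 0.17, "AF": 0.03}
present = ancestries_present(example)   # AF falls below the 5% cutoff
n_continents = len(present)
```

Under this rule the hypothetical individual would be counted in the two-continent (EW/NA) column, since the 3% African component is below the threshold.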

Table S1. Magnitude of correlation of PC loadings for three supersets

PC    Set 1-Set 2    Set 1-Set 3    Set 2-Set 3
 1       0.981          0.979          0.980
 2       0.876          0.853          0.856
 3       0.850          0.816          0.809
 4       0.766          0.776          0.752
 5       0.658          0.654          0.639
 6       0.605          0.604          0.592
 7       0.001          0.001          0.210
 8       0.043          0.179          0.045
 9       0.312          0.068          0.089
10       0.007          0.067          0.201

Table S2. Self-reported race/ethnicity versus genotyping array. Race/ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Genotyping array

Race/Ethnicity EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

Table S3. Race/ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis-scanned self-report data

Race/Ethnicity        EUR    EAS    AFR    LAT
White                2115    267   2140     55
African American        0      0     99      3
Hispanic                5      0      0    255
Asian                   5    183    123      1
Other/Uncertain         6     11     43     24
Total                2131    461   2405    338

Table S4. Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                               Number of SNPs after pruning
European/West Asian and Ashkenazi     38,050
European only                         37,967
South Asian                           38,823
East Asian                            38,077
Pacific Islander                      38,833
African American                      39,849
Latino                                38,365
All GERA                              38,301

Table S5. Individual ancestral admixture proportions for subjects run on the LAT array

Ancestral admixture proportion (%), mean ± SD

Nationality               African        European       Native American
Mexican                    2.3 ± 5.0     67.0 ± 14.5    30.7 ± 13.5
Central-South American     5.5 ± 10.8    65.4 ± 17.9    29.1 ± 16.3
Puerto Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0     7.6 ± 8.1
LAT mean                   4.4 ± 7.9     66.5 ± 15.8    28.6 ± 14.8

Table S6. Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations are as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Race/Ethnicity (rows) by Genetic Ancestry (columns, in order): EW, AF, EA, NA, SA, EW/AF, EW/EA, EW/NA, EW/SA, AF/EA, AF/SA, EA/NA, EA/PI, EA/SA, PI/SA, EW/AF/EA, EW/AF/NA, EW/AF/SA, EW/EA/NA, EW/EA/PI, EW/EA/SA, EW/NA/SA, EW/PI/SA, AF/EA/PI, AF/EA/SA, AF/PI/SA, EA/PI/SA, All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489

Table S7. Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race/Ethnicity         EW     AF     EA     SA  EW/AF  EA/NA  EA/PI  EW/EA  EW/NA  EW/SA  EA/SA  SA/PI  EW/AF/EA  EW/AF/NA  EW/AF/SA  EW/EA/NA  EW/EA/PI  EW/EA/SA  EW/SA/NA  EW/SA/PI  EA/SA/PI  Total
White                4302      0      0      0     25      0      0     33     68    131      0      0         0         3         2         3         2         3         2         1         0   4575
Afr Am                  0      6      0      0     92      0      0      0      1      0      0      0         0         1         2         0         0         0         0         0         0    102
Asian                   4      0    218      8      0      1     41     18      0      2      3      1         1         0         0         1         4         3         0         0         6    311
Latino                 37      0      3      0      3      0      0      2    151      0      0      0         1        44         1         5         0         0         8         0         0    255
Other-uncertain        34      0      3      0      6      0      0     15      8      2      1      0         2         2         1         3         4         0         1         0         2     84

Table S8. Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Genetic ancestry, number (%) of individuals:

No. of self-reported
race/ethnicities   Race/Ethnicity   One continent     Two continents    Three continents   All
One                EW               71992 (94.2)       4288 (5.6)        121 (0.2)         76401
One                AA                 230 (8.6)        2362 (88.2)        87 (3.2)          2679
One                LT                 235 (4.9)        3135 (65.2)      1437 (29.9)         4807
One                EA                4837 (75.7)       1419 (22.2)       133 (2.1)          6389
One                PI                   7 (7.6)          40 (43.5)        45 (48.9)           92
One                SA                 272 (59.1)        175 (38.1)        13 (2.8)           460
One                NA                 555 (82.3)         87 (12.9)        32 (4.8)           674
One                All              78128 (85.4)      11506 (12.6)      1868 (2.0)         91502   (93.9% of total)
Two                All               2791 (51.0)       2140 (39.1)       545 (10.0)         5476   (5.6% of total)
Three              All                 48 (9.4)         345 (67.5)       118 (23.1)          511   (0.5% of total)
All                All              80967 (83.1)      13991 (14.4)      2531 (2.6)         97489

Continental distribution:

No. of self-reported
race/ethnicities   Race/Ethnicity   1 continent   2 continents                          3 continents
One                EW               100% EW       75% EW/SA, 15% EW/NA
One                AA               100% AF       99% AF/EW
One                LT               99% EW        99% EW/NA                             90% EW/NA/AF
One                EA               100% EA       89% EA/PI, 8% EA/EW
One                PI                             48% PI/EW, 33% PI/EA                  71% PI/EW/EA
One                SA               94% SA        75% SA/EW, 14% SA/EA
One                NA               100% EW       77% EW/NA
Two                All              97% EW        60% EW/NA, 22% EW/EA, 13% EW/AF       29% EW/NA/AF, 23% EW/SA/NA, 17% EW/EA/NA
Three              All              90% EW        45% EW/NA, 36% EW/AF, 19% EW/EA       31% EW/EA/NA, 20% EW/AF/NA, 16% EW/SA/NA

Table S9. First-degree relatives organized by self-reported race/ethnicity

MZ pair

                    White   African American   Latino   Asian   Other/Uncertain
White                  28                  0        0       0                 0
African American        0                  0        0       0                 0
Latino                  0                  0        2       0                 0
Asian                   0                  0        0       4                 0
Other/Uncertain         0                  0        0       0                 0

Parent (column) - Offspring (row)

                    White   African American   Latino   Asian   Other/Uncertain
White                3044                  1       23       6                21
African American        8                 54        0       1                 1
Latino                122                  6      175       3                 0
Asian                  35                  2        0     205                 1
Other/Uncertain        32                  0        1       0                 0

Full sibs

                    White   African American   Latino   Asian   Other/Uncertain
White                1531                  2       33       5                30
African American                          45        7       0                 0
Latino                                            155       2                 2
Asian                                                     205                 1
Other/Uncertain                                                               0

Table S10. Race/ethnicity and genetic ancestry for sib pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity, Sib 2 Race/Ethnicity, Genetic Ancestry, Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

Table S11. Race/ethnicity and genetic ancestry for parent-child pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity, Child Race/Ethnicity, Parent Genetic Ancestry, Child Genetic Ancestry, Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


to confounding by ancestry can be improved when population structure is taken into account (Tian et al. 2008a). At the same time, the relationship between self-identified race/ethnicity/nationality and genetic ancestry based on genetic marker data has become a topic of great interest (Risch et al. 2002; Burchard et al. 2003; Cooper et al. 2003).

Studies of human evolution have typically focused on indigenous population samples broadly distributed geographically across the globe. One such resource that has been highly exploited for this purpose is the Human Genome Diversity Project panel of 55 indigenous populations (Jakobsson et al. 2008; Li et al. 2008). On the other hand, GWAS utilizing U.S.-based samples often include more heterogeneous populations in terms of ancestry, although the number of ethnic groups included is typically limited.

In the present study, we utilize the large, ethnically diverse Kaiser Permanente (KP) Research Program on Genes, Environment and Health (RPGEH) Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort to examine the question of genetic ancestry in a representative northern California population and how it relates to racial/ethnic self-identification. The cohort consists of 103,006 adult members of Kaiser Permanente Northern California (KPNC), ranging in age from 18 to 100 years at enrollment. The cohort was created to enable studies of genetic and environmental influences on many different health conditions and traits by linking high-density genome-wide SNP data with comprehensive longitudinal clinical information from electronic health records (EHR), as well as self-reported data on demographic factors and health behaviors from a structured survey. The GERA cohort is one of the first very large, multi-ethnic cohorts created for GWAS for a wide variety of health conditions. The cohort was genotyped using custom ancestry-specific SNP arrays to better capture rare variants specific to different ethnic groups and provide better genome-wide coverage, thus permitting investigation of potential associations that may differ between groups. Understanding and characterizing the genetic diversity within a sample is essential to GWAS, since population structure both within and between groups can lead to artifactual associations. The multi-ethnic GERA cohort thus provides an unprecedented opportunity to understand human genetic diversity in a U.S. population sample. This article presents the results of analyses of population genetic structure, confirming previous observations but also adding further understanding of mixed genetic ancestry, including the extent of distant vs. recent admixture. We also provide estimates of principal components needed for adjustment of population structure in GWAS and examine the self-reported race/ethnicity distribution of first-degree relative (parent-child and full sib) and monozygotic (MZ) twin pairs. Finally, we examine how the identified genetic structure correlates with participants' self-identification in terms of race/ethnicity/nationality.

Materials and Methods

Participants

Individuals composing the GERA cohort are participants in the KPNC RPGEH. KPNC is an integrated health care delivery system with 3 million members in northern California. The membership is representative of the general population with respect to race/ethnicity and socioeconomic status, although extremes of income are under-represented (Krieger et al. 1993). The RPGEH was established as a resource for research on genetic and environmental influences on health and disease. The development of the RPGEH and the GERA cohort is described elsewhere (dbGaP phs000674.v1.p1). Briefly, adult members of KPNC were asked to complete a mailed survey; survey respondents then completed a broad written consent and provided a saliva sample for extraction of DNA. Participants self-reported their race, ethnicity, and nationality on the survey by endorsing as many of 23 race, ethnicity, and nationality categories as applied (Table 1 provides a list of the choices). Participants were asked their religion, and this question, in conjunction with the race/ethnicity/nationality question, was used to identify Ashkenazi individuals (those who responded "Ashkenazi Jewish" to the nationality question or "Jewish" to the religion question).

To maximize the diversity of the sample, the GERA cohort was formed by including all racial and ethnic minority participants with saliva samples (19% of the total); the remaining participants were drawn randomly from white non-Hispanic participants (81% of the total). Among cohort members, the average length of KPNC health plan membership was 23 years, providing extensive longitudinal data on diagnoses and procedures, laboratory test results, pharmaceutical prescriptions, radiological findings, and other clinical information from EHR for use in GWAS of health conditions and traits.

The Human Genome Diversity Project (HGDP) (Cavalli-Sforza 2005; Li et al. 2008) subjects were used to facilitate geographic interpretation of the GERA principal components.

Self-reported race/ethnicity

Self-reported race/ethnicity for each individual was derived from responses to the survey question on race/ethnicity/nationality (Table 1). Nationalities within a single race/ethnicity group were collapsed. Specifically, all East Asian nationalities (codes 10-15) were collapsed into a single East Asian group; all Pacific Islander nationalities (codes 16-18) were collapsed into a single Pacific Islander group; all Latino nationalities (codes 4-8) were collapsed into a single Latino category; all African descent populations (codes 1-3) were collapsed into a single group; all white-European ethnicities (codes 20-22) were collapsed into a single category; the single categories of South Asians and Native Americans remained as such. A small number of individuals (1%) had implausible race/ethnicity responses from the survey (e.g., checked off every category) or specified "other." For these individuals, we used KPNC administrative databases to assign race/ethnicity. For other individuals, a discrepancy was observed between their original and scanned survey responses. These subjects were also adjudicated to their original form results, as described in SI Methods (File S1).
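The collapsing of the 23 survey codes into seven groups can be written as a simple lookup. The sketch below is illustrative only: the code ranges come from the paragraph above and Table 1, the two-letter labels follow the abbreviations of Table S2, and the names `CODE_TO_GROUP` and `collapse` are hypothetical. Code 23 ("Other ethnicity") is deliberately left unmapped, since the text states that such responses were resolved from administrative databases.

```python
# Survey code ranges per collapsed group, as given in the text and Table 1.
_GROUP_CODES = [
    (range(1, 4), "AA"),    # African American, African, Afro-Caribbean
    (range(4, 9), "LT"),    # Latino nationalities
    (range(9, 10), "SA"),   # South Asian
    (range(10, 16), "EA"),  # East Asian nationalities
    (range(16, 19), "PI"),  # Pacific Islander nationalities
    (range(19, 20), "NA"),  # Native American
    (range(20, 23), "EW"),  # white-European, Middle Eastern, Ashkenazi
]

CODE_TO_GROUP = {code: group for codes, group in _GROUP_CODES for code in codes}


def collapse(codes):
    """Map the survey codes endorsed by one participant to the set of
    collapsed race/ethnicity groups; unmapped codes (e.g., 23) are ignored."""
    return frozenset(CODE_TO_GROUP[c] for c in codes if c in CODE_TO_GROUP)
```

For example, a participant endorsing Chinese (10) and Filipino (13) collapses to the single group EA, whereas Mexican (4) plus White European American (20) yields the two groups LT and EW.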

Genotyping and array assignment

To maximize genome-wide coverage of common and less common variants, four custom Affymetrix Axiom arrays (Hoffmann et al. 2011a,b) were designed for individuals of non-Hispanic white (EUR), East Asian (EAS), African American (AFR), and Latino (LAT) race/ethnicity. The number of SNPs varied among arrays, ranging from 674,518 on the EUR array to 893,631 on the AFR array (Hoffmann et al. 2011b). A total of 254,438 SNPs were common to all four arrays. Genotyping was performed at the University of California, San Francisco, and is described elsewhere (Kvale et al. 2015).

The assignment of subjects to arrays was based on the race/ethnicity categories formed as described above (Table S2 and Table S3). Assignments were hierarchical to accommodate individuals reporting multiple racial/ethnic categories. Specifically, individuals reporting any Latino or Native American race/ethnicity/nationality (possibly in combination with other races/ethnicities/nationalities) were assigned to the LAT array, with the exception of individuals who reported African/African American race/ethnicity and Native American race/ethnicity, who were assigned to the AFR array, and individuals reporting East Asian race/ethnicity and Native American race/ethnicity, who were assigned to the EAS array. All other individuals reporting any African, African American, or Afro-Caribbean race/ethnicity, but no Latino race/ethnicity, were assigned to the AFR array. All those reporting any East Asian, but not African, African American, Afro-Caribbean, or Latino race/ethnicity, were assigned to the EAS array. Subjects reporting white-European American, South Asian, Middle Eastern, or Ashkenazi race/ethnicity, but none of the previously mentioned races/ethnicities, were assigned to the EUR array. Therefore, for example, individuals with European and East Asian race/ethnicity were assigned to the EAS array; individuals with African American and East Asian race/ethnicity/nationality were analyzed on the AFR array. The various arrays were designed to allow for the relevant admixture (Hoffmann et al. 2011b).
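One possible reading of this hierarchy is sketched below. It should be taken as an interpretation rather than the authors' assignment algorithm: the function name `assign_array` is hypothetical, groups use the Table S2 abbreviations, and the handling of Pacific Islander responses (placed with EAS) is an assumption based on the counts in Table S2, because the text does not mention that group. The text is also ambiguous for some three-way combinations, which this sketch resolves by checking Latino first.

```python
def assign_array(groups):
    """Return the array (LAT, AFR, EAS, or EUR) for a set of collapsed
    race/ethnicity groups, following one reading of the stated hierarchy."""
    g = set(groups)
    if not g:
        raise ValueError("at least one race/ethnicity group is required")
    if "LT" in g:                 # any Latino response goes to LAT
        return "LAT"
    if "NA" in g:                 # Native American: LAT, with two exceptions
        if "AA" in g:
            return "AFR"
        if g & {"EA", "PI"}:      # PI handling is an assumption (Table S2)
            return "EAS"
        return "LAT"
    if "AA" in g:                 # any African descent, no Latino
        return "AFR"
    if g & {"EA", "PI"}:          # any East Asian, no African or Latino
        return "EAS"
    return "EUR"                  # EW and/or SA only
```

The two worked examples in the text are reproduced by this ordering: European plus East Asian maps to EAS, and African American plus East Asian maps to AFR.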

Quality control

High-quality genotype data for the GERA cohort were obtained by systematic examination and removal of SNP genotypes according to a specific protocol, as described in detail elsewhere (Kvale et al. 2015). For the genetic structure analyses, only SNPs that were common across all four arrays and that had a call rate of 99.5% or higher were considered. This set also excluded SNPs that showed extreme deviation from Hardy-Weinberg equilibrium (P < 10⁻⁵). This resulted in a set of 144,799 high-performing SNPs used in further analyses of population structure and admixture.
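A per-SNP filter of this kind can be sketched as follows. The thresholds are those quoted above; everything else is an assumption made for illustration, in particular the use of a 1-d.f. chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (the text does not say which test was used) and the function names `call_rate`, `hwe_chi2_pvalue`, and `passes_qc`. Genotypes are assumed to be coded 0/1/2, with `None` for a missing call.

```python
import math


def call_rate(genos):
    """Fraction of non-missing genotype calls for one SNP."""
    return sum(g is not None for g in genos) / len(genos)


def hwe_chi2_pvalue(genos):
    """P value of a 1-d.f. chi-square test of Hardy-Weinberg proportions
    (assumed test; an exact test is a common alternative)."""
    obs = [g for g in genos if g is not None]
    n = len(obs)
    counts = (obs.count(0), obs.count(1), obs.count(2))
    p = (2 * counts[2] + counts[1]) / (2 * n)   # frequency of the counted allele
    if p in (0.0, 1.0):
        return 1.0                              # monomorphic: nothing to test
    q = 1.0 - p
    expected = (n * q * q, 2 * n * p * q, n * p * p)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(genos, min_call_rate=0.995, hwe_p_min=1e-5):
    """Keep a SNP only if it meets the call-rate and HWE criteria."""
    if call_rate(genos) < min_call_rate:
        return False
    return hwe_chi2_pvalue(genos) >= hwe_p_min
```

A SNP typed in every subject with genotype counts in exact Hardy-Weinberg proportions passes, whereas a SNP with 1% missing calls, or one at which every subject is called heterozygous, is excluded.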

Principal components analysis

Filtering: Principal components analysis (PCA) was performed using the smartpca program, which is part of the EIGENSOFT 4.2 software package (Patterson et al. 2006). The initial PCA runs were performed separately for individuals genotyped on different arrays. The initial set of 144,799 high-performing SNPs (described above) that were common across all four array types was used in the preliminary analyses. When the HGDP samples were included in subsequent runs and projected onto the GERA principal components (PCs) to facilitate geographic interpretation, 43,988 high-performing SNPs were used. Initial analyses revealed that a number of individuals appeared to be discordant between their genetic ancestry and the array to which they were assigned, and the PCA was rerun after reclassifying these individuals (see SI Methods, File S1).

Table 1. Distribution of responses to survey question on race/ethnicity/nationality, along with proportion female and average ages

Category                                     No.      Proportion female   Mean age (SE)
1  African American                          3117     0.57                60.66 (0.24)
2  African                                   129      0.43                52.90 (1.43)
3  Afro-Caribbean                            119      0.68                56.24 (1.30)
4  Mexican                                   4613     0.56                56.67 (0.22)
5  Central-South American                    1034     0.70                55.34 (0.46)
6  Puerto Rican                              322      0.69                56.68 (0.83)
7  Cuban                                     106      0.71                55.41 (1.42)
8  Other Latino/Hispanic                     1545     0.70                57.41 (0.38)
9  South Asian-Indian/Pakistani              575      0.42                54.58 (0.60)
10 Chinese                                   3433     0.58                56.75 (0.25)
11 Japanese                                  1739     0.61                61.56 (0.34)
12 Korean                                    234      0.66                53.83 (1.04)
13 Filipino                                  1708     0.59                55.59 (0.37)
14 Vietnamese                                317      0.50                53.23 (0.82)
15 Other Southeast Asia                      176      0.64                51.85 (1.10)
16 Native Hawaiian                           144      0.65                58.41 (1.23)
17 Samoan                                    14       0.64                59.36 (3.44)
18 Other Pacific Islander                    132      0.57                53.88 (1.35)
19 Native American Indian/Alaska Native      3884     0.66                61.20 (0.22)
20 White European American                   80,079   0.59                63.27 (0.05)
21 Middle Easterner                          914      0.43                62.18 (0.48)
22 Ashkenazi Jewish                          2399     0.66                62.49 (0.28)
23 Other ethnicity                           75       0.73                56.53 (1.64)

PC projection approach: PCA requires the inversion of a data matrix, which for very large data sets may be computationally challenging. For the East Asian, African American, and Latino subgroups in the GERA data set, the sample sizes were small enough that all subjects within each subgroup were run together. For example, all 7520 East Asian subjects were run together in one PCA. The white-European American sample, however, is very large and required inverting an 80,000 by 80,000 (6.4 billion elements) matrix. Furthermore, the version of the smartpca program used at the time of analyses was not able to analyze the entire European ancestry sample of 83,000 individuals. Therefore, our approach was to select a large but manageable number of subjects on which to perform an initial PCA and then use the resulting SNP loadings to project the remaining subjects.

Because we planned to select a random subset of 20,000 individuals for the initial PCA, onto which the remaining subjects would be projected, we examined the effect of using different subsets by calculating the correlations of the SNP loadings for three different random subsets (Supporting Information, Table S1). The numbers of subjects in the three subsets were the following: 18,677 for set 1, 20,121 for set 2, and 17,691 for set 3. For the first six PCs, there was very good correlation of the SNP loadings for all three pairs of subsets, also suggesting that most of the signal regarding genetic structure is derived from the first six PCs. Given these results, we selected a random set of 20,000 European ancestry subjects and projected the remaining subjects onto the PCs obtained.
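The two operations used in this approach, projecting a new subject onto existing SNP loadings and comparing loadings between subsets, reduce to a dot product and a correlation. The sketch below assumes the loadings and per-SNP means have already been obtained from a reference PCA; the function names `project` and `loading_correlation` are hypothetical. It also omits the per-SNP variance scaling that PCA software for genotype data commonly applies, which is a simplification for illustration.

```python
import math


def project(genotypes, snp_means, loadings):
    """Score of one subject on one PC: centered genotypes (0/1/2)
    multiplied by the SNP loadings from the reference PCA."""
    return sum(w * (g - m) for g, m, w in zip(genotypes, snp_means, loadings))


def loading_correlation(a, b):
    """Magnitude of the Pearson correlation between two loading vectors.
    The absolute value is taken because the sign of a PC is arbitrary."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return abs(sab) / math.sqrt(saa * sbb)
```

Two subsets that recover the same axis of variation with opposite sign would give a loading correlation of 1 under this definition, which is consistent with Table S1 reporting magnitudes.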

Since the SNPs used for the PCA and admixture estimation were common among all four genotyping arrays, it was possible to produce "global" PCA scores for the GERA subjects. Subsets of individuals from the EUR (15,500), AFR (3100), EAS (5600), and LAT (3000) arrays were used for the initial PCA, and the remaining subjects were projected onto these PCs to obtain PC scores for each individual. Table S4 shows the number of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs (File S1).

Genetic ancestry/admixture estimation

To determine individual ancestral admixture proportions in admixed subjects such as African Americans and Latinos (and others), the full maximum-likelihood software package frappe (Tang et al. 2005) was used. In this analysis, individual ancestry proportions are estimated by calculating the probability of a set of genome-wide genotypes in an individual as a weighted average of allele frequencies of putative ancestors, where the weights represent the admixture proportions. In general, the same HGDP population samples described above were used to derive allele frequencies for the ancestral groups.

Relationship determination

Relationships were determined using the software KING_v1.4 (Manichaikul et al. 2010) with the robust version that allows for population substructure. KING provides standard thresholds for characterizing monozygotic twin, parent–child, and sibling relationships, which we followed. In our data, these relationships were clearly separated into distinct clusters. All subjects were included, irrespective of the array type used for their analysis. This analysis was based on the 144,799 high-performing SNPs common across the four arrays, described above.
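As an illustration of threshold-based classification, the sketch below uses the kinship-coefficient cut points given in the KING documentation (0.354, 0.177, 0.0884, 0.0442). The function name is hypothetical, and the IBS0 cutoff used to separate parent–child from full-sib pairs is an assumed illustrative value, not a setting reported in this study.

```python
def classify_pair(kinship, ibs0, ibs0_cutoff=0.005):
    """Assign a relationship label from a kinship estimate and IBS0.

    kinship: estimated kinship coefficient for the pair.
    ibs0: proportion of SNPs at which the pair shares zero alleles.
    ibs0_cutoff: assumed value; parent-child pairs should have IBS0 near 0.
    """
    if kinship > 0.354:
        return "MZ twin or duplicate"
    if kinship > 0.177:
        # First-degree pair: split on IBS0.
        return "parent-child" if ibs0 < ibs0_cutoff else "full sib"
    if kinship > 0.0884:
        return "second degree"
    if kinship > 0.0442:
        return "third degree"
    return "unrelated or more distant"
```

For example, `classify_pair(0.25, 0.0)` gives "parent-child", while `classify_pair(0.25, 0.02)` gives "full sib".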

Results

Distribution of race/ethnicity/nationality categories reported

This multi-ethnic cohort includes representation from a broad distribution of races/ethnicities/nationalities (Table 1). For individuals who reported more than one category, all categories are included; hence the numbers in Table 1 sum to greater than 103,006, the total cohort size. All of the major continents are represented, as are many nationalities/ethnicities. Collapsing the selections into race/ethnicity categories (see Materials and Methods), of the 106,733 total selections, 3365 (3.2%) include an African/African American race/ethnicity, 7620 (7.1%) include a Latino race/ethnicity, 575 (0.5%) include a South Asian race/ethnicity, 7607 (7.1%) include an East Asian race/ethnicity, 290 (0.3%) include a Pacific Islander race/ethnicity, 3884 (3.6%) include a Native American race/ethnicity, and 83,392 (78.1%) include a white-European race/ethnicity. The majority of those endorsing a Latino race/ethnicity are Mexican and Central American, while the largest groups endorsing an East Asian race/ethnicity are Chinese, Japanese, and Filipino. We also examined the sex and age distributions across the different categories (Table 1). Compared to those reporting white-European race/ethnicity, those endorsing African/Afro-Caribbean, Latino, East Asian, and Pacific Islander race/ethnicity are younger. With the exception of those reporting Mexican nationality, the Latino groups tend to have a higher proportion of females, as do those reporting Ashkenazi Jewish ethnicity; those reporting South Asian and Middle Eastern nationalities have a lower proportion of females.
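The percentages quoted above are shares of the total number of selections rather than of the number of participants. A short check of that arithmetic, using only the counts given in the text (the dictionary keys are labels chosen here):

```python
# Counts of selections per collapsed race/ethnicity category, from the text.
selections = {
    "African/African American": 3365,
    "Latino": 7620,
    "South Asian": 575,
    "East Asian": 7607,
    "Pacific Islander": 290,
    "Native American": 3884,
    "White-European": 83392,
}

total = sum(selections.values())  # total selections, not total participants
shares = {name: round(100 * count / total, 1) for name, count in selections.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}% of {total} selections")
```

Because a participant can contribute more than one selection, the denominator exceeds the cohort size of 103,006.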

Structure of individuals run on the EUR array

Individuals who self-reported Ashkenazi, Middle Eastern, and non-Hispanic white or European race/ethnicity, but no other ethnicities, were run on the EUR array and analyzed together. The initial analysis showed, as expected, a clear Ashkenazi cluster and a larger cluster depicting the northwest–southeast European cline (Price et al. 2008; Tian et al. 2008c). Figure S1A shows those who self-reported a single ethnicity/nationality, while Figure S1B shows individuals who self-reported more than one. It is evident that endorsement of more than one ethnicity can imply mixed genetic ancestry, but not automatically. Comparing Figure S1, A and B, we observe a higher proportion of individuals with mixed genetic ancestry among those who endorsed both Ashkenazi and European or Middle Eastern ethnicity; however, we still observe a large proportion of nonadmixed individuals, suggesting that endorsement of Ashkenazi and European may reflect a joint perception of ethnicity and continent of origin. By contrast, in Figure S1A we observe a substantial number of individuals who appear to have Ashkenazi and European admixture but self-reported a single category only (most often European).

A similar observation can be made about those endorsing Middle Eastern ethnicity, where those endorsing that as a sole response appear to have more Middle Eastern genetic ancestry, while those endorsing Middle Eastern and European ethnicity show more evidence of European genetic ancestry. However, in Figure S1A we also observe substantial numbers of individuals reporting only European ethnicity whose genetic ancestry appears to be Middle Eastern, and vice versa. Again, these reports may reflect recent geographic origin as well as nationality/ethnicity.

We also repeated the PC analysis after removing the Ashkenazi and part-Ashkenazi subjects. The PC scores for the Ashkenazi subjects were then derived by projecting their genotypes onto the resulting PCs. Individuals reporting a single ethnicity/nationality are depicted in Figure S2A, while those endorsing more than one are displayed in Figure S2B. The first PC corresponds to a northwest–southeast cline through Europe and the Middle East, and the second PC corresponds to a southwest–northeast cline within Europe, as has been observed in numerous previous studies (Menozzi et al. 1978; Sokal et al. 1991; Cavalli-Sforza et al. 1993; Cavalli-Sforza et al. 1996; Barbujani and Bertorelle 2001; Belle et al. 2006; Seldin et al. 2006; Bauchet et al. 2007; Novembre et al. 2008; Price et al. 2008; Tian et al. 2008c). The first and second PCs account for 31.9% and 13.4% of the total variance of the first 10 PCs, respectively.
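The variance figures are expressed relative to the first 10 PCs only, not to all PCs. A minimal sketch of that normalization, with a hypothetical function name and made-up eigenvalues in the usage line:

```python
def variance_shares(eigenvalues, n_top=10):
    """Share of variance per PC, relative to the first n_top PCs only.

    eigenvalues: PCA eigenvalues in decreasing order.
    """
    top = list(eigenvalues)[:n_top]
    total = sum(top)
    return [ev / total for ev in top]

# Illustrative eigenvalues only; these are not values from the study.
example = variance_shares([5.0, 3.0, 2.0, 1.0], n_top=3)
```

Because the denominator is restricted to the leading PCs, these shares are larger than the corresponding fractions of total genome-wide variance would be.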

Subjects who self-identified as South Asian were also run on the EUR array and subjected to a separate PCA. For these subjects, to characterize the observed PCs and the relationship to geographic ancestry, we employed onomastics. In particular, we analyzed surnames to characterize individuals based on surname geographic region of origin. These subjects are mainly of Indian origin, and the clusters formed in the PCA depict subgroups from different regions of India (Figure S3). The first PC accounts for 19.1% of the total variance of the first 10 PCs, and the second PC accounts for 10.0%. The analysis also shows that northern Indians are genetically closer to Europeans (Reich et al. 2009) and eastern Indians are genetically more similar to East Asian populations. As expected, those reporting European as well as South Asian ethnicity are positioned closer in the diagram to the HGDP Europeans.

Structure of individuals run on the EAS array

Individuals run on the EAS array included subjects self-reporting European and East Asian race/ethnicity and those reporting solely East Asian race/ethnicity. The first PC for these individuals (Figure S4, A and B) is responsible for clustering of individuals with different East Asian–European ancestry proportions (mostly 50% or 75% European). Those with genetic ancestry that is both East Asian and European are most clearly observed in Figure S4B among those self-reporting both races/ethnicities, and there are very few GERA individuals in this figure who do not have mixed genetic ancestry. Among individuals reporting only an East Asian nationality (Figure S4A), the large majority have only East Asian genetic ancestry; however, there are also individuals who appear to have mixed East Asian–European genetic ancestry who self-reported only their East Asian nationality. Of particular interest is the continuous nature of a modest amount of European genetic ancestry in self-identified Filipinos, consistent with older European admixture. The second PC corresponds to the north-to-south cline in East Asia (Su et al. 1999; Tian et al. 2008b; HUGO Pan-Asian SNP Consortium 2009), and the distinct clusters observed that represent different East Asian nationalities are consistent with extensive endogamy in these groups. The first and second PCs account for 59.71% and 20.39% of the total variance of the first 10 PCs, respectively.

Individuals endorsing a Pacific Islander ethnicity are displayed in Figure S5. Those also reporting an East Asian ethnicity appear to cluster more closely to the HGDP East Asians, while those also reporting European ethnicity appear to cluster more closely to the HGDP Europeans. While those reporting Hawaiian and Samoan ethnicity are reasonably well separated from both the HGDP Europeans and East Asians, some individuals who identified as "other Pacific Islander" appear to overlap quite closely with the HGDP East Asians. Also of interest, another subgroup of "other Pacific Islanders" appears to form its own cluster at the bottom of Figure S5. We note that a number of these individuals self-reported both Pacific Islander and South Asian ethnicity. Based on onomastics, these individuals have Indian surnames and are likely to be Indo-Fijians. Approximately 37.5% of the population of Fiji is of Indian origin, according to the 2007 census (http://www.statsfiji.gov.fj). The observation that some Pacific Islanders cluster near the East Asians is also an indication that clear separation of genetic ancestry for these groups is likely to be challenging.

Structure of individuals run on the AFR array

Subjects run on the AFR array revealed, as expected, extensive African and European genetic ancestry (Figure S6, A and B) (Parra et al. 1998; Fernandez et al. 2003; Tang et al. 2006; Tishkoff et al. 2009; Zakharia et al. 2009). The first PC, which accounts for 63.8% of the total variance of the first 10 PCs, reflects African vs. European genetic ancestry, while the second PC denotes East Asian and/or Native American genetic ancestry. This is consistent with the array assignments, whereby individuals reporting both African/African American race/ethnicity and East Asian or Native American race/ethnicity were assigned to the AFR array. Individuals who self-reported African ancestry only were also subject to onomastics to determine likely countries of origin. We were able to identify subjects of Ethiopian, Eritrean, and Kenyan nationality. For the Kenyans, Figure S6A indicates a location consistent with 100% African genetic ancestry. By contrast, the Ethiopian/Eritrean subjects occupy an intermediate position on the PC1 axis, suggesting proximity to European/Middle Eastern populations. Also of note is the modest variation in their PC1 scores. This is likely due to ancient admixture with Middle Eastern populations (Hodgson et al. 2014). These results confirm that Ethiopians have a unique genetic structure among African populations.

Individuals self-reporting mixed African and East Asian race/ethnicity generally reflect that admixture from the genetic perspective as well (Figure S6B); however, a number of individuals who reported only African American ethnicity also appear to have similar levels of East Asian admixture (Figure S6A). Those reporting both African American and European ethnicity generally occupy a position on the PC1 axis closer to Europeans than those who do not (Figure S6B).

The mean African ancestry proportion in this sample is 73.6% ± 17.4%. There is a reasonably high level of variation in the African genetic ancestry proportion, ranging from 10.6% to 100%.
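Ancestry proportions such as these come from a likelihood in which each allele is drawn with a frequency that is the admixture-weighted average of the ancestral allele frequencies. The sketch below writes out that likelihood for one individual and, for two ancestral populations only, maximizes it with a simple grid search. The grid search is a stand-in chosen for brevity, not the algorithm used by frappe, and all names and inputs are hypothetical.

```python
import math

def log_likelihood(genotypes, freqs, q):
    """Log-likelihood of one individual's genotypes given proportions q.

    genotypes: allele counts (0, 1, or 2), one per SNP.
    freqs: per-SNP tuples of allele frequencies, one per ancestral population.
    q: admixture proportions, one per ancestral population, summing to 1.
    """
    ll = 0.0
    for g, f_m in zip(genotypes, freqs):
        p = sum(qk * fk for qk, fk in zip(q, f_m))   # weighted average
        p = min(max(p, 1e-9), 1.0 - 1e-9)            # guard against log(0)
        ll += math.log(math.comb(2, g)) + g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return ll

def grid_estimate_two_way(genotypes, freqs, steps=100):
    """Proportion from population 0 that maximizes the likelihood on a grid."""
    best = max(
        range(steps + 1),
        key=lambda i: log_likelihood(genotypes, freqs, (i / steps, 1.0 - i / steps)),
    )
    return best / steps
```

With more than two ancestral populations a grid becomes impractical, which is one reason dedicated iterative estimators are used in practice.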

Structure of individuals run on the LAT array

Latinos may have ancestry deriving from multiple continents, including Europe, Africa, Asia, and the Americas (Bonilla et al. 2004; Tang et al. 2006, 2007). Figure S7A provides the PCA results for all those who endorsed Latino or Native American as their sole race/ethnicity. PC1 represents the European vs. Native American axis of genetic variation, and PC2 represents the African axis of genetic variation. PC1 and PC2 account for 70.95% and 11.57% of the total variance of the first 10 PCs, respectively. Nearly all Latinos show evidence of European/West Asian genetic ancestry, and a substantial subset also show evidence of African genetic ancestry. Similarly, all individuals self-reporting Native American race/ethnicity show some degree of European/West Asian genetic ancestry. Latinos of different nationalities exhibit varying proportions of European, African, and Native American ancestries (Figure S7B). Those reporting Mexican and Central-South American nationality have genetic ancestry that is primarily European and Native American, with slight but varying amounts of African ancestry. Those reporting Cuban nationality have primarily European genetic ancestry, with a small number of individuals having primarily African genetic ancestry. Those reporting Puerto Rican nationality show some Native American genetic ancestry but are primarily admixed between European and African genetic ancestry. Individual ancestral admixture proportions were determined for these subjects and are provided in Table S5.

The LAT array also included a variety of individuals who self-reported more than one race/ethnicity. These individuals are represented in Figure S7C. Individuals who reported European as well as Latino race/ethnicity tend to have slightly more European genetic ancestry than those who did not; similarly, a number of individuals who reported African/African American race/ethnicity in addition to Latino race/ethnicity have substantial African genetic ancestry; however, many such individuals also appear to have the same modest degree of African genetic ancestry as those who reported only a Latino race/ethnicity. Those who reported Native American race/ethnicity in addition to Latino race/ethnicity also appear to have slightly increased Native American genetic ancestry. Those who reported European and Native American race/ethnicity appear to be similar to those who solely reported Native American race/ethnicity: all have European/West Asian genetic ancestry, and while some show evidence of Native American genetic ancestry, European/West Asian is the sole or primary genetic ancestry for the majority. For those with 100% European genetic ancestry and who self-reported only European and Native American race/ethnicity (n = 2155), we also calculated European PCs. Finally, those who reported East Asian in addition to Latino race/ethnicity generally have evidence of East Asian genetic ancestry (as observed in Figure S7C by proximity to the HGDP East Asians), ranging from 25% to 50% and 100%.

Global PCA for GERA subjects

Figure S8 shows that the first PC mainly separates Europeans from East Asians (and Native Americans), and PC2 separates Africans from all the other groups. PC3 seems to separate Native Americans from the other groups, and PC4 also separates Native Americans from the other groups but also shows some separation among the Europeans. PC5 separates the different East Asian groups (mainly north vs. south) and also East Asians from Oceania, and PC6 separates Central–South Asians from the other groups. PC7 again separates the various East Asian regions, and PC8 separates the European groups (mainly north to south). PC9 and PC10 separate East Asians from Oceania, but also the Russians (not labeled) are separated from the other European groups.

Relationship between self-reported race/ethnicity and genetic ancestry

Table S6 displays the full relationship of self-reported race/ethnicity to genetic ancestry for the six continental genetic ancestries of Europe/West Asia, Africa, East Asia, Pacific Islands, South Asia, and the Americas. A genetic continental ancestry was assigned to an individual if her/his estimate for that ancestry was at least 5%. A total of 91,502 individuals (93.9%) reported a single race/ethnicity, 5475 individuals reported two races/ethnicities (5.9%), and 512 individuals (0.5%) reported three races/ethnicities (Table 2). As expected, all individuals who self-identified as European/West Asian had evidence of European/West Asian genetic ancestry. The next largest genetic ancestry component in this group was South Asian (4.3%), primarily attributable to individuals of West Asian ethnicity. Because there is a continuum of genetic ancestry from Europe to West Asia, Central-South Asia to East Asia, genetic overlap exists for individuals whose national origins are geographically between these divisions (Li et al. 2008). Nearly 1% of this group also had evidence of Native American genetic ancestry, while a smaller fraction had evidence of African or East Asian genetic ancestry (0.3% and 0.4%, respectively). Nearly all individuals (99.7%) self-reporting African/African American race/ethnicity had evidence of African genetic ancestry; 91% also had evidence of European genetic ancestry, consistent with broad European admixture among African Americans. Native American and East Asian genetic ancestry occurred in this group at a similar low level as observed in the Europeans/West Asians (1.3% and 0.5%, respectively). Among self-reported East Asians, all had evidence of East Asian genetic ancestry; a sizable proportion (21.7%) also had evidence of Pacific Islander genetic ancestry, but this likely represents difficulty in differentiating East Asian and Pacific Islander genetic ancestry. A modest subgroup (3.4%) had evidence of European/West Asian genetic ancestry (the majority are self-reported Filipinos), while small proportions had evidence of African or Native American genetic ancestry (0.1% and 0.5%, respectively). Among the Latinos, nearly all had evidence of European/West Asian genetic ancestry; a similar high proportion (94.2%) had evidence of Native American genetic ancestry, and an additional 27.7% had evidence of African ancestry. A substantial number of self-reported Pacific Islanders had evidence of East Asian genetic ancestry (91.3%) in addition to Pacific Islander genetic ancestry (66.3%); these results are again likely due to close genetic similarity between East Asians and Pacific Islanders. There is also evidence of substantial European/West Asian and South Asian genetic ancestry in this group (57.6% and 26.1%, respectively). The former reflects a high rate of European admixture among some self-reported Pacific Islander groups, while the latter likely reflects Fijians of Indian origin. Most self-reported South Asians have evidence of South Asian genetic ancestry; a substantial proportion also has evidence of European or East Asian genetic ancestry, likely due to inability to cleanly separate South Asian genetic ancestry from West Asian or East Asian (Li et al. 2008). Among those reporting Native American race/ethnicity, 14.4% have evidence of Native American genetic ancestry, and all have evidence of European/West Asian genetic ancestry.
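The 5% assignment rule and the group-level proportions reported in Table 2 can be sketched as follows. The function names and the example estimates are hypothetical; only the cutoff of 0.05 comes from the text.

```python
def assigned_ancestries(estimates, cutoff=0.05):
    """Ancestries whose estimated proportion is at least the cutoff.

    estimates: dict mapping ancestry label to estimated proportion.
    """
    return {name for name, frac in estimates.items() if frac >= cutoff}

def proportion_with(ancestry, group, cutoff=0.05):
    """Fraction of individuals in a group assigned the given ancestry."""
    hits = sum(ancestry in assigned_ancestries(est, cutoff) for est in group)
    return hits / len(group)

# Made-up estimates for three individuals from one self-report group.
toy_group = [
    {"EW": 0.62, "NA": 0.36, "AF": 0.02},
    {"EW": 0.97, "NA": 0.03, "AF": 0.00},
    {"EW": 0.40, "NA": 0.50, "AF": 0.10},
]
```

Here `proportion_with("NA", toy_group)` counts the first and third individuals only, since the second falls below the cutoff.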

For those with missing or mis-scanned self-reported race/ethnicity and whose race/ethnicity was derived from KP administrative databases (Table 3 and Table S7), results align closely with those in Table 2. For individuals self-reporting two or three races/ethnicities, the correspondence between the self-report and genetic ancestry is generally quite high (Table 2).

We also observed a decrease in average age and increasing proportion of females with the number of different race/ethnicity/ancestry groups reported (Table 2). While the different minority groups, and in particular the self-reported East Asians and Latinos, are younger on average, those reporting mixed race/ethnicity are even younger. These patterns likely reflect increasing exogamy over time. As expected, these patterns are also reflected in the genetic PC scores, where, for example, the proportion of mixed East Asian/European genetic ancestry increases with decreasing age. The excess of females among those reporting mixed race/ethnicity appears to reflect a reporting preference, as there was no significant difference in the proportion of individuals with mixed genetic ancestry by sex.

A more in-depth examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups is provided in Table S8.

Relatives

We were able to clearly identify first-degree relative (parent–child and full sib) and MZ twin pairs and categorized them based on self-reported race/ethnicity (Figure S9 and Table S9). We also observed thousands of likely second- and third-degree relatives (Figure S9); however, the figure also indicates substantial overlap between these groups based on kinship estimates.

The 34 MZ pairs, who are perfectly concordant for genetic ancestry, are also perfectly concordant for self-reported race/ethnicity. Sib pairs are also (virtually) identical for genetic ancestry. We identified a total of 2018 sib pairs, 1936 (96%) of whom are concordant for self-reported race/ethnicity. Among the 82 discordant pairs, the majority (n = 66) involve pairs where one self-reports Native American or Latino race/ethnicity (solely or in combination with European/West Asian race/ethnicity), while the other reports only European/West Asian race/ethnicity (Table S10); in most of these cases, the genetic ancestry is solely European/West Asian, although in some there is also evidence of Native American genetic ancestry. A modest number of pairs are also discordant in their reports of East Asian race/ethnicity, and again, for most of these the genetic ancestry is solely European/West Asian. Similarly, a few pairs with mixed genetic ancestry, including African, are discordant in terms of self-reporting of African American race/ethnicity.

We identified 3741 parent–child pairs, of which 3478 (93%) were concordant for self-identified race/ethnicity. The lower rate of concordance compared to the sib pairs is not surprising, as parent and child reports may differ if the child's parents are of different race/ethnicity. In 116 of 263 discordant pairs (Table S11), the child has genetic ancestry that her/his parent does not (Native American in 69 cases, East Asian in 41 cases, and African in 11 cases), and this difference is reflected in the self-report, where the child is self-reporting a race/ethnicity that the parent is not. By contrast, in only 9 cases did the parent have a genetic ancestry that the child did not, and in 8 of these 9 cases the parent has a low level of Native American ancestry (but ≥5%), whereas the child is below our 5% threshold. Interestingly, in 5 of these cases the parent self-reports as Latino race/ethnicity but the child does not, whereas the opposite is true in 3 of the 8 cases. In an additional 114 cases, the genetic information for parent and child matches, but the self-reports for race/ethnicity are different. The largest subgroup (49) of these cases reflects differences in the reporting of Native American or Latino race/ethnicity, and in 47 of these there is no evidence of Native American genetic ancestry in the parent or child; it is approximately equally split as to whether the parent or child reports the Native American race/ethnicity. Among 53 cases where parent and child are discordant for self-report of Latino race/ethnicity, in 23 it is the child who self-reports Latino race/ethnicity, whereas the parent does not. There are 11 cases of discordance for self-report of East Asian race/ethnicity, and in nearly all of them there is no evidence of continental East Asian genetic ancestry. In slightly more than half of these cases, it is the parent who self-reports East Asian race/ethnicity.
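The concordance rates for sib pairs and parent–child pairs follow directly from the pair counts given above. A short check, with a hypothetical helper name:

```python
def concordance(n_pairs, n_concordant):
    """Return (concordance rate, number of discordant pairs)."""
    return n_concordant / n_pairs, n_pairs - n_concordant

# Counts taken from the text.
sib_rate, sib_discordant = concordance(2018, 1936)
pc_rate, pc_discordant = concordance(3741, 3478)

print(f"full sibs: {sib_rate:.1%} concordant, {sib_discordant} discordant pairs")
print(f"parent-child: {pc_rate:.1%} concordant, {pc_discordant} discordant pairs")
```

The discordant counts recovered this way (82 and 263) match the numbers of discordant pairs broken down in the two paragraphs above.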

Discussion

The RPGEH GERA cohort provides an excellent opportunity to characterize a large, representative northern California population from the perspectives of self-reported race/ethnicity/nationality and genetic ancestry. Overall, the cohort is 80.8% non-Hispanic white and 19.2% minority and includes a broad spectrum of races/ethnicities/nationalities. The results of our PC analyses to characterize genetic structure within each of the major race/ethnicity groups are largely consistent with prior reports.

For the non-Hispanic white individuals, we see a broad spectrum of genetic ancestry, ranging from northern Europe to southern Europe and the Middle East. Within that large group, with the exception of Ashkenazi Jews, we see little evidence of distinct clusters. This is consistent with considerable exogamy within this group. By comparison, we do see structure in the East Asian population correlated with nationality, reflecting continuing endogamy for these nationalities and also recent immigration. On the other hand, we did observe a substantial number of individuals who are admixed between East Asian and European ancestry, reflecting 10% of all those reporting East Asian race/ethnicity. The majority of these reflected individuals with one East Asian and one European parent or one East Asian and three European grandparents. In addition, we noted that for self-reported Filipinos, a substantial proportion have modest levels of European genetic ancestry, reflecting older admixture.

Table 2 Proportion of individuals with genetic ancestry from each of six ancestral populations by self-reported race/ethnicity

Genetic ancestry is given in columns EW through SA.

| Race/ethnicity | n | EW | AF | EA | NA | PI | SA | Proportion female | Mean age (SE) |
|---|---|---|---|---|---|---|---|---|---|
| One group | 91,502 | | | | | | | 0.59 | 62.92 (0.04) |
| EW | 76,401 | 1.000 | 0.003 | 0.004 | 0.009 | 0.000 | 0.043 | 0.59 | 63.71 (0.05) |
| AA | 2679 | 0.910 | 0.997 | 0.005 | 0.013 | 0.000 | 0.021 | 0.57 | 61.28 (0.25) |
| EA | 6389 | 0.034 | 0.001 | 1.000 | 0.005 | 0.217 | 0.008 | 0.58 | 58.51 (0.18) |
| NA | 674 | 0.999 | 0.022 | 0.022 | 0.144 | 0.000 | 0.037 | 0.55 | 64.34 (0.51) |
| LT | 4807 | 0.999 | 0.277 | 0.008 | 0.942 | 0.000 | 0.024 | 0.58 | 57.92 (0.21) |
| PI | 92 | 0.576 | 0.000 | 0.913 | 0.000 | 0.663 | 0.261 | 0.48 | 56.89 (1.49) |
| SA | 460 | 0.307 | 0.007 | 0.109 | 0.004 | 0.050 | 0.961 | 0.39 | 54.29 (0.67) |
| Two groups | 5476 | | | | | | | 0.67 | 57.37 (0.19) |
| EW, AA | 123 | 1.000 | 0.976 | 0.024 | 0.033 | 0.000 | 0.081 | 0.67 | 52.76 (1.50) |
| EW, EA | 572 | 0.960 | 0.005 | 0.942 | 0.014 | 0.063 | 0.080 | 0.68 | 49.13 (0.65) |
| EW, NA | 2548 | 1.000 | 0.008 | 0.007 | 0.096 | 0.000 | 0.024 | 0.68 | 61.63 (0.26) |
| EW, LT | 1564 | 1.000 | 0.071 | 0.010 | 0.710 | 0.000 | 0.068 | 0.68 | 54.05 (0.38) |
| EW, PI | 48 | 1.000 | 0.000 | 0.813 | 0.042 | 0.625 | 0.021 | 0.79 | 59.64 (2.00) |
| EW, SA | 44 | 0.955 | 0.000 | 0.068 | 0.045 | 0.000 | 0.682 | 0.66 | 53.55 (2.26) |
| AA, EA | 29 | 0.655 | 0.931 | 0.828 | 0.034 | 0.000 | 0.069 | 0.56 | 50.06 (2.46) |
| AA, NA | 99 | 1.000 | 0.99 | 0.000 | 0.051 | 0.000 | 0.030 | 0.68 | 59.67 (1.30) |
| AA, LT | 114 | 0.991 | 0.596 | 0.018 | 0.754 | 0.000 | 0.026 | 0.34 | 55.09 (1.42) |
| AA, SA | 13 | 0.167 | 0.167 | 0.167 | 0.083 | 0.250 | 0.833 | 0.17 | 54.33 (4.23) |
| EA, LT | 95 | 0.789 | 0.042 | 0.926 | 0.642 | 0.063 | 0.000 | 0.67 | 56.07 (1.44) |
| EA, PI | 40 | 0.275 | 0.025 | 1.000 | 0.000 | 0.475 | 0.025 | 0.60 | 56.93 (2.37) |
| EA, SA | 17 | 0.059 | 0.000 | 0.765 | 0.000 | 0.059 | 0.235 | 0.47 | 62.06 (2.88) |
| NA, LT | 129 | 1.000 | 0.140 | 0.031 | 0.953 | 0.000 | 0.047 | 0.68 | 58.22 (1.19) |
| LT, PI | 12 | 1.000 | 0.417 | 0.250 | 0.917 | 0.000 | 0.167 | 0.64 | 53.93 (3.95) |
| LT, SA | 10 | 0.600 | 0.000 | 0.400 | 0.600 | 0.200 | 0.500 | 0.63 | 61.50 (4.56) |
| Three groups | 512 | | | | | | | 0.70 | 53.52 (0.75) |
| EW, AA, NA | 115 | 0.991 | 0.991 | 0.000 | 0.043 | 0.000 | 0.017 | 0.74 | 59.71 (1.58) |
| EW, AA, LT | 23 | 0.957 | 0.696 | 0.043 | 0.522 | 0.000 | 0.087 | 0.52 | 50.09 (4.11) |
| EW, EA, NA | 32 | 0.969 | 0.000 | 0.875 | 0.250 | 0.000 | 0.125 | 0.69 | 46.06 (3.11) |
| EW, EA, LT | 48 | 1.000 | 0.041 | 0.857 | 0.490 | 0.000 | 0.061 | 0.72 | 45.98 (2.49) |
| EW, EA, PI | 35 | 0.943 | 0.000 | 1.000 | 0.029 | 0.486 | 0.000 | 0.67 | 51.92 (3.02) |
| EW, NA, LT | 198 | 1.000 | 0.066 | 0.000 | 0.803 | 0.000 | 0.086 | 0.70 | 53.83 (0.99) |

Only those with at most three self-reported race/ethnicities and three genetic ancestries are included; race/ethnicity categories with at least 10 members are shown. For individuals self-reporting two or three races/ethnicities, the correspondence between self-report and genetic ancestry is generally quite high. For example, for those reporting European/West Asian and East Asian race/ethnicity, 96% and 94% have evidence of European/West Asian and East Asian genetic ancestry, respectively; for those reporting African/African American and East Asian race/ethnicity, 93.1% and 82.8% have evidence of African and East Asian genetic ancestry, while 65.5% have evidence of European/West Asian genetic ancestry. Among those reporting European/West Asian and Native American race/ethnicity, 9.6% have evidence of Native American genetic ancestry; for those reporting African/African American and Native American race/ethnicity, 5.1% have evidence of Native American genetic ancestry. EW, European/West Asian; AA, African/African American/Afro-Caribbean; EA, East Asian; NA, Native American/Alaska Native; LT, Latino; PI, Pacific Islander; SA, South Asian. Genetic ancestry abbreviations are the same except for AF, which represents sub-Saharan African ancestry.

As expected, most self-reported African Americans show some degree of European genetic ancestry, with an overall average of 26%. Among individuals self-reporting as African American and East Asian, all showed evidence of genetic ancestry from three continents: Africa, Europe/West Asia, and East Asia.

Latinos are the most complex from a genetic perspective, as they can possess genetic ancestry from essentially any of the major continents. Most of the Latinos in our study derive from Mexico and Central/South America, with smaller proportions from Puerto Rico and Cuba. These individuals have varying proportions of Native American, European, and African genetic ancestry. We also found evidence of East Asian genetic ancestry in some individuals, but these were primarily individuals who self-reported both East Asian and Latino nationalities.

Of note, 17% of the cohort had evidence of genetic ancestry from more than one continent. However, this does not mean that all or even most of these individuals represent recent continental admixture. As has been true in other analyses (Li et al. 2008), genetic similarity between West Asians and South Asians (and to some degree South Asians and East Asians) did not allow for a clear distinction among these genetic ancestries. As such, while some individuals were estimated to have South Asian genetic ancestry, this more likely reflects the difficulty in demarking West Asian vs. South Asian genetic ancestry. A similar situation holds for Pacific Islanders and East Asians, where we and others have shown strong genetic similarity for some Pacific Islander groups with East Asians. Also, some individuals may have reported more than a single race/ethnicity that may reflect recent country of origin in addition to or rather than more distant ancestry, with Indo-Fijians as one example.

If we include only individuals with genetic admixture from nonadjacent continents, the proportion with continental admixture is 12%. However, we also note that this fraction depends on our cutoff of 5% for defining genetic admixture, as well as some imprecision in the admixture estimation. Of course, a lower threshold would increase the proportion of the cohort that is considered to be genetically admixed, while a higher threshold would do the opposite.
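The dependence on the cutoff can be illustrated with a toy calculation. The function name and the ancestry estimates below are made up, and the sketch counts any two ancestries above the cutoff; it does not implement the nonadjacent-continent restriction used for the 12% figure.

```python
def fraction_admixed(cohort, cutoff):
    """Fraction of individuals with two or more ancestries at or above cutoff.

    cohort: list of dicts mapping ancestry label to estimated proportion.
    """
    def n_ancestries(estimates):
        return sum(frac >= cutoff for frac in estimates.values())
    return sum(n_ancestries(est) >= 2 for est in cohort) / len(cohort)

# Made-up estimates for four individuals (illustration only).
toy_cohort = [
    {"EW": 0.98, "AF": 0.02},
    {"EW": 0.93, "EA": 0.07},
    {"EW": 0.50, "EA": 0.50},
    {"AF": 0.80, "EW": 0.20},
]

for cutoff in (0.01, 0.05, 0.10):
    print(cutoff, fraction_admixed(toy_cohort, cutoff))
```

In this toy cohort the admixed fraction falls from 1.0 at a 1% cutoff to 0.75 at 5% and 0.5 at 10%, which is the direction of change described above.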

As expected in a large cohort such as this, we were easily able to identify a substantial number of close relatives, specifically 34 identical twins, 2018 full sibs, and 3741 parent–child pairs. We also had clear evidence of a large number of likely second- and third-degree relatives, but these kinship groups did not separate clearly from each other. More refined methods may be able to provide more precise kinship estimates.

A major goal was to examine the relationship between self-reported race/ethnicity and genetic ancestry. By and large, there was very high correspondence between the two, allowing for the broad range of genetic ancestry that exists among African Americans and Latinos. We were also able to compare the self-report data of identical twins, parent–child pairs, and sib pairs. All MZ twin pairs were concordant, as were most of the sib pairs. However, we did note that for some sib pairs, the self-report data differed. For the majority of these, the discordance related to reporting of Native American or Latino race/ethnicity.

The results obtained here are important for the study ofcomplex genetic disease in this large population-basedcohort through association studies admixture analysis andadmixture mapping and in particular for investigatingobserved ethnic variation in diseases and traits As describedpreviously (Risch et al 2002 Tang et al 2006) the strongcorrespondence also observed here between the social cat-egories of raceethnicity and genetic ancestry makes dissec-tion of racialethnic differences challenging The patternsthat we observed reflect historical and recent mating prac-tices and their impact on genetic variation On a global levelgeography continues to create strong local endogamy whichis also reflected among the recent US migrant populationsHowever the increasing frequency of interracial individualsthat we observed in this cohortmdasha reflection of increasingexogamymdashwill enhance both the complexity of such analy-ses and the opportunities to investigate the genetic and en-vironmental contributors to racialethnic differences Whilethe advent of myriad genetic markers can provide accurateestimates of individualsrsquo genetic ancestry the social aspectsof raceethnicity may be more challenging to characterizeFor example in our study considering the various combina-tions of 7 raceethnicity categories that an individual could

Table 3. Proportion of individuals with genetic ancestry from each of six ancestral populations by race/ethnicity, as determined by KP administrative databases

                           Genetic ancestry
Race/ethnicity        n    EW     AF     EA     NA     PI     SA
White              4575    1      0.007  0.009  0.017  0.001  0.030
African American    102    0.941  0.990  0.000  0.020  0.000  0.020
Asian               311    0.106  0.003  0.952  0.006  0.167  0.074
Latino              255    0.988  0.192  0.043  0.816  0.000  0.035
Other/uncertain      84    0.929  0.131  0.357  0.167  0.071  0.083

Abbreviations are the same as in Table 2.

Population Structure of GERA Cohort 1293

endorse, we observed 50 different combinations, and this does not include individuals who endorsed more than 3 (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health, and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment, and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans' genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch, N., E. Burchard, E. Ziv, and H. Tang, 2002 Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 3: comment2007.

Seldin, M. F., R. Shigeta, P. Villoslada, C. Selmi, J. Tuomilehto et al., 2006 European population substructure: clustering of northern and southern populations. PLoS Genet. 2: e143.

Sokal, R. R., N. L. Oden, and C. Wilson, 1991 Genetic evidence for the spread of agriculture in Europe by demic diffusion. Nature 351: 143–145.

Su, B., J. Xiao, P. Underhill, R. Deka, W. Zhang et al., 1999 Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65: 1718–1724.

Tang, H., J. Peng, P. Wang, and N. J. Risch, 2005 Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28: 289–301.

1294 Y Banda et al

Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray


GENETICS Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America. DOI: 10.1534/genetics.115.178616

Figure S1. PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest-southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2. PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3. PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4. PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5. PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6. PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7. PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8. Global PCA of GERA subjects. (A-F) Individuals distributed according to continental differentiation. Admixed individuals also observed.

Figure S9. Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry and a smaller number (123) to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self-reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned: a black mark on the glass scanning plate overlaid the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error affected only individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. On review of the race/ethnicity self-report of these individuals, the large majority had self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those with self-report race/ethnicity data that were not missing or adjudicated due to scanning errors. Because of the time required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals had been assigned to arrays based on raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was used only in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.

For the initial individual admixture analyses, a set of 43,988 SNPs, which were common between the HGDP dataset and our set of 144,799 high-quality SNPs, was used. A set of 38,301 SNPs (used for the global PCA run), remaining after LD pruning and removal of SNPs located in regions with structural variations, was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.
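
The pairwise pruning rule described in this section can be illustrated with a short sketch. It is not the authors' code: the greedy left-to-right order, the choice to drop the later SNP of each offending pair, and the function name `ld_prune` are assumptions made for illustration. Only the two thresholds (r² above 0.5, distance within 5 Mb) come from the text.

```python
import numpy as np

def ld_prune(genotypes, positions, r2_max=0.5, window_bp=5_000_000):
    """Greedy LD pruning for one chromosome (illustrative sketch).

    genotypes: (n_individuals, n_snps) array coded 0/1/2.
    positions: base-pair position of each SNP.
    Returns the indices of the SNPs that are kept, in position order.
    """
    order = np.argsort(positions)
    kept = []
    for j in order:
        keep_j = True
        # Walk back through kept SNPs, nearest first, until outside the window.
        for i in reversed(kept):
            if positions[j] - positions[i] > window_bp:
                break
            r = np.corrcoef(genotypes[:, i], genotypes[:, j])[0, 1]
            if r * r > r2_max:
                keep_j = False  # assumption: drop the later SNP of the pair
                break
        if keep_j:
            kept.append(int(j))
    return kept
```

With this greedy scheme the result depends on the scan order; the text does not say which member of a correlated pair was removed, so a different but equally valid SNP set is possible.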

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority, the second continent was South Asia. Hence this likely does not reflect recent admixture but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent: a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, for nearly all, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.
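
Counting "continents" per individual presupposes a rule for deciding when an ancestry is present. The caption of Table S6 states that an ancestry was assigned when at least 5% of an individual's ancestry was estimated from that group. The sketch below applies that rule to per-individual admixture proportions, assuming the same threshold underlies the continent counts discussed above. The function names and the example proportions are hypothetical.

```python
from collections import Counter

def assigned_ancestries(proportions, threshold=0.05):
    """Ancestry codes whose estimated proportion is at least `threshold`."""
    return tuple(sorted(a for a, p in proportions.items() if p >= threshold))

def tally_by_count(individuals, threshold=0.05):
    """Number of individuals with 1, 2, 3, ... assigned ancestries."""
    return Counter(len(assigned_ancestries(ind, threshold)) for ind in individuals)

# Hypothetical admixture estimates for two individuals (not data from the study).
example = [
    {"EW": 0.80, "NA": 0.17, "AF": 0.03},  # AF falls below 5%, so two ancestries
    {"EW": 0.98, "SA": 0.02},              # a single ancestry
]
print(tally_by_count(example))
```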


Table S1. Magnitude of correlation of PC loadings for three supersets

PC   Set1-Set2 correlation   Set1-Set3 correlation   Set2-Set3 correlation
 1   0.981                   0.979                   0.980
 2   0.876                   0.853                   0.856
 3   0.850                   0.816                   0.809
 4   0.766                   0.776                   0.752
 5   0.658                   0.654                   0.639
 6   0.605                   0.604                   0.592
 7   0.001                   0.001                   0.210
 8   0.043                   0.179                   0.045
 9   0.312                   0.068                   0.089
10   0.007                   0.067                   0.201


Table S2. Self-reported race/ethnicity versus genotyping array. Race/ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Race/Ethnicity   Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272


Table S3. Race/ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis-scanned self-report data

                     Genotyping Array
Race/Ethnicity        EUR    EAS    AFR    LAT
White                2115    267   2140     55
African American        0      0     99      3
Hispanic                5      0      0    255
Asian                   5    183    123      1
Other/Uncertain         6     11     43     24
Total                2131    461   2405    338


Table S4. Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                               Number of SNPs after pruning
European/West Asian and Ashkenazi     38050
European only                         37967
South Asian                           38823
East Asian                            38077
Pacific Islander                      38833
African American                      39849
Latino                                38365
All GERA                              38301


Table S5. Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African        European       Native American
Mexican                    2.3 ± 5.0     67.0 ± 14.5    30.7 ± 13.5
Central-South American     5.5 ± 10.8    65.4 ± 17.9    29.1 ± 16.3
Puerto Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0     7.6 ± 8.1
LAT mean                   4.4 ± 7.9     66.5 ± 15.8    28.6 ± 14.8


Table S6. Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Race/Ethnicity | Genetic Ancestry: EW | AF | EA | NA | SA | EW AF | EW EA | EW NA | EW SA | AF EA | AF SA | EA NA | EA PI | EA SA | PI SA | EW AF EA | EW AF NA | EW AF SA | EW EA NA | EW EA PI | EW EA SA | EW NA SA | EW PI SA | AF EA PI | AF EA SA | AF PI SA | EA PI SA | All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489


Table S7. Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race/Ethnicity     EW   AF   EA   SA   EW   EA   EA   EW   EW   EW   EA   SA   EW   EW   EW   EW   EW   EW   EW   EW   EA  Total
                                       AF   NA   PI   EA   NA   SA   SA   PI   AF   AF   AF   EA   EA   EA   SA   SA   SA
                                                                               EA   NA   SA   NA   PI   SA   NA   PI   PI
White            4302    0    0    0   25    0    0   33   68  131    0    0    0    3    2    3    2    3    2    1    0   4575
Afr Am              0    6    0    0   92    0    0    0    1    0    0    0    0    1    2    0    0    0    0    0    0    102
Asian               4    0  218    8    0    1   41   18    0    2    3    1    1    0    0    1    4    3    0    0    6    311
Latino             37    0    3    0    3    0    0    2  151    0    0    0    1   44    1    5    0    0    8    0    0    255
Other-uncertain    34    0    3    0    6    0    0   15    8    2    1    0    2    2    1    3    4    0    1    0    2     84


Table S8. Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

No. of      Race/        One Continent    Two Continents   Three Continents            Continental Distribution
categories  Ethnicity    Number     %     Number     %     Number     %        All     1 Continent   2 Continents                    3 Continents
One         EW            71992  94.2       4288   5.6        121   0.2      76401     100% EW       75% EWSA, 15% EWNA
One         AA              230   8.6       2362  88.2         87   3.2       2679     100% AF       99% AFEW
One         LT              235   4.9       3135  65.2       1437  29.9       4807     99% EW        99% EWNA                        90% EWNAAF
One         EA             4837  75.7       1419  22.2        133   2.1       6389     100% EA       89% EAPI, 8% EAEW
One         PI                7   7.6         40  43.5         45  48.9         92                   48% PIEW, 33% PIEA              71% PIEWEA
One         SA              272  59.1        175  38.1         13   2.8        460     94% SA        75% SAEW, 14% SAEA
One         NA              555  82.3         87  12.9         32   4.8        674     100% EW       77% EWNA
One         All           78128  85.4      11506  12.6       1868   2.0      91502
            % of Total                                                        93.9
Two         All            2791  51.0       2140  39.1        545  10.0       5476     97% EW        60% EWNA, 22% EWEA, 13% EWAF    29% EWNAAF, 23% EWSANA, 17% EWEANA
            % of Total                                                         5.6
Three       All              48   9.4        345  67.5        118  23.1        511     90% EW        45% EWNA, 36% EWAF, 19% EWEA    31% EWEANA, 20% EWAFNA, 16% EWSANA
            % of Total                                                         0.5
All         All           80967  83.1      13991  14.4       2531   2.6      97489


Table S9. First-degree relatives organized by self-reported race/ethnicity

MZ pair
                    White   African American   Latino   Asian   Other/Uncertain
White                  28                  0        0       0                 0
African American        0                  0        0       0                 0
Latino                  0                  0        2       0                 0
Asian                   0                  0        0       4                 0
Other/Uncertain         0                  0        0       0                 0

Parent (column) - Offspring (row)
                    White   African American   Latino   Asian   Other/Uncertain
White                3044                  1       23       6                21
African American        8                 54        0       1                 1
Latino                122                  6      175       3                 0
Asian                  35                  2        0     205                 1
Other/Uncertain        32                  0        1       0                 0

Full sibs
                    White   African American   Latino   Asian   Other/Uncertain
White                1531                  2       33       5                30
African American                          45        7       0                 0
Latino                                            155       2                 2
Asian                                                     205                 1
Other/Uncertain                                                               0
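
As a worked check on Table S9, the concordance rates can be recomputed from the diagonal of each matrix. The sketch below copies the parent-offspring and full-sib counts from the table and treats a pair as concordant when both members fall in the same category. The variable and function names are illustrative.

```python
import numpy as np

# Rows and columns: White, African American, Latino, Asian, Other/Uncertain.
# Parent (column) by offspring (row) counts from Table S9.
parent_child = np.array([
    [3044,  1,  23,   6, 21],
    [   8, 54,   0,   1,  1],
    [ 122,  6, 175,   3,  0],
    [  35,  2,   0, 205,  1],
    [  32,  0,   1,   0,  0],
])

# Full-sib counts from Table S9; only the upper triangle is populated.
full_sibs = np.array([
    [1531,  2,  33,   5, 30],
    [   0, 45,   7,   0,  0],
    [   0,  0, 155,   2,  2],
    [   0,  0,   0, 205,  1],
    [   0,  0,   0,   0,  0],
])

def concordance(counts):
    """Return (concordant pairs, total pairs, concordant fraction)."""
    concordant = int(np.trace(counts))
    total = int(counts.sum())
    return concordant, total, concordant / total

for name, table in [("parent-child", parent_child), ("full sibs", full_sibs)]:
    c, n, frac = concordance(table)
    print(f"{name}: {c}/{n} = {frac:.1%}")
```

The totals (3741 and 2018 pairs) and the resulting rates (about 93% and 96%) agree with the figures quoted in the abstract.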


Table S10. Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity   Sib 2 Race/Ethnicity   Genetic Ancestry   Number

Both Sibs Self-Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1


Table S11. Race/Ethnicity and Genetic Ancestry for Parent-Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity   Child Race/Ethnicity   Parent Genetic Ancestry   Child Genetic Ancestry   Number

Parent and Child Self-Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


these individuals, we used KPNC administrative databases to assign race/ethnicity. For other individuals, a discrepancy was observed between their original and scanned survey responses. These subjects were also adjudicated to their original form results, as described in SI Methods (File S1).

Genotyping and array assignment

To maximize genome-wide coverage of common and less common variants, four custom Affymetrix Axiom arrays (Hoffmann et al. 2011a,b) were designed for individuals of non-Hispanic white (EUR), East Asian (EAS), African American (AFR), and Latino (LAT) race/ethnicity. The number of SNPs varied among arrays, ranging from 674,518 on the EUR array to 893,631 on the AFR array (Hoffmann et al. 2011b). A total of 254,438 SNPs were common to all four arrays. Genotyping was performed at the University of California, San Francisco, and is described elsewhere (Kvale et al. 2015).

The assignment of subjects to arrays was based on the race/ethnicity categories formed as described above (Table S2 and Table S3). Assignments were hierarchical to accommodate individuals reporting multiple racial/ethnic categories. Specifically, individuals reporting any Latino or Native American race/ethnicity/nationality (possibly in combination with other races/ethnicities/nationalities) were assigned to the LAT array, with the exception of individuals who reported African/African American race/ethnicity and Native American race/ethnicity, who were assigned to the AFR array, and individuals reporting East Asian race/ethnicity and Native American race/ethnicity, who were assigned to the EAS array. All other individuals reporting any African, African American, or Afro-Caribbean race/ethnicity but no Latino race/ethnicity were assigned to the AFR array. All those reporting any East Asian but not African, African American, Afro-Caribbean, or Latino race/ethnicity were assigned to the EAS array. Subjects reporting white-European American, South Asian, Middle Eastern, or Ashkenazi race/ethnicity but none of the previously mentioned races/ethnicities were assigned to the EUR array. Therefore, for example, individuals with European and East Asian race/ethnicity were assigned to the EAS array; individuals with African American and East Asian race/ethnicity/nationality were analyzed on the AFR array. The various arrays were designed to allow for the relevant admixture (Hoffmann et al. 2011b).
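
The hierarchy above can be read as a small decision procedure. The sketch below is one possible reading, not the study's actual assignment code. It uses the collapsed category codes from Table S2 (EW, AA, EA, NA, LT, PI, SA). Two points are assumptions: Latino is checked first, so a person reporting Latino together with African American and Native American goes to LAT, and Pacific Islander, which the text does not mention, is sent to EAS because that is where Table S2 places PI-only individuals.

```python
def assign_array(categories):
    """Return 'LAT', 'AFR', 'EAS' or 'EUR' for a set of category codes.

    Illustrative reading of the hierarchical rules; codes follow Table S2.
    """
    cats = set(categories)
    if "LT" in cats:
        return "LAT"          # any Latino report (assumed to take precedence)
    if "NA" in cats:
        if "AA" in cats:
            return "AFR"      # stated exception: African American + Native American
        if "EA" in cats:
            return "EAS"      # stated exception: East Asian + Native American
        return "LAT"
    if "AA" in cats:
        return "AFR"          # any African ancestry report, no Latino
    if "EA" in cats or "PI" in cats:
        return "EAS"          # PI handling is an assumption based on Table S2
    return "EUR"              # white-European, South Asian, Middle Eastern, Ashkenazi

print(assign_array({"EW", "EA"}))  # European + East Asian
print(assign_array({"AA", "EA"}))  # African American + East Asian
```

The two calls correspond to the examples given in the text, which places the first on the EAS array and the second on the AFR array.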

Quality control

High-quality genotype data for the GERA cohort were obtained by systematic examination and removal of SNP genotypes according to a specific protocol, as described in detail elsewhere (Kvale et al. 2015). For the genetic structure analyses, only SNPs that were common across all four arrays and that had a call rate of at least 99.5% were considered. This set also excluded SNPs that showed extreme deviation from Hardy-Weinberg equilibrium (P < 10^-5). This resulted in a set of 144,799 high-performing SNPs used in further analyses of population structure and admixture.
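
A per-SNP filter of this kind can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which Hardy-Weinberg test was used, so a one-degree-of-freedom chi-square test is substituted here; missing genotypes are assumed to be coded as -1; and the function names are hypothetical. The full protocol is in Kvale et al. (2015).

```python
import math
import numpy as np

def call_rate(g):
    """Fraction of non-missing genotypes (missing coded as -1)."""
    return float(np.mean(g >= 0))

def hwe_pvalue(g):
    """Chi-square (1 df) Hardy-Weinberg P value for genotypes coded 0/1/2."""
    g = g[g >= 0]
    n = g.size
    obs = np.array([np.sum(g == k) for k in (0, 1, 2)], dtype=float)
    p = (2 * obs[2] + obs[1]) / (2 * n)   # frequency of the counted allele
    if p == 0.0 or p == 1.0:
        return 1.0                        # monomorphic: no deviation testable
    exp = n * np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
    chi2 = float(np.sum((obs - exp) ** 2 / exp))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(g, min_call_rate=0.995, min_hwe_p=1e-5):
    """True when the SNP meets both thresholds."""
    return call_rate(g) >= min_call_rate and hwe_pvalue(g) >= min_hwe_p
```

For large cohorts an exact test is often preferred for rare variants; the chi-square version is used here only because it needs nothing beyond the standard library and NumPy.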

Principal components analysis

Filtering: Principal components analysis (PCA) was performed using the smartpca program, which is part of the EIGENSOFT4.2 software package (Patterson et al. 2006). The initial PCA runs were performed separately for individuals genotyped on different arrays. The initial set of 144,799 high-performing SNPs (described above) that were common across all four array types was used in the preliminary analyses. When the HGDP samples were included in subsequent runs and projected onto the GERA principal components (PCs) to facilitate geographic interpretation, 43,988 high-performing SNPs were used. Initial analyses revealed that a number of individuals appeared to be discordant between their genetic ancestry and the array to which they were assigned, and the PCA was re-run after reclassifying these individuals (see SI Methods, File S1).

Table 1 Distribution of responses to survey question on race/ethnicity/nationality, along with proportion female and average ages

Category                                     No.      Proportion female   Mean age (SE)
 1. African American                         3117     0.57                60.66 (0.24)
 2. African                                  129      0.43                52.90 (1.43)
 3. Afro-Caribbean                           119      0.68                56.24 (1.30)
 4. Mexican                                  4613     0.56                56.67 (0.22)
 5. Central-South American                   1034     0.70                55.34 (0.46)
 6. Puerto Rican                             322      0.69                56.68 (0.83)
 7. Cuban                                    106      0.71                55.41 (1.42)
 8. Other Latino/Hispanic                    1545     0.70                57.41 (0.38)
 9. South Asian–Indian/Pakistani             575      0.42                54.58 (0.60)
10. Chinese                                  3433     0.58                56.75 (0.25)
11. Japanese                                 1739     0.61                61.56 (0.34)
12. Korean                                   234      0.66                53.83 (1.04)
13. Filipino                                 1708     0.59                55.59 (0.37)
14. Vietnamese                               317      0.50                53.23 (0.82)
15. Other Southeast Asia                     176      0.64                51.85 (1.10)
16. Native Hawaiian                          144      0.65                58.41 (1.23)
17. Samoan                                   14       0.64                59.36 (3.44)
18. Other Pacific Islander                   132      0.57                53.88 (1.35)
19. Native American Indian/Alaska Native     3884     0.66                61.20 (0.22)
20. White European American                  80,079   0.59                63.27 (0.05)
21. Middle Easterner                         914      0.43                62.18 (0.48)
22. Ashkenazi Jewish                         2399     0.66                62.49 (0.28)
23. Other ethnicity                          75       0.73                56.53 (1.64)

PC projection approach: PCA requires the inversion of a data matrix, which for very large data sets may be computationally challenging. For the East Asian, African American, and Latino subgroups in the GERA data set, the sample sizes were small enough so that all subjects within each subgroup were run together. For example, all 7520 East Asian subjects were run together in one PCA. The white-European American sample, however, is very large and required inverting an 80,000 by 80,000 (6.4 billion elements) matrix. Furthermore, the version of the smartpca program used at the time of analyses was not able to analyze the entire European ancestry sample of approximately 83,000 individuals. Therefore, our approach was to select a large but manageable number of subjects on which to perform an initial PCA and then use the resulting SNP loadings to project the remaining subjects.
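The projection step can be illustrated with a small sketch. Given SNP loadings from the initial PCA, a held-out subject's score on one PC is a weighted sum of standardized genotypes. The standardization used here, (g - 2p) / sqrt(2p(1 - p)) with p the reference-set allele frequency, is a common choice and is an assumption of this sketch; it is not claimed to match the exact scaling used internally by smartpca.

```python
import math


def project_onto_pc(genotypes, ref_freqs, loadings):
    """Score one subject on one PC using loadings from a reference PCA.

    genotypes: allele counts (0, 1, 2) or None for a missing call
    ref_freqs: allele frequencies estimated in the reference subset
    loadings:  SNP loadings for the PC of interest
    """
    score = 0.0
    for g, p, w in zip(genotypes, ref_freqs, loadings):
        if g is None:
            continue  # a missing genotype contributes nothing
        score += w * (g - 2.0 * p) / math.sqrt(2.0 * p * (1.0 - p))
    return score


# Toy example with three SNPs and hypothetical loadings.
ref_freqs = [0.5, 0.2, 0.1]
loadings = [0.6, -0.3, 0.1]
print(project_onto_pc([1, 0, None], ref_freqs, loadings))
```

Because each subject is scored independently, the remaining subjects can be processed in batches without ever forming the full sample-by-sample matrix.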

Because we planned to select a random subset of 20,000 individuals for the initial PCA, on which the remaining subjects would be projected, we examined the effect of using different subsets by calculating the correlations of the SNP loadings for three different random subsets (Supporting Information, Table S1). The numbers of subjects in the three subsets were the following: 18,677 for set 1, 20,121 for set 2, and 17,691 for set 3. For the first six PCs, there was very good correlation of the SNP loadings for all three pairs of subsets, also suggesting that most of the signal regarding genetic structure is derived from the first six PCs. Given these results, we selected a random set of 20,000 European ancestry subjects and projected the remaining subjects onto the PCs obtained.
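One way to carry out the loading comparison described above is a Pearson correlation between the loading vectors of the same PC from two subsets. Because the sign of a principal component is arbitrary, the sketch below reports the absolute value of the correlation; that choice, and the function names, are assumptions for illustration and not taken from the analysis itself.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def loading_agreement(loadings_a, loadings_b):
    """Sign-invariant agreement between two subsets' loadings for one PC."""
    return abs(pearson(loadings_a, loadings_b))


# Hypothetical loadings for the same PC from two random subsets;
# the second subset recovered the PC with the opposite sign.
subset_1 = [0.40, -0.10, 0.25, -0.30]
subset_2 = [-0.38, 0.12, -0.27, 0.29]
print(round(loading_agreement(subset_1, subset_2), 3))
```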

Since the SNPs used for the PCA and admixture estimation were common among all four genotyping arrays, it was possible to produce "global" PCA scores for the GERA subjects. Subsets of individuals from the EUR (15,500), AFR (3100), EAS (5600), and LAT (3000) arrays were used for the initial PCA, and the remaining subjects were projected onto these PCs to obtain PC scores for each individual. Table S4 shows the number of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs (File S1).

Genetic ancestry/admixture estimation

To determine individual ancestral admixture proportions in admixed subjects, such as African Americans and Latinos (and others), the full maximum-likelihood software package frappe (Tang et al. 2005) was used. In this analysis, individual ancestry proportions are estimated by calculating the probability of a set of genome-wide genotypes in an individual as a weighted average of allele frequencies of putative ancestors, where the weights represent the admixture proportions. In general, the same HGDP population samples described above were used to derive allele frequencies for the ancestral groups.

Relationship determination

Relationships were determined using the software KING_v1.4 (Manichaikul et al. 2010), with the robust version that allows for population substructure. KING provides standard thresholds for characterizing monozygotic twin, parent–child, and sibling relationships, which we followed. In our data, these relationships were clearly separated into distinct clusters. All subjects were included, irrespective of the array type used for their analysis. This analysis was based on the 144,799 high-performing SNPs common across the four arrays described above.
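As a hedged illustration of threshold-based relationship calling, the sketch below uses the kinship-coefficient ranges given in the KING documentation (above 0.354 for duplicates or MZ twins, 0.177 to 0.354 for first-degree, 0.0884 to 0.177 for second-degree, and 0.0442 to 0.0884 for third-degree relatives). Separating parent–child from full-sib pairs by the proportion of SNPs with zero alleles shared identical by state (IBS0) follows the general logic of that method, but the numeric IBS0 cutoff here is a hypothetical value chosen for the example, not one taken from the paper.

```python
def classify_pair(kinship, ibs0, ibs0_cutoff=0.005):
    """Classify a pair from a kinship estimate and IBS0 proportion (a sketch).

    ibs0_cutoff is an illustrative assumption: parent-child pairs are
    expected to show essentially no IBS0 sites apart from genotyping error.
    """
    if kinship > 0.354:
        return "MZ twin or duplicate"
    if kinship > 0.177:
        return "parent-child" if ibs0 < ibs0_cutoff else "full sib"
    if kinship > 0.0884:
        return "second degree"
    if kinship > 0.0442:
        return "third degree"
    return "unrelated or more distant"


for kinship, ibs0 in [(0.49, 0.0), (0.25, 0.0005), (0.26, 0.02), (0.12, 0.04)]:
    print(kinship, ibs0, classify_pair(kinship, ibs0))
```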

Results

Distribution of race/ethnicity/nationality categories reported

This multi-ethnic cohort includes representation from a broad distribution of races/ethnicities/nationalities (Table 1). For individuals who reported more than one category, all categories are included; hence the numbers in Table 1 sum to greater than 103,006, the total cohort size. All of the major continents are represented, as are many nationalities/ethnicities. Collapsing the selections into race/ethnicity categories (see Materials and Methods), of the 106,733 total selections, 3365 (3.2%) include an African/African American race/ethnicity, 7620 (7.1%) include a Latino race/ethnicity, 575 (0.5%) include South Asian race/ethnicity, 7607 (7.1%) include an East Asian race/ethnicity, 290 (0.3%) include a Pacific Islander race/ethnicity, 3884 (3.6%) include Native American race/ethnicity, and 83,392 (78.1%) include a white-European race/ethnicity. The majority of those endorsing a Latino race/ethnicity are Mexican and Central American, while the largest groups endorsing an East Asian race/ethnicity are Chinese, Japanese, and Filipino. We also examined the sex and age distributions across the different categories (Table 1). Compared to those reporting white-European race/ethnicity, those endorsing African/Afro-Caribbean, Latino, East Asian, and Pacific Islander race/ethnicity are younger. With the exception of those reporting Mexican nationality, the Latino groups tend to have a higher proportion of females, as do those reporting Ashkenazi Jewish ethnicity; those reporting South Asian and Middle Eastern nationalities have a lower proportion of females.
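The collapsed counts and percentages quoted above can be reproduced from the Table 1 counts. The grouping in this sketch is inferred from the totals reported in the text (for example, placing Middle Easterner and Ashkenazi Jewish with white-European), and the observation that the seven groups sum to 106,733 only when "Other ethnicity" is left out is an arithmetic inference, not a statement from the Methods.

```python
# Counts from Table 1, grouped into the seven major categories (inferred mapping).
GROUPS = {
    "African/African American": {"African American": 3117, "African": 129,
                                 "Afro-Caribbean": 119},
    "Latino": {"Mexican": 4613, "Central-South American": 1034,
               "Puerto Rican": 322, "Cuban": 106, "Other Latino/Hispanic": 1545},
    "South Asian": {"South Asian": 575},
    "East Asian": {"Chinese": 3433, "Japanese": 1739, "Korean": 234,
                   "Filipino": 1708, "Vietnamese": 317, "Other Southeast Asia": 176},
    "Pacific Islander": {"Native Hawaiian": 144, "Samoan": 14,
                         "Other Pacific Islander": 132},
    "Native American": {"Native American Indian/Alaska Native": 3884},
    "White-European": {"White European American": 80079, "Middle Easterner": 914,
                       "Ashkenazi Jewish": 2399},
}

group_totals = {name: sum(counts.values()) for name, counts in GROUPS.items()}
total_selections = sum(group_totals.values())

for name, count in group_totals.items():
    print(f"{name}: {count} ({100 * count / total_selections:.1f}%)")
print("Total selections:", total_selections)
```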

Structure of individuals run on the EUR array

Individuals who self-reported Ashkenazi, Middle Eastern, and non-Hispanic white or European race/ethnicity, but no other ethnicities, were run on the EUR array and analyzed together. The initial analysis showed, as expected, a clear Ashkenazi cluster and a larger cluster depicting the northwest–southeast European cline (Price et al. 2008; Tian et al. 2008c). Figure S1A shows those who self-reported a single ethnicity/nationality, while Figure S1B shows individuals who self-reported more than one. It is evident that endorsement of more than one ethnicity can imply mixed genetic ancestry, but not automatically. Comparing Figure S1, A and B, we observe a higher proportion of individuals with mixed genetic ancestry among those who endorsed both Ashkenazi and European or Middle Eastern ethnicity; however, we still observe a large proportion of nonadmixed individuals, suggesting that endorsement of Ashkenazi and European may reflect a joint perception of ethnicity and continent of origin. By contrast, in Figure S1A we observe a substantial number of individuals who appear to have Ashkenazi and European admixture but self-reported a single category only (most often European).

A similar observation can be made about those endorsing Middle Eastern ethnicity, where those endorsing that as a sole response appear to have more Middle Eastern genetic ancestry, while those endorsing Middle Eastern and European ethnicity show more evidence of European genetic ancestry. However, in Figure S1A, we also observe substantial numbers of individuals reporting only European ethnicity whose genetic ancestry appears to be Middle Eastern, and vice versa. Again, these reports may reflect recent geographic origin as well as nationality/ethnicity.

We also repeated the PC analysis after removing the Ashkenazi and part-Ashkenazi subjects. The PC scores for the Ashkenazi subjects were then derived by projecting their genotypes onto the resulting PCs. Individuals reporting a single ethnicity/nationality are depicted in Figure S2A, while those endorsing more than one are displayed in Figure S2B. The first PC corresponds to a northwest–southeast cline through Europe and the Middle East, and the second PC corresponds to a southwest–northeast cline within Europe, as has been observed in numerous previous studies (Menozzi et al. 1978; Sokal et al. 1991; Cavalli-Sforza et al. 1993; Cavalli-Sforza et al. 1996; Barbujani and Bertorelle 2001; Belle et al. 2006; Seldin et al. 2006; Bauchet et al. 2007; Novembre et al. 2008; Price et al. 2008; Tian et al. 2008c). The first and second PCs account for 31.9% and 13.4% of the total variance of the first 10 PCs, respectively.
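The variance percentages in this section are expressed relative to the first 10 PCs and not relative to all PCs. A minimal sketch of that calculation, assuming the eigenvalues from the PCA are available as a list, is shown below; the eigenvalues in the example are made up for illustration.

```python
def share_of_top_pcs(eigenvalues, k=10):
    """Fraction of the variance of the top-k PCs carried by each of them."""
    top = sorted(eigenvalues, reverse=True)[:k]
    total = sum(top)
    return [value / total for value in top]


# Hypothetical eigenvalues (12 PCs); only the largest 10 enter the denominator.
eigenvalues = [9.0, 4.0, 2.0, 1.5, 1.2, 1.1, 1.0, 0.9, 0.8, 0.5, 0.4, 0.3]
shares = share_of_top_pcs(eigenvalues)
print([round(100 * s, 1) for s in shares[:2]])
```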

Subjects who self-identified as South Asian were also run on the EUR array and subjected to a separate PCA. For these subjects, to characterize the observed PCs and the relationship to geographic ancestry, we employed onomastics. In particular, we analyzed surnames to characterize individuals based on surname geographic region of origin. These subjects are mainly of Indian origin, and the clusters formed in the PCA depict subgroups from different regions of India (Figure S3). The first PC accounts for 19.1% of the total variance of the first 10 PCs, and the second PC accounts for 10.0%. The analysis also shows that northern Indians are genetically closer to Europeans (Reich et al. 2009) and eastern Indians are genetically more similar to East Asian populations. As expected, those reporting European as well as South Asian ethnicity are positioned closer in the diagram to the HGDP Europeans.

Structure of individuals run on the EAS array

Individuals run on the EAS array included subjects self-reporting European and East Asian race/ethnicity and those reporting solely East Asian race/ethnicity. The first PC for these individuals (Figure S4, A and B) is responsible for clustering of individuals with different East Asian–European ancestry proportions (mostly 50% or 75% European). Those with genetic ancestry that is both East Asian and European are most clearly observed in Figure S4B among those self-reporting both races/ethnicities, and there are very few GERA individuals in this figure who do not have mixed genetic ancestry. Among individuals reporting only an East Asian nationality (Figure S4A), the large majority have only East Asian genetic ancestry; however, there are also individuals who appear to have mixed East Asian–European genetic ancestry who self-reported only their East Asian nationality. Of particular interest is the continuous nature of a modest amount of European genetic ancestry in self-identified Filipinos, consistent with older European admixture. The second PC corresponds to the north-to-south cline in East Asia (Su et al. 1999; Tian et al. 2008b; HUGO Pan-Asian SNP Consortium 2009), and the distinct clusters observed that represent different East Asian nationalities are consistent with extensive endogamy in these groups. The first and second PCs account for 59.71% and 20.39% of the total variance of the first 10 PCs, respectively.

Individuals endorsing a Pacific Islander ethnicity are displayed in Figure S5. Those also reporting an East Asian ethnicity appear to cluster more closely to the HGDP East Asians, while those also reporting European ethnicity appear to cluster more closely to the HGDP Europeans. While those reporting Hawaiian and Samoan ethnicity are reasonably well separated from both the HGDP Europeans and East Asians, some individuals who identified as "other Pacific Islander" appear to overlap quite closely with the HGDP East Asians. Also of interest, another subgroup of "other Pacific Islanders" appears to form its own cluster at the bottom of Figure S5. We note that a number of these individuals self-reported both Pacific Islander and South Asian ethnicity. Based on onomastics, these individuals have Indian surnames and are likely to be Indo-Fijians. Approximately 37.5% of the population of Fiji is of Indian origin, according to the 2007 census (http://www.statsfiji.gov.fj). The observation that some Pacific Islanders cluster near to the East Asians is also an indication that clear separation of genetic ancestry for these groups is likely to be challenging.

Structure of individuals run on the AFR array

Subjects run on the AFR array revealed, as expected, extensive African and European genetic ancestry (Figure S6, A and B) (Parra et al. 1998; Fernandez et al. 2003; Tang et al. 2006; Tishkoff et al. 2009; Zakharia et al. 2009). The first PC, which accounts for 63.8% of the total variance of the first 10 PCs, reflects African vs. European genetic ancestry, while the second PC denotes East Asian and/or Native American genetic ancestry. This is consistent with the array assignments, whereby individuals reporting both African/African American race/ethnicity and East Asian or Native American race/ethnicity were assigned to the AFR array. Individuals who self-reported African ancestry only were also subject to onomastics to determine likely countries of origin. We were able to identify subjects of Ethiopian, Eritrean, and Kenyan nationality. For the Kenyans, Figure S6A indicates a location consistent with 100% African genetic ancestry. By contrast, the Ethiopian/Eritrean subjects occupy an intermediate position on the PC1 axis, suggesting proximity to European/Middle Eastern populations. Also of note is the modest variation in their PC1 scores. This is likely due to ancient admixture with Middle Eastern populations (Hodgson et al. 2014). These results confirm that Ethiopians have a unique genetic structure among African populations.

Individuals self-reporting mixed African and East Asian race/ethnicity generally reflect that admixture from the genetic perspective as well (Figure S6B); however, a number of individuals who reported only African American ethnicity also appear to have similar levels of East Asian admixture (Figure S6A). Those reporting both African American and European ethnicity generally occupy a position on the PC1 axis closer to Europeans than those who do not (Figure S6B).

The mean African ancestry proportion in this sample is 73.6% ± 17.4%. There is a reasonably high level of variation in the African genetic ancestry proportion, ranging from 10.6% to 100%.

Structure of individuals run on the LAT array

Latinos may have ancestry deriving from multiple continents, including Europe, Africa, Asia, and the Americas (Bonilla et al. 2004; Tang et al. 2006, 2007). Figure S7A provides the PCA results for all those who endorsed Latino or Native American as their sole race/ethnicity. PC1 represents the European vs. Native American axis of genetic variation, and PC2 represents the African axis of genetic variation. PC1 and PC2 account for 70.95% and 11.57% of the total variance of the first 10 PCs, respectively. Nearly all Latinos show evidence of European/West Asian genetic ancestry, and a substantial subset also show evidence of African genetic ancestry. Similarly, all individuals self-reporting Native American race/ethnicity show some degree of European/West Asian genetic ancestry. Latinos of different nationalities exhibit varying proportions of European, African, and Native American ancestries (Figure S7B). Those reporting Mexican and Central-South American nationality have genetic ancestry that is primarily European and Native American, with slight but varying amounts of African ancestry. Those reporting Cuban nationality have primarily European genetic ancestry, with a small number of individuals having primarily African genetic ancestry. Those reporting Puerto Rican nationality show some Native American genetic ancestry but are primarily admixed between European and African genetic ancestry. Individual ancestral admixture proportions were determined for these subjects and are provided in Table S5.

The LAT array also included a variety of individuals who self-reported more than one race/ethnicity. These individuals are represented in Figure S7C. Individuals who reported European as well as Latino race/ethnicity tend to have slightly more European genetic ancestry than those who did not; similarly, a number of individuals who reported African/African American race/ethnicity in addition to Latino race/ethnicity have substantial African genetic ancestry; however, many such individuals also appear to have the same modest degree of African genetic ancestry as those who reported only a Latino race/ethnicity. Those who reported Native American race/ethnicity in addition to Latino race/ethnicity also appear to have slightly increased Native American genetic ancestry. Those who reported European and Native American race/ethnicity appear to be similar to those who solely reported Native American race/ethnicity: all have European/West Asian genetic ancestry, and while some show evidence of Native American genetic ancestry, European/West Asian is the sole or primary genetic ancestry for the majority. For those with 100% European genetic ancestry and who self-reported only European and Native American race/ethnicity (n = 2155), we also calculated European PCs. Finally, those who reported East Asian in addition to Latino race/ethnicity generally have evidence of East Asian genetic ancestry (as observed in Figure S7C by proximity to the HGDP East Asians), ranging from 25% to 50% and 100%.

Global PCA for GERA subjects

Figure S8 shows that the first PC mainly separates Europeans from East Asians (and Native Americans), and PC2 separates Africans from all the other groups. PC3 seems to separate Native Americans from the other groups, and PC4 also separates Native Americans from the other groups but also shows some separation among the Europeans. PC5 separates the different East Asian groups (mainly north vs. south) and also East Asians from Oceania, and PC6 separates Central–South Asians from the other groups. PC7 again separates the various East Asian regions, and PC8 separates the European groups (mainly north to south). PC9 and PC10 separate East Asians from Oceania, but also the Russians (not labeled) are separated from the other European groups.

Relationship between self-reported race/ethnicity and genetic ancestry

Table S6 displays the full relationship of self-reported race/ethnicity to genetic ancestry for the six continental genetic ancestries of Europe/West Asia, Africa, East Asia, Pacific Islands, South Asia, and the Americas. A genetic continental ancestry was assigned to an individual if her/his estimate for that ancestry was at least 5%. A total of 91,502 individuals (93.9%) reported a single race/ethnicity, 5475 individuals reported two races/ethnicities (5.9%), and 512 individuals (0.5%) reported three races/ethnicities (Table 2). As expected, all individuals who self-identified as European/West Asian had evidence of European/West Asian genetic ancestry. The next largest genetic ancestry component in this group was South Asian (4.3%), primarily attributable to individuals of West Asian ethnicity. Because there is a continuum of genetic ancestry from Europe to West Asia, Central-South Asia to East Asia, genetic overlap exists for individuals whose national origins are geographically between these divisions (Li et al. 2008). Nearly 1% of this group also had evidence of Native American genetic ancestry, while a smaller fraction had evidence of African or East Asian genetic ancestry (0.3% and 0.4%, respectively). Nearly all individuals (99.7%) self-reporting African/African American race/ethnicity had evidence of African genetic ancestry; 91% also had evidence of European genetic ancestry, consistent with broad European admixture among African Americans. Native American and East Asian genetic ancestry occurred in this group at a similar low level as observed in the Europeans/West Asians (1.3% and 0.5%, respectively). Among self-reported East Asians, all had evidence of East Asian genetic ancestry; a sizable proportion (21.7%) also had evidence of Pacific Islander genetic ancestry, but this likely represents difficulty in differentiating East Asian and Pacific Islander genetic ancestry. A modest subgroup (3.4%) had evidence of European/West Asian genetic ancestry (the majority are self-reported Filipinos), while small proportions had evidence of African or Native American genetic ancestry (0.1% and 0.5%, respectively). Among the Latinos, nearly all had evidence of European/West Asian genetic ancestry; a similar high proportion (94.2%) had evidence of Native American genetic ancestry, and an additional 27.7% had evidence of African ancestry. A substantial number of self-reported Pacific Islanders had evidence of East Asian genetic ancestry (91.3%) in addition to Pacific Islander genetic ancestry (66.3%); these results are again likely due to close genetic similarity between East Asians and Pacific Islanders. There is also evidence of substantial European/West Asian and South Asian genetic ancestry in this group (57.6% and 26.1%, respectively). The former reflects a high rate of European admixture among some self-reported Pacific Islander groups, while the latter likely reflects Fijians of Indian origin. Most self-reported South Asians have evidence of South Asian genetic ancestry; a substantial proportion also has evidence of European or East Asian genetic ancestry, likely due to inability to cleanly separate South Asian genetic ancestry from West Asian or East Asian (Li et al. 2008). Among those reporting Native American race/ethnicity, 14.4% have evidence of Native American genetic ancestry and all have evidence of European/West Asian genetic ancestry.
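The 5% rule used to assign continental ancestries can be written as a one-line filter over an individual's admixture estimates. The sketch below assumes the estimates are available as proportions keyed by the abbreviations used in Table 2; the example individual is hypothetical.

```python
ANCESTRIES = ("EW", "AF", "EA", "NA", "PI", "SA")


def ancestries_present(proportions, threshold=0.05):
    """Return the continental ancestries whose estimate is at least `threshold`."""
    return [a for a in ANCESTRIES if proportions.get(a, 0.0) >= threshold]


# Hypothetical admixture estimates for one individual.
estimates = {"EW": 0.72, "AF": 0.00, "EA": 0.24, "NA": 0.04}
print(ancestries_present(estimates))  # ['EW', 'EA']
```

In this example the 4% Native American estimate falls below the cutoff, so the individual would be counted as having two continental ancestries.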

For those with missing or mis-scanned self-reported race/ethnicity, and whose race/ethnicity was derived from KP administrative databases (Table 3 and Table S7), results align closely with those in Table 2. For individuals self-reporting two or three races/ethnicities, the correspondence between the self-report and genetic ancestry is generally quite high (Table 2).

We also observed a decrease in average age and increasing proportion of females with the number of different race/ethnicity/ancestry groups reported (Table 2). While the different minority groups, and in particular the self-reported East Asians and Latinos, are younger on average, those reporting mixed race/ethnicity are even younger. These patterns likely reflect increasing exogamy over time. As expected, these patterns are also reflected in the genetic PC scores, where, for example, the proportion of mixed East Asian/European genetic ancestry increases with decreasing age. The excess of females among those reporting mixed race/ethnicity appears to reflect a reporting preference, as there was no significant difference in the proportion of individuals with mixed genetic ancestry by sex.

A more in-depth examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups is provided in Table S8.

Relatives

We were able to clearly identify first-degree relative (parent–child and full sib) and MZ twin pairs and categorized them based on self-reported race/ethnicity (Figure S9 and Table S9). We also observed thousands of likely second- and third-degree relatives (Figure S9); however, the figure also indicates substantial overlap between these groups based on kinship estimates.

The 34 MZ pairs, who are perfectly concordant for genetic ancestry, are also perfectly concordant for self-reported race/ethnicity. Sib pairs are also (virtually) identical for genetic ancestry. We identified a total of 2018 sib pairs, 1936 (96%) of whom are concordant for self-reported race/ethnicity. Among the 82 discordant pairs, the majority (n = 66) involve pairs where one self-reports Native American or Latino race/ethnicity (solely or in combination with European/West Asian race/ethnicity), while the other reports only European/West Asian race/ethnicity (Table S10); in most of these cases, the genetic ancestry is solely European/West Asian, although in some there is also evidence of Native American genetic ancestry. A modest number of pairs are also discordant in their reports of East Asian race/ethnicity, and again, for most of these the genetic ancestry is solely European/West Asian. Similarly, a few pairs with mixed genetic ancestry including African are discordant in terms of self-reporting of African American race/ethnicity.

We identified 3741 parent–child pairs, of which 3478 (93%) were concordant for self-identified race/ethnicity. The lower rate of concordance compared to the sib pairs is not surprising, as parent and child reports may differ if the child's parents are of different race/ethnicity. In 116 of 263 discordant pairs (Table S11), the child has genetic ancestry that her/his parent does not (Native American in 69 cases, East Asian in 41 cases, and African in 11 cases), and this difference is reflected in the self-report, where the child is self-reporting a race/ethnicity that the parent is not. By contrast, in only 9 cases did the parent have a genetic ancestry that the child did not, and in 8 of these 9 cases the parent has a low level of Native American ancestry (but >5%), whereas the child is below our 5% threshold. Interestingly, in 5 of these cases the parent self-reports as Latino race/ethnicity but the child does not, whereas the opposite is true in 3 of the 8 cases. In an additional 114 cases, the genetic information for parent and child matches but the self-reports for race/ethnicity are different. The largest subgroup (49) of these cases reflects differences in the reporting of Native American or Latino race/ethnicity, and in 47 of these there is no evidence of Native American genetic ancestry in the parent or child; it is approximately equally split as to whether the parent or child reports the Native American race/ethnicity. Among 53 cases where parent and child are discordant for self-report of Latino race/ethnicity, in 23 it is the child who self-reports Latino race/ethnicity, whereas the parent does not. There are 11 cases of discordance for self-report of East Asian race/ethnicity, and in nearly all of them there is no evidence of continental East Asian genetic ancestry. In slightly more than half of these cases, it is the parent who self-reports East Asian race/ethnicity.
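The concordance rates quoted in this section follow directly from the pair counts. The short check below uses only numbers stated in the text (2018 sib pairs with 1936 concordant, and 3741 parent–child pairs with 3478 concordant); the function name is illustrative.

```python
def concordance(concordant_pairs, total_pairs):
    """Return (percent concordant, number of discordant pairs)."""
    percent = 100.0 * concordant_pairs / total_pairs
    return percent, total_pairs - concordant_pairs


sib_pct, sib_discordant = concordance(1936, 2018)
pc_pct, pc_discordant = concordance(3478, 3741)

print(f"Full sibs:    {sib_pct:.1f}% concordant, {sib_discordant} discordant")
print(f"Parent-child: {pc_pct:.1f}% concordant, {pc_discordant} discordant")
```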

Discussion

The RPGEH GERA cohort provides an excellent opportunity to characterize a large, representative northern California population from the perspectives of self-reported race/ethnicity/nationality and genetic ancestry. Overall, the cohort is 80.8% non-Hispanic white and 19.2% minority and includes a broad spectrum of races/ethnicities/nationalities. The results of our PC analyses to characterize genetic structure within each of the major race/ethnicity groups are largely consistent with prior reports.

For the non-Hispanic white individuals, we see a broad spectrum of genetic ancestry ranging from northern Europe to southern Europe and the Middle East. Within that large group, with the exception of Ashkenazi Jews, we see little evidence of distinct clusters. This is consistent with considerable exogamy within this group. By comparison, we do see structure in the East Asian population, correlated with nationality, reflecting continuing endogamy for these nationalities and also recent immigration. On the other hand, we did observe a substantial number of individuals who are admixed between East Asian and European ancestry, reflecting 10% of all those reporting East Asian race/ethnicity. The majority of these reflected individuals with one East Asian and one European parent or one East Asian and three European grandparents. In addition, we noted that for self-reported Filipinos, a substantial proportion have modest levels of European genetic ancestry reflecting older admixture.

Table 2 Proportion of individuals with genetic ancestry from each of six ancestral populations by self-reported race/ethnicity

                           Genetic ancestry
Race/ethnicity   n        EW      AF      EA      NA      PI      SA      Proportion female   Mean age (SE)
One group        91,502                                                   0.59                62.92 (0.04)
EW               76,401   1.000   0.003   0.004   0.009   0.000   0.043   0.59                63.71 (0.05)
AA               2679     0.910   0.997   0.005   0.013   0.000   0.021   0.57                61.28 (0.25)
EA               6389     0.034   0.001   1.000   0.005   0.217   0.008   0.58                58.51 (0.18)
NA               674      0.999   0.022   0.022   0.144   0.000   0.037   0.55                64.34 (0.51)
LT               4807     0.999   0.277   0.008   0.942   0.000   0.024   0.58                57.92 (0.21)
PI               92       0.576   0.000   0.913   0.000   0.663   0.261   0.48                56.89 (1.49)
SA               460      0.307   0.007   0.109   0.004   0.050   0.961   0.39                54.29 (0.67)

Two groups       5476                                                     0.67                57.37 (0.19)
EW/AA            123      1.000   0.976   0.024   0.033   0.000   0.081   0.67                52.76 (1.50)
EW/EA            572      0.960   0.005   0.942   0.014   0.063   0.080   0.68                49.13 (0.65)
EW/NA            2548     1.000   0.008   0.007   0.096   0.000   0.024   0.68                61.63 (0.26)
EW/LT            1564     1.000   0.071   0.010   0.710   0.000   0.068   0.68                54.05 (0.38)
EW/PI            48       1.000   0.000   0.813   0.042   0.625   0.021   0.79                59.64 (2.00)
EW/SA            44       0.955   0.000   0.068   0.045   0.000   0.682   0.66                53.55 (2.26)
AA/EA            29       0.655   0.931   0.828   0.034   0.000   0.069   0.56                50.06 (2.46)
AA/NA            99       1.000   0.99    0.000   0.051   0.000   0.030   0.68                59.67 (1.30)
AA/LT            114      0.991   0.596   0.018   0.754   0.000   0.026   0.34                55.09 (1.42)
AA/SA            13       0.167   0.167   0.167   0.083   0.250   0.833   0.17                54.33 (4.23)
EA/LT            95       0.789   0.042   0.926   0.642   0.063   0.000   0.67                56.07 (1.44)
EA/PI            40       0.275   0.025   1.000   0.000   0.475   0.025   0.60                56.93 (2.37)
EA/SA            17       0.059   0.000   0.765   0.000   0.059   0.235   0.47                62.06 (2.88)
NA/LT            129      1.000   0.140   0.031   0.953   0.000   0.047   0.68                58.22 (1.19)
LT/PI            12       1.000   0.417   0.250   0.917   0.000   0.167   0.64                53.93 (3.95)
LT/SA            10       0.600   0.000   0.400   0.600   0.200   0.500   0.63                61.50 (4.56)

Three groups     512                                                      0.70                53.52 (0.75)
EW/AA/NA         115      0.991   0.991   0.000   0.043   0.000   0.017   0.74                59.71 (1.58)
EW/AA/LT         23       0.957   0.696   0.043   0.522   0.000   0.087   0.52                50.09 (4.11)
EW/EA/NA         32       0.969   0.000   0.875   0.250   0.000   0.125   0.69                46.06 (3.11)
EW/EA/LT         48       1.000   0.041   0.857   0.490   0.000   0.061   0.72                45.98 (2.49)
EW/EA/PI         35       0.943   0.000   1.000   0.029   0.486   0.000   0.67                51.92 (3.02)
EW/NA/LT         198      1.000   0.066   0.000   0.803   0.000   0.086   0.70                53.83 (0.99)

Only those with at most three self-reported race/ethnicities and three genetic ancestries are included; race/ethnicity categories with at least 10 members are shown. For individuals self-reporting two or three races/ethnicities, the correspondence between self-report and genetic ancestry is generally quite high. For example, for those reporting European/West Asian and East Asian race/ethnicity, 96% and 94% have evidence of European/West Asian and East Asian genetic ancestry, respectively; for those reporting African/African American and East Asian race/ethnicity, 93.1% and 82.8% have evidence of African and East Asian genetic ancestry, while 65.5% have evidence of European/West Asian genetic ancestry. Among those reporting European/West Asian and Native American race/ethnicity, 9.6% have evidence of Native American genetic ancestry; for those reporting African/African American and Native American race/ethnicity, 5.1% have evidence of Native American genetic ancestry. EW, European/West Asian; AA, African/African American/Afro-Caribbean; EA, East Asian; NA, Native American/Alaska Native; LT, Latino; PI, Pacific Islander; SA, South Asian. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African ancestry.

As expected, most self-reported African Americans show some degree of European genetic ancestry, with an overall average of 26%. Among individuals self-reporting as African American and East Asian, all showed evidence of genetic ancestry from three continents: Africa, Europe/West Asia, and East Asia.

Latinos are the most complex from a genetic perspective, as they can possess genetic ancestry from essentially any of the major continents. Most of the Latinos in our study derive from Mexico and Central/South America, with smaller proportions from Puerto Rico and Cuba. These individuals have varying proportions of Native American, European, and African genetic ancestry. We also found evidence of East Asian genetic ancestry in some individuals, but these were primarily individuals who self-reported both East Asian and Latino nationalities.

Of note, 17% of the cohort had evidence of genetic ancestry from more than one continent. However, this does not mean that all or even most of these individuals represent recent continental admixture. As has been true in other analyses (Li et al. 2008), genetic similarity between West Asians and South Asians (and to some degree South Asians and East Asians) did not allow for a clear distinction among these genetic ancestries. As such, while some individuals were estimated to have South Asian genetic ancestry, this more likely reflects the difficulty in demarking West Asian vs. South Asian genetic ancestry. A similar situation holds for Pacific Islanders and East Asians, where we and others have shown strong genetic similarity for some Pacific Islander groups with East Asians. Also, some individuals may have reported more than a single race/ethnicity that may reflect recent country of origin in addition to, or rather than, more distant ancestry, with Indo-Fijians as one example.

If we include only individuals with genetic admixture from nonadjacent continents, the proportion with continental admixture is 12%. However, we also note that this fraction depends on our cutoff of 5% for defining genetic admixture, as well as some imprecision in the admixture estimation. Of course, a lower threshold would increase the proportion of the cohort that is considered to be genetically admixed, while a higher threshold would do the opposite.
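The dependence on the cutoff can be illustrated with a minimal sketch. The function name, the array layout, and the four example individuals below are hypothetical and are not taken from the GERA data; the sketch also ignores the distinction between adjacent and nonadjacent continents and simply counts ancestral groups at or above the threshold.

```python
import numpy as np

def admixed_fraction(ancestry, threshold=0.05):
    """Fraction of individuals with at least `threshold` ancestry from two or more groups.

    ancestry: (n_individuals, n_groups) array of estimated proportions (rows sum to 1).
    """
    ancestry = np.asarray(ancestry, dtype=float)
    # Count, per individual, how many ancestral groups reach the cutoff.
    groups_present = (ancestry >= threshold).sum(axis=1)
    return float((groups_present >= 2).mean())

# Hypothetical admixture estimates: four individuals, three ancestral groups.
example = np.array([
    [1.00, 0.00, 0.00],
    [0.96, 0.04, 0.00],
    [0.90, 0.07, 0.03],
    [0.50, 0.50, 0.00],
])

for cutoff in (0.02, 0.05, 0.10):
    print(cutoff, admixed_fraction(example, cutoff))
```

In this toy example the admixed fraction falls as the cutoff rises, which is the qualitative behavior described above.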

As expected in a large cohort such as this, we were easily able to identify a substantial number of close relatives, specifically 34 identical twins, 2018 full sibs, and 3741 parent-child pairs. We also had clear evidence of a large number of likely second- and third-degree relatives, but these kinship groups did not separate clearly from each other. More refined methods may be able to provide more precise kinship estimates.
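As a rough sketch of how such pairs can be separated, the function below applies the conventional KING kinship-coefficient boundaries (powers of 2 between the expected values for each degree) and uses the proportion of SNPs with zero IBS to split first-degree pairs into parent-child and full-sib pairs. The function name and the `ibs0_cutoff` value are assumptions for illustration; the cutoffs actually used for the GERA cohort are not stated in the text.

```python
def classify_pair(kinship, ibs0, ibs0_cutoff=0.001):
    """Assign a relationship class from a kinship estimate and an IBS0 proportion.

    kinship: estimated kinship coefficient for the pair.
    ibs0: proportion of SNPs at which the pair shares zero alleles IBS.
    ibs0_cutoff: hypothetical cutoff; parent-child pairs are expected to have
                 IBS0 near zero apart from genotyping error.
    """
    if kinship > 2 ** -1.5:          # about 0.354
        return "MZ twin or duplicate"
    if kinship > 2 ** -2.5:          # about 0.177
        return "parent-child" if ibs0 < ibs0_cutoff else "full sibs"
    if kinship > 2 ** -3.5:          # about 0.0884
        return "second degree"
    if kinship > 2 ** -4.5:          # about 0.0442
        return "third degree"
    return "unrelated or more distant"

print(classify_pair(0.25, 0.0))    # first-degree pair with no IBS0 sites
print(classify_pair(0.25, 0.02))   # first-degree pair with some IBS0 sites
```

Estimates for second- and third-degree pairs lie close to the boundaries between classes, which is consistent with the observation that these groups did not separate clearly.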

A major goal was to examine the relationship between self-reported race/ethnicity and genetic ancestry. By and large, there was very high correspondence between the two, allowing for the broad range of genetic ancestry that exists among African Americans and Latinos. We were also able to compare the self-report data of identical twins, parent–child pairs, and sib pairs. All MZ twin pairs were concordant, as were most of the sib pairs. However, we did note that for some sib pairs, the self-report data differed. For the majority of these, the discordance related to reporting of Native American or Latino race/ethnicity.

The results obtained here are important for the study of complex genetic disease in this large population-based cohort through association studies, admixture analysis, and admixture mapping, and in particular for investigating observed ethnic variation in diseases and traits. As described previously (Risch et al. 2002; Tang et al. 2006), the strong correspondence also observed here between the social categories of race/ethnicity and genetic ancestry makes dissection of racial/ethnic differences challenging. The patterns that we observed reflect historical and recent mating practices and their impact on genetic variation. On a global level, geography continues to create strong local endogamy, which is also reflected among the recent U.S. migrant populations. However, the increasing frequency of interracial individuals that we observed in this cohort, a reflection of increasing exogamy, will enhance both the complexity of such analyses and the opportunities to investigate the genetic and environmental contributors to racial/ethnic differences. While the advent of myriad genetic markers can provide accurate estimates of individuals' genetic ancestry, the social aspects of race/ethnicity may be more challenging to characterize. For example, in our study, considering the various combinations of 7 race/ethnicity categories that an individual could

Table 3 Proportion of individuals with genetic ancestry from each of six ancestral populations by race/ethnicity as determined by KP administrative databases

                             Genetic ancestry
Race/ethnicity        n      EW     AF     EA     NA     PI     SA
White              4575   1      0.007  0.009  0.017  0.001  0.030
African American    102   0.941  0.990  0.000  0.020  0.000  0.020
Asian               311   0.106  0.003  0.952  0.006  0.167  0.074
Latino              255   0.988  0.192  0.043  0.816  0.000  0.035
Other/uncertain      84   0.929  0.131  0.357  0.167  0.071  0.083

Abbreviations are the same as in Table 2.


endorse, we observed 50 different combinations, and this does not include individuals who endorsed more than 3 (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health, and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment, and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue, for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans' genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch, N., E. Burchard, E. Ziv, and H. Tang, 2002 Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 3: comment2007.

Seldin, M. F., R. Shigeta, P. Villoslada, C. Selmi, J. Tuomilehto et al., 2006 European population substructure: clustering of northern and southern populations. PLoS Genet. 2: e143.

Sokal, R. R., N. L. Oden, and C. Wilson, 1991 Genetic evidence for the spread of agriculture in Europe by demic diffusion. Nature 351: 143–145.

Su, B., J. Xiao, P. Underhill, R. Deka, W. Zhang et al., 1999 Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65: 1718–1724.

Tang, H., J. Peng, P. Wang, and N. J. Risch, 2005 Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28: 289–301.


Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray


GENETICS
Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest-southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A–F) Individuals distributed according to continental differentiation. Admixed individuals also observed.

Figure S9 Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self-reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned, due to a black mark on the glass scanning plate overlaying the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error only affected individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry that were run on the EAS array had no scanning errors and hence were not adjudicated. Reviewing the race/ethnicity self-report of these individuals, the large majority self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those with self-report race/ethnicity data that was not missing or adjudicated due to scanning errors. Because of the time element required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals were assigned to arrays based on the raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was only used in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
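A minimal sketch of pairwise LD pruning of this kind is shown below. It is a greedy illustration under stated assumptions, not the procedure actually used: the function name is hypothetical, genotypes are assumed to be complete 0/1/2 allele counts for polymorphic SNPs sorted by chromosome and position, and the first SNP of each correlated pair is the one kept.

```python
import numpy as np

def ld_prune(genotypes, chrom, pos, r2_max=0.5, window_bp=5_000_000):
    """Greedy pairwise LD pruning; returns the indices of retained SNPs.

    genotypes: (n_samples, n_snps) array of allele counts (0, 1, 2), no missing data.
    chrom, pos: per-SNP chromosome labels and base-pair positions,
                sorted by chromosome and then by position.
    """
    g = np.asarray(genotypes, dtype=float)
    chrom = np.asarray(chrom)
    pos = np.asarray(pos)
    n_snps = g.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for i in range(n_snps):
        if not keep[i]:
            continue
        j = i + 1
        # Compare SNP i with later SNPs on the same chromosome within the window.
        while j < n_snps and chrom[j] == chrom[i] and pos[j] - pos[i] <= window_bp:
            if keep[j]:
                r = np.corrcoef(g[:, i], g[:, j])[0, 1]
                if r * r > r2_max:
                    keep[j] = False   # drop the second member of the pair
            j += 1
    return np.flatnonzero(keep)

# Hypothetical data: SNP 1 duplicates SNP 0; SNP 3 matches SNP 0 but is on another chromosome.
geno = np.array([
    [0, 0, 2, 0],
    [1, 1, 0, 1],
    [2, 2, 1, 2],
    [0, 0, 1, 0],
    [2, 2, 0, 2],
    [1, 1, 2, 1],
])
kept = ld_prune(geno, chrom=[1, 1, 1, 2], pos=[100, 200, 300, 100])
print(kept.tolist())
```

In this toy data set only SNP 1 is removed: SNP 2 has r² = 0.25 with SNP 0, and SNP 3 is never compared with SNP 0 because it lies on a different chromosome.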

For the initial individual admixture analyses, a set of 43,988 SNPs, which were common between the HGDP dataset and our set of 144,799 high-quality SNPs, was used. A set of 38,301 SNPs (used for the global PCA run), remaining after LD pruning and removal of SNPs located in regions with structural variations, was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority, the second continent was South Asia. Hence, this likely does not reflect recent admixture but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent; a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, for nearly all, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.


Table S1 Magnitude of correlation of PC loadings for three supersets

PC   Set1-Set2 correlation   Set1-Set3 correlation   Set2-Set3 correlation
 1   0.981                   0.979                   0.980
 2   0.876                   0.853                   0.856
 3   0.850                   0.816                   0.809
 4   0.766                   0.776                   0.752
 5   0.658                   0.654                   0.639
 6   0.605                   0.604                   0.592
 7   0.001                   0.001                   0.210
 8   0.043                   0.179                   0.045
 9   0.312                   0.068                   0.089
10   0.007                   0.067                   0.201


Table S2 Self-reported race/ethnicity versus genotyping array. Race/ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian. Concatenated codes (e.g., EWAA) denote individuals reporting more than one group.

Race/Ethnicity Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272


Table S3 Race/ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis-scanned self-report data

                      Genotyping array
Race/Ethnicity        EUR    EAS    AFR    LAT
White                2115    267   2140     55
African American        0      0     99      3
Hispanic                5      0      0    255
Asian                   5    183    123      1
Other/Uncertain         6     11     43     24
Total                2131    461   2405    338


Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                              Number of SNPs after pruning
European/West Asian and Ashkenazi    38,050
European only                        37,967
South Asian                          38,823
East Asian                           38,077
Pacific Islander                     38,833
African American                     39,849
Latino                               38,365
All GERA                             38,301


Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African        European       Native American
Mexican                    2.3 ± 5.0     67.0 ± 14.5    30.7 ± 13.5
Central-South American     5.5 ± 10.8    65.4 ± 17.9    29.1 ± 16.3
Puerto Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0     7.6 ± 8.1
LAT mean                   4.4 ± 7.9     66.5 ± 15.8    28.6 ± 14.8


Table S6 Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Rows: self-reported race/ethnicity. Columns (genetic ancestry): EW, AF, EA, NA, SA, EW/AF, EW/EA, EW/NA, EW/SA, AF/EA, AF/SA, EA/NA, EA/PI, EA/SA, PI/SA, EW/AF/EA, EW/AF/NA, EW/AF/SA, EW/EA/NA, EW/EA/PI, EW/EA/SA, EW/NA/SA, EW/PI/SA, AF/EA/PI, AF/EA/SA, AF/PI/SA, EA/PI/SA, All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489


Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race/Ethnicity EW AF EA SA EW/AF EA/NA EA/PI EW/EA EW/NA EW/SA EA/SA SA/PI EW/AF/EA EW/AF/NA EW/AF/SA EW/EA/NA EW/EA/PI EW/EA/SA EW/SA/NA EW/SA/PI EA/SA/PI Total

White 4302 0 0 0 25 0 0 33 68 131 0 0 0 3 2 3 2 3 2 1 0 4575

Afr Am 0 6 0 0 92 0 0 0 1 0 0 0 0 1 2 0 0 0 0 0 0 102

Asian 4 0 218 8 0 1 41 18 0 2 3 1 1 0 0 1 4 3 0 0 6 311

Latino 37 0 3 0 3 0 0 2 151 0 0 0 1 44 1 5 0 0 8 0 0 255

Other/uncertain 34 0 3 0 6 0 0 15 8 2 1 0 2 2 1 3 4 0 1 0 2 84


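Table S8 Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Columns: number of race/ethnicity groups reported; race/ethnicity; number (%) of individuals with genetic ancestry from one, two, and three continents; total number; continental distribution among those with 1, 2, and 3 continents.

Reported  Group  One continent   Two continents  Three continents  All     Continental distribution
One       EW     71992 (94.2)    4288 (5.6)      121 (0.2)         76401   1 continent: 100% EW; 2 continents: 75% EW/SA, 15% EW/NA
One       AA       230 (8.6)     2362 (88.2)      87 (3.2)          2679   1 continent: 100% AF; 2 continents: 99% AF/EW
One       LT       235 (4.9)     3135 (65.2)    1437 (29.9)         4807   1 continent: 99% EW; 2 continents: 99% EW/NA; 3 continents: 90% EW/NA/AF
One       EA      4837 (75.7)    1419 (22.2)     133 (2.1)          6389   1 continent: 100% EA; 2 continents: 89% EA/PI, 8% EA/EW
One       PI         7 (7.6)       40 (43.5)      45 (48.9)           92   2 continents: 48% PI/EW, 33% PI/EA; 3 continents: 71% PI/EW/EA
One       SA       272 (59.1)     175 (38.1)      13 (2.8)           460   1 continent: 94% SA; 2 continents: 75% SA/EW, 14% SA/EA
One       NA       555 (82.3)      87 (12.9)      32 (4.8)           674   1 continent: 100% EW; 2 continents: 77% EW/NA
One       All    78128 (85.4)   11506 (12.6)    1868 (2.0)         91502   % of total: 93.9
Two       All     2791 (51.0)    2140 (39.1)     545 (10.0)         5476   1 continent: 97% EW; 2 continents: 60% EW/NA, 22% EW/EA, 13% EW/AF; 3 continents: 29% EW/NA/AF, 23% EW/SA/NA, 17% EW/EA/NA; % of total: 5.6
Three     All       48 (9.4)      345 (67.5)     118 (23.1)          511   1 continent: 90% EW; 2 continents: 45% EW/NA, 36% EW/AF, 19% EW/EA; 3 continents: 31% EW/EA/NA, 20% EW/AF/NA, 16% EW/SA/NA; % of total: 0.5
All       All    80967 (83.1)   13991 (14.4)    2531 (2.6)         97489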


Table S9 First-degree relatives organized by self-reported race/ethnicity

MZ pairs
                    White   African American   Latino   Asian   Other/Uncertain
White                  28                  0        0       0                 0
African American        0                  0        0       0                 0
Latino                  0                  0        2       0                 0
Asian                   0                  0        0       4                 0
Other/Uncertain         0                  0        0       0                 0

Parent (column) – Offspring (row)
                    White   African American   Latino   Asian   Other/Uncertain
White                3044                  1       23       6                21
African American        8                 54        0       1                 1
Latino                122                  6      175       3                 0
Asian                  35                  2        0     205                 1
Other/Uncertain        32                  0        1       0                 0

Full sibs
                    White   African American   Latino   Asian   Other/Uncertain
White                1531                  2       33       5                30
African American                          45        7       0                 0
Latino                                            155       2                 2
Asian                                                     205                 1
Other/Uncertain                                                               0


Table S10 Race/ethnicity and genetic ancestry for sib pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity, Sib 2 Race/Ethnicity, Genetic Ancestry, Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1


Table S11 Race/ethnicity and genetic ancestry for parent-child pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity, Child Race/Ethnicity, Parent Genetic Ancestry, Child Genetic Ancestry, Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1

analyses. When the HGDP samples were included in subsequent runs and projected onto the GERA principal components (PCs) to facilitate geographic interpretation, 43,988 high-performing SNPs were used. Initial analyses revealed that a number of individuals appeared to be discordant between their genetic ancestry and the array to which they were assigned, and the PCA was rerun after reclassifying these individuals (see SI Methods, File S1).

PC projection approach: PCA requires the inversion of a data matrix, which for very large data sets may be computationally challenging. For the East Asian, African American, and Latino subgroups in the GERA data set, the sample sizes were small enough so that all subjects within each subgroup were run together. For example, all 7520 East Asian subjects were run together in one PCA. The white-European American sample, however, is very large and required inverting an 80,000 by 80,000 (6.4 billion elements) matrix. Furthermore, the version of the Smartpca program used at the time of analyses was not able to analyze the entire European ancestry sample of 83,000 individuals. Therefore, our approach was to select a large but manageable number of subjects on which to perform an initial PCA and then use the resulting SNP loadings to project the remaining subjects.
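The projection idea can be illustrated with a minimal sketch. This is not the Smartpca implementation: the function names `fit_pca` and `project`, the use of a singular value decomposition, and the normalization of each SNP by sqrt(p(1 - p)) are all assumptions made for illustration. The sketch fits PCs on a reference subset and then scores additional subjects with the reference SNP loadings.

```python
import numpy as np

def fit_pca(genotypes, n_pcs):
    """Fit PCs on a reference subset (hypothetical helper).

    genotypes: (n_subjects, n_snps) array of allele counts coded 0/1/2.
    Returns the per-SNP mean and scale, the SNP loadings, and the PC scores.
    """
    mean = genotypes.mean(axis=0)
    p = mean / 2.0                       # estimated allele frequency per SNP
    scale = np.sqrt(p * (1.0 - p))       # assumed normalization
    scale[scale == 0] = 1.0              # guard against monomorphic SNPs
    z = (genotypes - mean) / scale
    # Rows of vt are orthonormal SNP-loading vectors, ordered by variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[:n_pcs].T              # shape (n_snps, n_pcs)
    scores = z @ loadings
    return mean, scale, loadings, scores

def project(genotypes, mean, scale, loadings):
    """Score new subjects using the reference centering, scaling, and loadings."""
    return ((genotypes - mean) / scale) @ loadings
```

Because the remaining subjects are only multiplied by a fixed loading matrix, the cost of scoring them grows linearly with their number, which is what makes the two-stage approach attractive for a very large sample.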

Because we planned to select a random subset of 20,000 individuals for the initial PCA, on which the remaining subjects would be projected, we examined the effect of using different subsets by calculating the correlations of the SNP loadings for three different random subsets (Supporting Information, Table S1). The numbers of subjects in the three subsets were the following: 18,677 for set 1, 20,121 for set 2, and 17,691 for set 3. For the first six PCs, there was very good correlation of the SNP loadings for all three pairs of subsets, also suggesting that most of the signal regarding genetic structure is derived from the first six PCs. Given these results, we selected a random set of 20,000 European ancestry subjects and projected the remaining subjects onto the PCs obtained.
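A comparison of this kind could be sketched as follows. The function name `loading_correlations` is hypothetical, and taking the absolute value of the Pearson correlation is an assumption: it reflects the fact that the sign of a PC is arbitrary, so the same axis can come out flipped in two runs.

```python
import numpy as np

def loading_correlations(loadings_a, loadings_b):
    """Absolute Pearson correlation of SNP loadings, PC by PC.

    loadings_a, loadings_b: (n_snps, n_pcs) arrays from two PCA runs
    on different random subsets, computed over the same SNPs.
    """
    n_pcs = loadings_a.shape[1]
    return np.array([
        abs(np.corrcoef(loadings_a[:, k], loadings_b[:, k])[0, 1])
        for k in range(n_pcs)
    ])
```

Values near 1 for a given PC would indicate that the two subsets recover essentially the same axis of variation.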

Since the SNPs used for the PCA and admixture estimation were common among all four genotyping arrays, it was possible to produce "global" PCA scores for the GERA subjects. Subsets of individuals from the EUR (15,500), AFR (3100), EAS (5600), and LAT (3000) arrays were used for the initial PCA, and the remaining subjects were projected onto these PCs to obtain PC scores for each individual. Table S4 shows the number of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs (File S1).

Genetic ancestry/admixture estimation

To determine individual ancestral admixture proportions in admixed subjects such as African Americans and Latinos (and others), the full maximum-likelihood software package frappe (Tang et al. 2005) was used. In this analysis, individual ancestry proportions are estimated by calculating the probability of a set of genome-wide genotypes in an individual as a weighted average of allele frequencies of putative ancestors, where the weights represent the admixture proportions. In general, the same HGDP population samples described above were used to derive allele frequencies for the ancestral groups.

Relationship determination

Relationships were determined using the software KING_v1.4 (Manichaikul et al. 2010), with the robust version that allows for population substructure. KING provides standard thresholds for characterizing monozygotic twin, parent–child, and sibling relationships, which we followed. In our data, these relationships were clearly separated into distinct clusters. All subjects were included, irrespective of the array type used for their analysis. This analysis was based on the 144,799 high-performing SNPs common across the four arrays described above.
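As an illustration of threshold-based classification, the sketch below applies the kinship-coefficient cutoffs given in the KING documentation (above 0.354 for duplicates/MZ twins, 0.177 to 0.354 for first-degree, 0.0884 to 0.177 for second-degree, and 0.0442 to 0.0884 for third-degree relatives). The function name `classify_pair` and the IBS0 cutoff used to separate parent–child pairs from full sibs are assumptions for illustration only; the exact rule used in the study is not stated here.

```python
def classify_pair(kinship, ibs0, ibs0_po_max=0.005):
    """Classify a pair from its kinship coefficient and IBS0 proportion.

    kinship: estimated kinship coefficient for the pair.
    ibs0: proportion of SNPs at which the pair shares zero alleles.
    ibs0_po_max: hypothetical cutoff; parent-child pairs share at least
        one allele at every locus, so their IBS0 should be close to zero.
    """
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        # First-degree: separate parent-child from full sibs by IBS0.
        return "parent-child" if ibs0 < ibs0_po_max else "full sib"
    if kinship > 0.0884:
        return "second degree"
    if kinship > 0.0442:
        return "third degree"
    return "unrelated"
```

With real data, the IBS0 cutoff would need to be chosen by inspecting the observed clusters, since genotyping error produces a small nonzero IBS0 even for true parent–child pairs.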

Results

Distribution of race/ethnicity/nationality categories reported

This multi-ethnic cohort includes representation from a broad distribution of races/ethnicities/nationalities (Table 1). For individuals who reported more than one category, all categories are included; hence the numbers in Table 1 sum to greater than 103,006, the total cohort size. All of the major continents are represented, as are many nationalities/ethnicities. Collapsing the selections into race/ethnicity categories (see Materials and Methods), of the 106,733 total selections, 3365 (3.2%) include an African/African American race/ethnicity, 7620 (7.1%) include a Latino race/ethnicity, 575 (0.5%) include South Asian race/ethnicity, 7607 (7.1%) include an East Asian race/ethnicity, 290 (0.3%) include a Pacific Islander race/ethnicity, 3884 (3.6%) include Native American race/ethnicity, and 83,392 (78.1%) include a white-European race/ethnicity. The majority of those endorsing a Latino race/ethnicity are Mexican and Central American, while the largest groups endorsing an East Asian race/ethnicity are Chinese, Japanese, and Filipino. We also examined the sex and age distributions across the different categories (Table 1). Compared to those reporting white-European race/ethnicity, those endorsing African/Afro-Caribbean, Latino, East Asian, and Pacific Islander race/ethnicity are younger; with the exception of those reporting Mexican nationality, the Latino groups tend to have a higher proportion of females, as do those reporting Ashkenazi Jewish ethnicity; those reporting South Asian and Middle Eastern nationalities have a lower proportion of females.

Structure of individuals run on the EUR array

Individuals who self-reported Ashkenazi, Middle Eastern, and non-Hispanic white or European race/ethnicity, but no other ethnicities, were run on the EUR array and analyzed together. The initial analysis showed, as expected, a clear Ashkenazi cluster and a larger cluster depicting the northwest–southeast European cline (Price et al. 2008; Tian et al. 2008c). Figure S1A shows those who self-reported a single ethnicity/nationality, while Figure S1B shows individuals who self-reported more than one. It is evident that endorsement of more than one ethnicity can imply mixed genetic ancestry, but not automatically. Comparing Figure S1, A and B, we observe a higher proportion of individuals with mixed genetic ancestry among those who endorsed both Ashkenazi and European or Middle Eastern ethnicity; however, we still observe a large proportion of nonadmixed individuals, suggesting that endorsement of Ashkenazi and European may reflect a joint perception of ethnicity and continent of origin. By contrast, in Figure S1A we observe a substantial number of individuals who appear to have Ashkenazi and European admixture but self-reported a single category only (most often European).

A similar observation can be made about those endorsing Middle Eastern ethnicity, where those endorsing that as a sole response appear to have more Middle Eastern genetic ancestry, while those endorsing Middle Eastern and European ethnicity show more evidence of European genetic ancestry. However, in Figure S1A we also observe substantial numbers of individuals reporting only European ethnicity whose genetic ancestry appears to be Middle Eastern, and vice versa. Again, these reports may reflect recent geographic origin as well as nationality/ethnicity.

We also repeated the PC analysis after removing the Ashkenazi and part-Ashkenazi subjects. The PC scores for the Ashkenazi subjects were then derived by projecting their genotypes onto the resulting PCs. Individuals reporting a single ethnicity/nationality are depicted in Figure S2A, while those endorsing more than one are displayed in Figure S2B. The first PC corresponds to a northwest–southeast cline through Europe and the Middle East, and the second PC corresponds to a southwest–northeast cline within Europe, as has been observed in numerous previous studies (Menozzi et al. 1978; Sokal et al. 1991; Cavalli-Sforza et al. 1993; Cavalli-Sforza et al. 1996; Barbujani and Bertorelle 2001; Belle et al. 2006; Seldin et al. 2006; Bauchet et al. 2007; Novembre et al. 2008; Price et al. 2008; Tian et al. 2008c). The first and second PCs account for 31.9% and 13.4% of the total variance of the first 10 PCs, respectively.

Subjects who self-identified as South Asian were also run on the EUR array and subjected to a separate PCA. For these subjects, to characterize the observed PCs and the relationship to geographic ancestry, we employed onomastics. In particular, we analyzed surnames to characterize individuals based on surname geographic region of origin. These subjects are mainly of Indian origin, and the clusters formed in the PCA depict subgroups from different regions of India (Figure S3). The first PC accounts for 19.1% of the total variance of the first 10 PCs and the second PC accounts for 10.0%. The analysis also shows that northern Indians are genetically closer to Europeans (Reich et al. 2009) and eastern Indians are genetically more similar to East Asian populations. As expected, those reporting European as well as South Asian ethnicity are positioned closer in the diagram to the HGDP Europeans.

Structure of individuals run on the EAS array

Individuals run on the EAS array included subjects self-reporting European and East Asian race/ethnicity and those reporting solely East Asian race/ethnicity. The first PC for these individuals (Figure S4, A and B) is responsible for clustering of individuals with different East Asian–European ancestry proportions (mostly 50% or 75% European). Those with genetic ancestry that is both East Asian and European are most clearly observed in Figure S4B among those self-reporting both races/ethnicities, and there are very few GERA individuals in this figure who do not have mixed genetic ancestry. Among individuals reporting only an East Asian nationality (Figure S4A), the large majority have only East Asian genetic ancestry; however, there are also individuals who appear to have mixed East Asian–European genetic ancestry who self-reported only their East Asian nationality. Of particular interest is the continuous nature of a modest amount of European genetic ancestry in self-identified Filipinos, consistent with older European admixture. The second PC corresponds to the north-to-south cline in East Asia (Su et al. 1999; Tian et al. 2008b; HUGO Pan-Asian SNP Consortium 2009), and the distinct clusters observed that represent different East Asian nationalities are consistent with extensive endogamy in these groups. The first and second PCs account for 59.71% and 20.39% of the total variance of the first 10 PCs, respectively.

Individuals endorsing a Pacific Islander ethnicity are displayed in Figure S5. Those also reporting an East Asian ethnicity appear to cluster more closely to the HGDP East Asians, while those also reporting European ethnicity appear to cluster more closely to the HGDP Europeans. While those reporting Hawaiian and Samoan ethnicity are reasonably well separated from both the HGDP Europeans and East Asians, some individuals who identified as "other Pacific Islander" appear to overlap quite closely with the HGDP East Asians. Also of interest, another subgroup of "other Pacific Islanders" appears to form its own cluster at the bottom of Figure S5. We note that a number of these individuals self-reported both Pacific Islander and South Asian ethnicity. Based on onomastics, these individuals have Indian surnames and are likely to be Indo-Fijians. Approximately 37.5% of the population of Fiji is of Indian origin, according to the 2007 census (http://www.statsfiji.gov.fj). The observation that some Pacific Islanders cluster near to the East Asians is also an indication that clear separation of genetic ancestry for these groups is likely to be challenging.

Structure of individuals run on the AFR array

Subjects run on the AFR array revealed, as expected, extensive African and European genetic ancestry (Figure S6, A and B) (Parra et al. 1998; Fernandez et al. 2003; Tang et al. 2006; Tishkoff et al. 2009; Zakharia et al. 2009). The first PC, which accounts for 63.8% of the total variance of the first 10 PCs, reflects African vs. European genetic ancestry, while the second PC denotes East Asian and/or Native American genetic ancestry. This is consistent with the array assignments, whereby individuals reporting both African/African American race/ethnicity and East Asian or Native American race/ethnicity were assigned to the AFR array. Individuals who self-reported African ancestry only were also subject to onomastics to determine likely countries of origin. We were able to identify subjects of Ethiopian, Eritrean, and Kenyan nationality. For the Kenyans, Figure S6A indicates a location consistent with 100% African genetic ancestry. By contrast, the Ethiopian/Eritrean subjects occupy an intermediate position on the PC1 axis, suggesting proximity to European/Middle Eastern populations. Also of note is the modest variation in their PC1 scores. This is likely due to ancient admixture with Middle Eastern populations (Hodgson et al. 2014). These results confirm that Ethiopians have a unique genetic structure among African populations.

Individuals self-reporting mixed African and East Asian race/ethnicity generally reflect that admixture from the genetic perspective as well (Figure S6B); however, a number of individuals who reported only African American ethnicity also appear to have similar levels of East Asian admixture (Figure S6A). Those reporting both African American and European ethnicity generally occupy a position on the PC1 axis closer to Europeans than those who do not (Figure S6B).

The mean African ancestry proportion in this sample is 73.6% ± 17.4%. There is a reasonably high level of variation in the African genetic ancestry proportion, ranging from 10.6% to 100%.

Structure of individuals run on the LAT array

Latinos may have ancestry deriving from multiple continents, including Europe, Africa, Asia, and the Americas (Bonilla et al. 2004; Tang et al. 2006, 2007). Figure S7A provides the PCA results for all those who endorsed Latino or Native American as their sole race/ethnicity. PC1 represents the European vs. Native American axis of genetic variation and PC2 represents the African axis of genetic variation. PC1 and PC2 account for 70.95% and 11.57% of the total variance of the first 10 PCs, respectively. Nearly all Latinos show evidence of European/West Asian genetic ancestry, and a substantial subset also show evidence of African genetic ancestry. Similarly, all individuals self-reporting Native American race/ethnicity show some degree of European/West Asian genetic ancestry. Latinos of different nationalities exhibit varying proportions of European, African, and Native American ancestries (Figure S7B). Those reporting Mexican and Central-South American nationality have genetic ancestry that is primarily European and Native American, with slight but varying amounts of African ancestry. Those reporting Cuban nationality have primarily European genetic ancestry, with a small number of individuals having primarily African genetic ancestry. Those reporting Puerto Rican nationality show some Native American genetic ancestry but are primarily admixed between European and African genetic ancestry. Individual ancestral admixture proportions were determined for these subjects and are provided in Table S5.

The LAT array also included a variety of individuals who self-reported more than one race/ethnicity. These individuals are represented in Figure S7C. Individuals who reported European as well as Latino race/ethnicity tend to have slightly more European genetic ancestry than those who did not; similarly, a number of individuals who reported African/African American race/ethnicity in addition to Latino race/ethnicity have substantial African genetic ancestry; however, many such individuals also appear to have the same modest degree of African genetic ancestry as those who reported only a Latino race/ethnicity. Those who reported Native American race/ethnicity in addition to Latino race/ethnicity also appear to have slightly increased Native American genetic ancestry. Those who reported European and Native American race/ethnicity appear to be similar to those who solely reported Native American race/ethnicity: all have European/West Asian genetic ancestry, and while some show evidence of Native American genetic ancestry, European/West Asian is the sole or primary genetic ancestry for the majority. For those with 100% European genetic ancestry and who self-reported only European and Native American race/ethnicity (n = 2155), we also calculated European PCs. Finally, those who reported East Asian in addition to Latino race/ethnicity generally have evidence of East Asian genetic ancestry (as observed in Figure S7C by proximity to the HGDP East Asians), ranging from 25% to 50% and 100%.

Global PCA for GERA subjects

Figure S8 shows that the first PC mainly separates Europeans from East Asians (and Native Americans), and PC2 separates Africans from all the other groups. PC3 seems to separate Native Americans from the other groups, and PC4 also separates Native Americans from the other groups but also shows some separation among the Europeans. PC5 separates the different East Asian groups (mainly north vs. south) and also East Asians from Oceania, and PC6 separates Central–South Asians from the other groups. PC7 again separates the various East Asian regions, and PC8 separates the European groups (mainly north to south). PC9 and PC10 separate East Asians from Oceania, but also the Russians (not labeled) are separated from the other European groups.

Relationship between self-reported race/ethnicity and genetic ancestry

Table S6 displays the full relationship of self-reported race/ethnicity to genetic ancestry for the six continental genetic ancestries of Europe/West Asia, Africa, East Asia, Pacific Islands, South Asia, and the Americas. A genetic continental ancestry was assigned to an individual if her/his estimate for that ancestry was at least 5%. A total of 91,502 individuals (93.9%) reported a single race/ethnicity, 5475 individuals reported two races/ethnicities (5.9%), and 512 individuals (0.5%) reported three races/ethnicities (Table 2). As expected, all individuals who self-identified as European/West Asian had evidence of European/West Asian genetic ancestry. The next largest genetic ancestry component in this group was South Asian (4.3%), primarily attributable to individuals of West Asian ethnicity. Because there is a continuum of genetic ancestry from Europe through West Asia and Central-South Asia to East Asia, genetic overlap exists for individuals whose national origins are geographically between these divisions (Li et al. 2008). Nearly 1% of this group also had evidence of Native American genetic ancestry, while a smaller fraction had evidence of African or East Asian genetic ancestry (0.3% and 0.4%, respectively). Nearly all individuals (99.7%) self-reporting African/African American race/ethnicity had evidence of African genetic ancestry; 91% also had evidence of European genetic ancestry, consistent with broad European admixture among African Americans. Native American and East Asian genetic ancestry occurred in this group at a similar low level as observed in the Europeans/West Asians (1.3% and 0.5%, respectively). Among self-reported East Asians, all had evidence of East Asian genetic ancestry; a sizable proportion (21.7%) also had evidence of Pacific Islander genetic ancestry, but this likely represents difficulty in differentiating East Asian and Pacific Islander genetic ancestry. A modest subgroup (3.4%) had evidence of European/West Asian genetic ancestry (the majority are self-reported Filipinos), while small proportions had evidence of African or Native American genetic ancestry (0.1% and 0.5%, respectively). Among the Latinos, nearly all had evidence of European/West Asian genetic ancestry; a similar high proportion (94.2%) had evidence of Native American genetic ancestry, and an additional 27.7% had evidence of African ancestry. A substantial number of self-reported Pacific Islanders had evidence of East Asian genetic ancestry (91.3%) in addition to Pacific Islander genetic ancestry (66.3%); these results are again likely due to close genetic similarity between East Asians and Pacific Islanders. There is also evidence of substantial European/West Asian and South Asian genetic ancestry in this group (57.6% and 26.1%, respectively). The former reflects a high rate of European admixture among some self-reported Pacific Islander groups, while the latter likely reflects Fijians of Indian origin. Most self-reported South Asians have evidence of South Asian genetic ancestry; a substantial proportion also has evidence of European or East Asian genetic ancestry, likely due to inability to cleanly separate South Asian genetic ancestry from West Asian or East Asian (Li et al. 2008). Among those reporting Native American race/ethnicity, 14.4% have evidence of Native American genetic ancestry and all have evidence of European/West Asian genetic ancestry.

For those with missing or mis-scanned self-reported race/ethnicity and whose race/ethnicity was derived from KP administrative databases (Table 3 and Table S7), results align closely with those in Table 2. For individuals self-reporting two or three races/ethnicities, the correspondence between the self-report and genetic ancestry is generally quite high (Table 2).

We also observed a decrease in average age and increasing proportion of females with the number of different race/ethnicity/ancestry groups reported (Table 2). While the different minority groups, and in particular the self-reported East Asians and Latinos, are younger on average, those reporting mixed race/ethnicity are even younger. These patterns likely reflect increasing exogamy over time. As expected, these patterns are also reflected in the genetic PC scores, where, for example, the proportion of mixed East Asian/European genetic ancestry increases with decreasing age. The excess of females among those reporting mixed race/ethnicity appears to reflect a reporting preference, as there was no significant difference in the proportion of individuals with mixed genetic ancestry by sex.

A more in-depth examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups is provided in Table S8.

Relatives

We were able to clearly identify first-degree relative (parent–child and full sib) and MZ twin pairs and categorized them based on self-reported race/ethnicity (Figure S9 and Table S9). We also observed thousands of likely second- and third-degree relatives (Figure S9); however, the figure also indicates substantial overlap between these groups based on kinship estimates.

The 34 MZ pairs, who are perfectly concordant for genetic ancestry, are also perfectly concordant for self-reported race/ethnicity. Sib pairs are also (virtually) identical for genetic ancestry. We identified a total of 2018 sib pairs, 1936 (96%) of whom are concordant for self-reported race/ethnicity. Among the 82 discordant pairs, the majority (n = 66) involve pairs where one self-reports Native American or Latino race/ethnicity (solely or in combination with European/West Asian race/ethnicity) while the other reports only European/West Asian race/ethnicity (Table S10); in most of these cases the genetic ancestry is solely European/West Asian, although in some there is also evidence of Native American genetic ancestry. A modest number of pairs are also discordant in their reports of East Asian race/ethnicity, and again for most of these the genetic ancestry is solely European/West Asian. Similarly, a few pairs with mixed genetic ancestry, including African, are discordant in terms of self-reporting of African American race/ethnicity.

We identified 3741 parent–child pairs, of which 3478 (93%) were concordant for self-identified race/ethnicity. The lower rate of concordance compared to the sib pairs is not surprising, as parent and child reports may differ if the child's parents are of different race/ethnicity. In 116 of 263 discordant pairs (Table S11), the child has genetic ancestry that her/his parent does not (Native American in 69 cases, East Asian in 41 cases, and African in 11 cases), and this difference is reflected in the self-report, where the child is self-reporting a race/ethnicity that the parent is not. By contrast, in only 9 cases did the parent have a genetic ancestry that the child did not, and in 8 of these 9 cases the parent has a low level of Native American ancestry (but >5%), whereas the child is below our 5% threshold. Interestingly, in 5 of these cases the parent self-reports as Latino race/ethnicity but the child does not, whereas the opposite is true in 3 of the 8 cases. In an additional 114 cases, the genetic information for parent and child matches, but the self-reports for race/ethnicity are different. The largest subgroup (49) of these cases reflects differences in the reporting of Native American or Latino race/ethnicity, and in 47 of these there is no evidence of Native American genetic ancestry in the parent or child; it is approximately equally split as to whether the parent or child reports the Native American race/ethnicity. Among 53 cases where parent and child are discordant for self-report of Latino race/ethnicity, in 23 it is the child who self-reports Latino race/ethnicity, whereas the parent does not. There are 11 cases of discordance for self-report of East Asian race/ethnicity, and in nearly all of them there is no evidence of continental East Asian genetic ancestry. In slightly more than half of these cases, it is the parent who self-reports East Asian race/ethnicity.

Discussion

The RPGEH GERA cohort provides an excellent opportunity to characterize a large, representative northern California population from the perspectives of self-reported race/ethnicity/nationality and genetic ancestry. Overall, the cohort is 80.8% non-Hispanic white and 19.2% minority and includes a broad spectrum of races/ethnicities/nationalities. The results of our PC analyses to characterize genetic structure within each of the major race/ethnicity groups are largely consistent with prior reports.

For the non-Hispanic white individuals, we see a broad spectrum of genetic ancestry ranging from northern Europe to southern Europe and the Middle East. Within that large group, with the exception of Ashkenazi Jews, we see little evidence of distinct clusters. This is consistent with considerable exogamy within this group. By comparison, we do see structure in the East Asian population, correlated with nationality, reflecting continuing endogamy for these nationalities and also recent immigration. On the other hand, we did observe a substantial number of individuals who are admixed between East Asian and European ancestry, reflecting 10% of all those reporting East Asian race/ethnicity. The majority of these reflected individuals with one East Asian and one European parent or one East Asian and three European grandparents. In addition, we noted that for self-reported Filipinos, a substantial proportion have modest levels of European genetic ancestry, reflecting older admixture.

Table 2 Proportion of individuals with genetic ancestry from each of six ancestral populations by self-reported race/ethnicity

                          Genetic ancestry
Race/ethnicity  n       EW     AF     EA     NA     PI     SA     Female  Mean age (SE)
One group       91,502                                            0.59    62.92 (0.04)
EW              76,401  1.000  0.003  0.004  0.009  0.000  0.043  0.59    63.71 (0.05)
AA              2679    0.910  0.997  0.005  0.013  0.000  0.021  0.57    61.28 (0.25)
EA              6389    0.034  0.001  1.000  0.005  0.217  0.008  0.58    58.51 (0.18)
NA              674     0.999  0.022  0.022  0.144  0.000  0.037  0.55    64.34 (0.51)
LT              4807    0.999  0.277  0.008  0.942  0.000  0.024  0.58    57.92 (0.21)
PI              92      0.576  0.000  0.913  0.000  0.663  0.261  0.48    56.89 (1.49)
SA              460     0.307  0.007  0.109  0.004  0.050  0.961  0.39    54.29 (0.67)
Two groups      5476                                              0.67    57.37 (0.19)
EWAA            123     1.000  0.976  0.024  0.033  0.000  0.081  0.67    52.76 (1.50)
EWEA            572     0.960  0.005  0.942  0.014  0.063  0.080  0.68    49.13 (0.65)
EWNA            2548    1.000  0.008  0.007  0.096  0.000  0.024  0.68    61.63 (0.26)
EWLT            1564    1.000  0.071  0.010  0.710  0.000  0.068  0.68    54.05 (0.38)
EWPI            48      1.000  0.000  0.813  0.042  0.625  0.021  0.79    59.64 (2.00)
EWSA            44      0.955  0.000  0.068  0.045  0.000  0.682  0.66    53.55 (2.26)
AAEA            29      0.655  0.931  0.828  0.034  0.000  0.069  0.56    50.06 (2.46)
AANA            99      1.000  0.99   0.000  0.051  0.000  0.030  0.68    59.67 (1.30)
AALT            114     0.991  0.596  0.018  0.754  0.000  0.026  0.34    55.09 (1.42)
AASA            13      0.167  0.167  0.167  0.083  0.250  0.833  0.17    54.33 (4.23)
EALT            95      0.789  0.042  0.926  0.642  0.063  0.000  0.67    56.07 (1.44)
EAPI            40      0.275  0.025  1.000  0.000  0.475  0.025  0.60    56.93 (2.37)
EASA            17      0.059  0.000  0.765  0.000  0.059  0.235  0.47    62.06 (2.88)
NALT            129     1.000  0.140  0.031  0.953  0.000  0.047  0.68    58.22 (1.19)
LTPI            12      1.000  0.417  0.250  0.917  0.000  0.167  0.64    53.93 (3.95)
LTSA            10      0.600  0.000  0.400  0.600  0.200  0.500  0.63    61.50 (4.56)
Three groups    512                                               0.70    53.52 (0.75)
EWAANA          115     0.991  0.991  0.000  0.043  0.000  0.017  0.74    59.71 (1.58)
EWAALT          23      0.957  0.696  0.043  0.522  0.000  0.087  0.52    50.09 (4.11)
EWEANA          32      0.969  0.000  0.875  0.250  0.000  0.125  0.69    46.06 (3.11)
EWEALT          48      1.000  0.041  0.857  0.490  0.000  0.061  0.72    45.98 (2.49)
EWEAPI          35      0.943  0.000  1.000  0.029  0.486  0.000  0.67    51.92 (3.02)
EWNALT          198     1.000  0.066  0.000  0.803  0.000  0.086  0.70    53.83 (0.99)

Only those with at most three self-reported race/ethnicities and three genetic ancestries are included; race/ethnicity categories with at least 10 members are shown. For individuals self-reporting two or three races/ethnicities, the correspondence between self-report and genetic ancestry is generally quite high. For example, for those reporting European/West Asian and East Asian race/ethnicity, 96% and 94% have evidence of European/West Asian and East Asian genetic ancestry, respectively; for those reporting African/African American and East Asian race/ethnicity, 93.1% and 82.8% have evidence of African and East Asian genetic ancestry, while 65.5% have evidence of European/West Asian genetic ancestry. Among those reporting European/West Asian and Native American race/ethnicity, 9.6% have evidence of Native American genetic ancestry; for those reporting African/African American and Native American race/ethnicity, 5.1% have evidence of Native American genetic ancestry. EW, European/West Asian; AA, African/African American/Afro-Caribbean; EA, East Asian; NA, Native American/Alaska Native; LT, Latino; PI, Pacific Islander; SA, South Asian. Genetic ancestry abbreviations are the same except for AF, which represents sub-Saharan African ancestry.

As expected, most self-reported African Americans show some degree of European genetic ancestry, with an overall average of 26%. Among individuals self-reporting as African American and East Asian, all showed evidence of genetic ancestry from three continents: Africa, Europe/West Asia, and East Asia.

Latinos are the most complex from a genetic perspective, as they can possess genetic ancestry from essentially any of the major continents. Most of the Latinos in our study derive from Mexico and Central/South America, with smaller proportions from Puerto Rico and Cuba. These individuals have varying proportions of Native American, European, and African genetic ancestry. We also found evidence of East Asian genetic ancestry in some individuals, but these were primarily individuals who self-reported both East Asian and Latino nationalities.

Of note, 17% of the cohort had evidence of genetic ancestry from more than one continent. However, this does not mean that all or even most of these individuals represent recent continental admixture. As has been true in other analyses (Li et al. 2008), genetic similarity between West Asians and South Asians (and to some degree South Asians and East Asians) did not allow for a clear distinction among these genetic ancestries. As such, while some individuals were estimated to have South Asian genetic ancestry, this more likely reflects the difficulty in demarking West Asian vs. South Asian genetic ancestry. A similar situation holds for Pacific Islanders and East Asians, where we and others have shown strong genetic similarity for some Pacific Islander groups with East Asians. Also, some individuals may have reported more than a single race/ethnicity that may reflect recent country of origin in addition to, or rather than, more distant ancestry, with Indo-Fijians as one example.

If we include only individuals with genetic admixture from nonadjacent continents, the proportion with continental admixture is 12%. However, we also note that this fraction depends on our cutoff of 5% for defining genetic admixture, as well as some imprecision in the admixture estimation. Of course, a lower threshold would increase the proportion of the cohort that is considered to be genetically admixed, while a higher threshold would do the opposite.
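The dependence on the cutoff can be made concrete with a small sketch. The function name `fraction_admixed` and the input layout are hypothetical, and the sketch counts any two ancestries at or above the threshold, without the restriction to nonadjacent continents described above.

```python
import numpy as np

def fraction_admixed(proportions, threshold=0.05):
    """Fraction of individuals with two or more ancestries at or above threshold.

    proportions: (n_individuals, n_ancestries) array of estimated
        ancestry proportions, each row summing to 1.
    threshold: minimum proportion for an ancestry to be assigned
        (5% in the text; varied here to show the sensitivity).
    """
    has_ancestry = proportions >= threshold
    return float(np.mean(has_ancestry.sum(axis=1) >= 2))
```

Evaluating the function over a grid of thresholds on the same matrix of estimates gives a simple sensitivity check: the fraction can only stay the same or fall as the threshold rises.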

As expected in a large cohort such as this, we were easily able to identify a substantial number of close relatives, specifically 34 identical twin pairs, 2018 full-sib pairs, and 3741 parent–child pairs. We also had clear evidence of a large number of likely second- and third-degree relatives, but these kinship groups did not separate clearly from each other. More refined methods may be able to provide more precise kinship estimates.
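As a rough illustration of how such pairs can be separated, the sketch below applies the kinship-coefficient inference ranges published for KING (Manichaikul et al. 2010), the tool named in the caption of Figure S9. The exact cutoffs applied in this study are not stated here. The IBS0 bound used to split parent–child pairs from full sibs (`po_ibs0_max`) is a hypothetical value chosen only for the example, and the function name is likewise illustrative.

```python
def classify_pair(kinship: float, ibs0: float, po_ibs0_max: float = 0.005) -> str:
    """Classify a pair from a KING-style kinship coefficient and IBS0 proportion.

    Kinship ranges follow the inference criteria of Manichaikul et al. (2010).
    `po_ibs0_max` is an assumed bound: parent-child pairs share at least one
    allele at every locus, so their IBS0 proportion should be close to zero.
    """
    if kinship > 0.354:
        return "MZ twin/duplicate"
    if kinship > 0.177:
        return "parent-child" if ibs0 < po_ibs0_max else "full sib"
    if kinship > 0.0884:
        return "second degree"
    if kinship > 0.0442:
        return "third degree"
    return "unrelated"


print(classify_pair(0.49, 0.000))  # kinship near 0.5
print(classify_pair(0.25, 0.001))  # first degree, essentially no IBS0 loci
print(classify_pair(0.25, 0.020))  # first degree, some IBS0 loci
```

Second- and third-degree ranges are adjacent and narrow, which is consistent with the observation above that these groups did not separate clearly.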

A major goal was to examine the relationship between self-reported race/ethnicity and genetic ancestry. By and large, there was very high correspondence between the two, allowing for the broad range of genetic ancestry that exists among African Americans and Latinos. We were also able to compare the self-report data of identical twins, parent–child pairs, and sib pairs. All MZ twin pairs were concordant, as were most of the sib pairs. However, we did note that for some sib pairs, the self-report data differed. For the majority of these, the discordance related to reporting of Native American or Latino race/ethnicity.

The results obtained here are important for the study of complex genetic disease in this large population-based cohort through association studies, admixture analysis, and admixture mapping, and in particular for investigating observed ethnic variation in diseases and traits. As described previously (Risch et al. 2002; Tang et al. 2006), the strong correspondence also observed here between the social categories of race/ethnicity and genetic ancestry makes dissection of racial/ethnic differences challenging. The patterns that we observed reflect historical and recent mating practices and their impact on genetic variation. On a global level, geography continues to create strong local endogamy, which is also reflected among the recent U.S. migrant populations. However, the increasing frequency of interracial individuals that we observed in this cohort, a reflection of increasing exogamy, will enhance both the complexity of such analyses and the opportunities to investigate the genetic and environmental contributors to racial/ethnic differences. While the advent of myriad genetic markers can provide accurate estimates of individuals' genetic ancestry, the social aspects of race/ethnicity may be more challenging to characterize. For example, in our study, considering the various combinations of 7 race/ethnicity categories that an individual could endorse, we observed 50 different combinations, and this does not include individuals who endorsed more than three (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.

Table 3 Proportion of individuals with genetic ancestry from each of six ancestral populations, by race/ethnicity as determined by KP administrative databases

                          Genetic ancestry
Race/ethnicity       n    EW     AF     EA     NA     PI     SA
White             4575    1      0.007  0.009  0.017  0.001  0.030
African American   102    0.941  0.990  0.000  0.020  0.000  0.020
Asian              311    0.106  0.003  0.952  0.006  0.167  0.074
Latino             255    0.988  0.192  0.043  0.816  0.000  0.035
Other/uncertain     84    0.929  0.131  0.357  0.167  0.071  0.083

Abbreviations are the same as in Table 2.

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health, and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment, and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans' genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch, N., E. Burchard, E. Ziv, and H. Tang, 2002 Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 3: comment2007.

Seldin, M. F., R. Shigeta, P. Villoslada, C. Selmi, J. Tuomilehto et al., 2006 European population substructure: clustering of northern and southern populations. PLoS Genet. 2: e143.

Sokal, R. R., N. L. Oden, and C. Wilson, 1991 Genetic evidence for the spread of agriculture in Europe by demic diffusion. Nature 351: 143–145.

Su, B., J. Xiao, P. Underhill, R. Deka, W. Zhang et al., 1999 Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65: 1718–1724.

Tang, H., J. Peng, P. Wang, and N. J. Risch, 2005 Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28: 289–301.

Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray


GENETICS
Supporting Information
www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

2 SI Y Banda et al

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest-southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A-F) Individuals distributed according to continental differentiation. Admixed individuals are also observed.

Figure S9 Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self-reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned, due to a black mark on the glass scanning plate overlaying the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error only affected individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. Reviewing the race/ethnicity self-report of these individuals, the large majority self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those with self-report race/ethnicity data that were not missing or adjudicated due to scanning errors. Because of the time element required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals were assigned to arrays based on the raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was only used in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
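The pairwise pruning rule described above can be sketched as follows. This is a minimal greedy illustration under stated assumptions, not the pipeline used in the study: the text does not say which member of a correlated pair was removed, so the sketch keeps the SNP encountered first; r² is computed as the squared Pearson correlation of 0/1/2 genotype counts; and the function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Snp = Tuple[str, int, Sequence[int]]  # (chromosome, position in bp, genotypes coded 0/1/2)


def r_squared(x: Sequence[int], y: Sequence[int]) -> float:
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic SNP: correlation undefined, treat as 0
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_prune(snps: List[Snp], r2_max: float = 0.5, window_bp: int = 5_000_000) -> List[int]:
    """Return indices of SNPs kept; `snps` must be sorted by chromosome, then position."""
    kept: List[int] = []
    for i, (chrom, pos, geno) in enumerate(snps):
        redundant = False
        # Walk back through kept SNPs until leaving the window or the chromosome.
        for j in reversed(kept):
            chrom_j, pos_j, geno_j = snps[j]
            if chrom_j != chrom or pos - pos_j > window_bp:
                break
            if r_squared(geno, geno_j) > r2_max:
                redundant = True
                break
        if not redundant:
            kept.append(i)
    return kept


toy = [
    ("2", 1_000, [0, 1, 2, 1, 0, 2]),
    ("2", 2_000, [0, 1, 2, 1, 0, 2]),      # perfectly correlated with the first, nearby
    ("2", 9_000_000, [0, 1, 2, 1, 0, 2]),  # correlated, but farther than 5 Mb from the first
    ("6", 1_500, [2, 0, 1, 0, 2, 1]),      # different chromosome
]
print(ld_prune(toy))
```

Only the second toy SNP is dropped: it is within the window of, and perfectly correlated with, the first.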

For the initial individual admixture analyses, a set of 43,988 SNPs that were common between the HGDP dataset and our set of 144,799 high-quality SNPs was used. A set of 38,301 SNPs (used for the global PCA run) remaining after LD pruning and removal of SNPs located in regions with structural variations was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority, the second continent was South Asia. Hence, this likely does not reflect recent admixture but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent; a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, for nearly all, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories; the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.


Table S1 Magnitude of correlation of PC loadings for three supersets

PC    Set1-Set2 correlation    Set1-Set3 correlation    Set2-Set3 correlation
 1    0.981                    0.979                    0.980
 2    0.876                    0.853                    0.856
 3    0.850                    0.816                    0.809
 4    0.766                    0.776                    0.752
 5    0.658                    0.654                    0.639
 6    0.605                    0.604                    0.592
 7    0.001                    0.001                    0.210
 8    0.043                    0.179                    0.045
 9    0.312                    0.068                    0.089
10    0.007                    0.067                    0.201


Table S2 Self-reported race/ethnicity versus genotyping array. Race/ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

RaceEthnicity Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272


Table S3 Race/ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis-scanned self-report data

                        Genotyping array
Race/ethnicity        EUR    EAS    AFR    LAT
White                2115    267   2140     55
African American        0      0     99      3
Hispanic                5      0      0    255
Asian                   5    183    123      1
Other/Uncertain         6     11     43     24
Total                2131    461   2405    338


Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                                Number of SNPs after pruning
European/West Asian and Ashkenazi      38,050
European only                          37,967
South Asian                            38,823
East Asian                             38,077
Pacific Islander                       38,833
African American                       39,849
Latino                                 38,365
All GERA                               38,301


Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African         European        Native American
Mexican                    2.3 ± 5.0      67.0 ± 14.5     30.7 ± 13.5
Central-South American     5.5 ± 10.8     65.4 ± 17.9     29.1 ± 16.3
Puerto Rican              12.3 ± 14.1     75.5 ± 14.8     12.4 ± 7.6
Cuban                     12.7 ± 20.5     79.7 ± 20.0      7.6 ± 8.1
LAT mean                   4.4 ± 7.9      66.5 ± 15.8     28.6 ± 14.8


Table S6 Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Race/ethnicity (rows) by genetic ancestry (columns, in order): EW, AF, EA, NA, SA, EW/AF, EW/EA, EW/NA, EW/SA, AF/EA, AF/SA, EA/NA, EA/PI, EA/SA, PI/SA, EW/AF/EA, EW/AF/NA, EW/AF/SA, EW/EA/NA, EW/EA/PI, EW/EA/SA, EW/NA/SA, EW/PI/SA, AF/EA/PI, AF/EA/SA, AF/PI/SA, EA/PI/SA, All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489


Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6. Rows are genetic ancestry; columns are race/ethnicity.

Genetic ancestry    White   Afr Am   Asian   Latino   Other/uncertain
EW                   4302        0       4       37                34
AF                      0        6       0        0                 0
EA                      0        0     218        3                 3
SA                      0        0       8        0                 0
EW/AF                  25       92       0        3                 6
EA/NA                   0        0       1        0                 0
EA/PI                   0        0      41        0                 0
EW/EA                  33        0      18        2                15
EW/NA                  68        1       0      151                 8
EW/SA                 131        0       2        0                 2
EA/SA                   0        0       3        0                 1
SA/PI                   0        0       1        0                 0
EW/AF/EA                0        0       1        1                 2
EW/AF/NA                3        1       0       44                 2
EW/AF/SA                2        2       0        1                 1
EW/EA/NA                3        0       1        5                 3
EW/EA/PI                2        0       4        0                 4
EW/EA/SA                3        0       3        0                 0
EW/SA/NA                2        0       0        8                 1
EW/SA/PI                1        0       0        0                 0
EA/SA/PI                0        0       6        0                 2
Total                4575      102     311      255                84


Table S8 Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Genetic ancestry

Race/ethnicity    One continent     Two continents    Three continents    All
                  Number     %      Number     %      Number     %        Number   % of total
One    EW          71992   94.2       4288    5.6        121    0.2        76401
One    AA            230    8.6       2362   88.2         87    3.2         2679
One    LT            235    4.9       3135   65.2       1437   29.9         4807
One    EA           4837   75.7       1419   22.2        133    2.1         6389
One    PI              7    7.6         40   43.5         45   48.9           92
One    SA            272   59.1        175   38.1         13    2.8          460
One    NA            555   82.3         87   12.9         32    4.8          674
One    All         78128   85.4      11506   12.6       1868    2.0        91502   93.9
Two    All          2791   51.0       2140   39.1        545   10.0         5476    5.6
Three  All            48    9.4        345   67.5        118   23.1          511    0.5
All    All         80967   83.1      13991   14.4       2531    2.6        97489

Continental distribution

Race/ethnicity    1 continent    2 continents                        3 continents
One    EW         100% EW        75% EW/SA; 15% EW/NA
One    AA         100% AF        99% AF/EW
One    LT         99% EW         99% EW/NA                           90% EW/NA/AF
One    EA         100% EA        89% EA/PI; 8% EA/EW
One    PI                        48% PI/EW; 33% PI/EA                71% PI/EW/EA
One    SA         94% SA         75% SA/EW; 14% SA/EA
One    NA         100% EW        77% EW/NA
Two    All        97% EW         60% EW/NA; 22% EW/EA; 13% EW/AF     29% EW/NA/AF; 23% EW/SA/NA; 17% EW/EA/NA
Three  All        90% EW         45% EW/NA; 36% EW/AF; 19% EW/EA     31% EW/EA/NA; 20% EW/AF/NA; 16% EW/SA/NA


Table S9 First-degree relatives organized by self-reported race/ethnicity

MZ pairs
                    White   African American   Latino   Asian   Other/Uncertain
White                  28                  0        0       0                 0
African American        0                  0        0       0                 0
Latino                  0                  0        2       0                 0
Asian                   0                  0        0       4                 0
Other/Uncertain         0                  0        0       0                 0

Parent (column) – Offspring (row)
                    White   African American   Latino   Asian   Other/Uncertain
White                3044                  1       23       6                21
African American        8                 54        0       1                 1
Latino                122                  6      175       3                 0
Asian                  35                  2        0     205                 1
Other/Uncertain        32                  0        1       0                 0

Full sibs
                    White   African American   Latino   Asian   Other/Uncertain
White                1531                  2       33       5                30
African American                          45        7       0                 0
Latino                                            155       2                 2
Asian                                                     205                 1
Other/Uncertain                                                               0
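The concordance rates quoted for relative pairs can be checked against the counts in Table S9. The sketch below is a simple recomputation, assuming that "concordant" means both members of a pair fall in the same one of the five groups shown (the diagonal of each matrix); the variable and function names are illustrative only.

```python
# Counts transcribed from Table S9; groups in order:
# White, African American, Latino, Asian, Other/Uncertain.
parent_child = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]
# Full-sib counts are reported as an upper triangle; empty cells are entered as 0.
full_sibs = [
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
]


def concordance(table):
    """Return (concordant pairs, total pairs, concordance rate) for a square table."""
    total = sum(sum(row) for row in table)
    agree = sum(table[i][i] for i in range(len(table)))
    return agree, total, agree / total


for name, table in (("parent-child", parent_child), ("full sibs", full_sibs)):
    agree, total, rate = concordance(table)
    print(f"{name}: {agree}/{total} = {rate:.3f}")
```

The totals recover the 3741 parent–child pairs and 2018 full-sib pairs reported in the text, and the rates round to 93% and 96%.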


Table S10 Race/ethnicity and genetic ancestry for sib pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity | Sib 2 Race/Ethnicity | Genetic Ancestry | Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1


Table S11 Race/ethnicity and genetic ancestry for parent-child pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity | Child Race/Ethnicity | Parent Genetic Ancestry | Child Genetic Ancestry | Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


northwest–southeast European cline (Price et al. 2008; Tian et al. 2008c). Figure S1A shows those who self-reported a single ethnicity/nationality, while Figure S1B shows individuals who self-reported more than one. It is evident that endorsement of more than one ethnicity can imply mixed genetic ancestry, but not automatically. Comparing Figure S1, A and B, we observe a higher proportion of individuals with mixed genetic ancestry among those who endorsed both Ashkenazi and European or Middle Eastern ethnicity; however, we still observe a large proportion of nonadmixed individuals, suggesting that endorsement of Ashkenazi and European may reflect a joint perception of ethnicity and continent of origin. By contrast, in Figure S1A we observe a substantial number of individuals who appear to have Ashkenazi and European admixture but self-reported a single category only (most often European).

A similar observation can be made about those endorsing Middle Eastern ethnicity, where those endorsing that as a sole response appear to have more Middle Eastern genetic ancestry, while those endorsing Middle Eastern and European ethnicity show more evidence of European genetic ancestry. However, in Figure S1A we also observe substantial numbers of individuals reporting only European ethnicity whose genetic ancestry appears to be Middle Eastern, and vice versa. Again, these reports may reflect recent geographic origin as well as nationality/ethnicity.

We also repeated the PC analysis after removing the Ashkenazi and part-Ashkenazi subjects. The PC scores for the Ashkenazi subjects were then derived by projecting their genotypes onto the resulting PCs. Individuals reporting a single ethnicity/nationality are depicted in Figure S2A, while those endorsing more than one are displayed in Figure S2B. The first PC corresponds to a northwest–southeast cline through Europe and the Middle East, and the second PC corresponds to a southwest–northeast cline within Europe, as has been observed in numerous previous studies (Menozzi et al. 1978; Sokal et al. 1991; Cavalli-Sforza et al. 1993; Cavalli-Sforza et al. 1996; Barbujani and Bertorelle 2001; Belle et al. 2006; Seldin et al. 2006; Bauchet et al. 2007; Novembre et al. 2008; Price et al. 2008; Tian et al. 2008c). The first and second PCs account for 31.9% and 13.4% of the total variance of the first 10 PCs, respectively.
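The projection step and the "share of the variance of the first 10 PCs" can be illustrated with a minimal sketch. It assumes that per-SNP reference means and PC loading vectors have already been computed from the reference subjects; the function names and toy numbers are hypothetical, and the text does not specify the genotype normalization actually used, so none is applied here.

```python
from typing import List, Sequence


def variance_share(eigenvalues: Sequence[float], top_k: int = 10) -> List[float]:
    """Share of each of the first top_k eigenvalues in their combined total."""
    top = list(eigenvalues[:top_k])
    total = sum(top)
    return [ev / total for ev in top]


def project(genotypes: Sequence[float],
            ref_means: Sequence[float],
            loadings: Sequence[Sequence[float]]) -> List[float]:
    """Center one held-out sample by the reference SNP means, then take its
    dot product with each PC loading vector to obtain its PC scores."""
    centered = [g - m for g, m in zip(genotypes, ref_means)]
    return [sum(c * w for c, w in zip(centered, pc)) for pc in loadings]


# Toy data (hypothetical): three SNPs, two PCs learned on a reference panel.
ref_means = [1.0, 0.5, 1.5]
loadings = [[0.6, 0.8, 0.0],
            [0.0, 0.0, 1.0]]

scores = project([2.0, 0.5, 0.5], ref_means, loadings)
shares = variance_share([4.0, 2.0, 1.0, 1.0], top_k=4)
```

Because the held-out subjects do not contribute to the loadings, their scores show where they fall relative to the structure of the reference subjects without influencing that structure.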

Subjects who self-identified as South Asian were also run on the EUR array and subjected to a separate PCA. For these subjects, to characterize the observed PCs and the relationship to geographic ancestry, we employed onomastics. In particular, we analyzed surnames to characterize individuals based on surname geographic region of origin. These subjects are mainly of Indian origin, and the clusters formed in the PCA depict subgroups from different regions of India (Figure S3). The first PC accounts for 19.1% of the total variance of the first 10 PCs, and the second PC accounts for 10.0%. The analysis also shows that northern Indians are genetically closer to Europeans (Reich et al. 2009) and eastern Indians are genetically more similar to East Asian populations. As expected, those reporting European as well as South Asian ethnicity are positioned closer in the diagram to the HGDP Europeans.

Structure of individuals run on the EAS array

Individuals run on the EAS array included subjects self-reporting European and East Asian race/ethnicity and those reporting solely East Asian race/ethnicity. The first PC for these individuals (Figure S4, A and B) is responsible for clustering of individuals with different East Asian–European ancestry proportions (mostly 50% or 75% European). Those with genetic ancestry that is both East Asian and European are most clearly observed in Figure S4B among those self-reporting both races/ethnicities, and there are very few GERA individuals in this figure who do not have mixed genetic ancestry. Among individuals reporting only an East Asian nationality (Figure S4A), the large majority have only East Asian genetic ancestry; however, there are also individuals who appear to have mixed East Asian–European genetic ancestry who self-reported only their East Asian nationality. Of particular interest is the continuous nature of a modest amount of European genetic ancestry in self-identified Filipinos, consistent with older European admixture. The second PC corresponds to the north-to-south cline in East Asia (Su et al. 1999; Tian et al. 2008b; HUGO Pan-Asian SNP Consortium 2009), and the distinct clusters observed that represent different East Asian nationalities are consistent with extensive endogamy in these groups. The first and second PCs account for 59.71% and 20.39% of the total variance of the first 10 PCs, respectively.
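The 50% and 75% European proportions are what simple pedigree arithmetic predicts for recent admixture. As a minimal sketch (the function name is hypothetical, and the calculation ignores the random variation in how much DNA is actually inherited from each grandparent), the expected autosomal ancestry fraction is the average over the four grandparents:

```python
from fractions import Fraction
from typing import Sequence


def expected_fraction(grandparent_fractions: Sequence[int]) -> Fraction:
    """Expected ancestry fraction of a grandchild, taken as the mean of the
    four grandparents' fractions for that ancestry (1 = entirely, 0 = none)."""
    if len(grandparent_fractions) != 4:
        raise ValueError("exactly four grandparents are required")
    return sum(Fraction(x) for x in grandparent_fractions) / 4


# One European parent and one East Asian parent (two European grandparents).
one_european_parent = expected_fraction([1, 1, 0, 0])

# Three European grandparents and one East Asian grandparent.
three_european_grandparents = expected_fraction([1, 1, 1, 0])
```

Older admixture, such as that suggested for self-identified Filipinos, would not concentrate at these fractions, which fits the continuous distribution described above.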

Individuals endorsing a Pacific Islander ethnicity are displayed in Figure S5. Those also reporting an East Asian ethnicity appear to cluster more closely to the HGDP East Asians, while those also reporting European ethnicity appear to cluster more closely to the HGDP Europeans. While those reporting Hawaiian and Samoan ethnicity are reasonably well separated from both the HGDP Europeans and East Asians, some individuals who identified as "other Pacific Islander" appear to overlap quite closely with the HGDP East Asians. Also of interest, another subgroup of "other Pacific Islanders" appears to form its own cluster at the bottom of Figure S5. We note that a number of these individuals self-reported both Pacific Islander and South Asian ethnicity. Based on onomastics, these individuals have Indian surnames and are likely to be Indo-Fijians. Approximately 37.5% of the population of Fiji is of Indian origin, according to the 2007 census (http://www.statsfiji.gov.fj). The observation that some Pacific Islanders cluster near to the East Asians is also an indication that clear separation of genetic ancestry for these groups is likely to be challenging.

Structure of individuals run on the AFR array

Subjects run on the AFR array revealed, as expected, extensive African and European genetic ancestry (Figure S6, A and B) (Parra et al. 1998; Fernandez et al. 2003; Tang et al. 2006; Tishkoff et al. 2009; Zakharia et al. 2009). The first PC, which accounts for 63.8% of the total variance of the first 10 PCs, reflects African vs. European genetic ancestry, while the second PC denotes East Asian and/or Native American genetic ancestry. This is consistent with the array assignments, whereby individuals reporting both African/African American race/ethnicity and East Asian or Native American race/ethnicity were assigned to the AFR array. Individuals who self-reported African ancestry only were also subject to onomastics to determine likely countries of origin. We were able to identify subjects of Ethiopian, Eritrean, and Kenyan nationality. For the Kenyans, Figure S6A indicates a location consistent with 100% African genetic ancestry. By contrast, the Ethiopian/Eritrean subjects occupy an intermediate position on the PC1 axis, suggesting proximity to European/Middle Eastern populations. Also of note is the modest variation in their PC1 scores. This is likely due to ancient admixture with Middle Eastern populations (Hodgson et al. 2014). These results confirm that Ethiopians have a unique genetic structure among African populations.

Individuals self-reporting mixed African and East Asian race/ethnicity generally reflect that admixture from the genetic perspective as well (Figure S6B); however, a number of individuals who reported only African American ethnicity also appear to have similar levels of East Asian admixture (Figure S6A). Those reporting both African American and European ethnicity generally occupy a position on the PC1 axis closer to Europeans than those who do not (Figure S6B).

The mean African ancestry proportion in this sample is 73.6% ± 17.4%. There is a reasonably high level of variation in the African genetic ancestry proportion, ranging from 10.6% to 100%.

Structure of individuals run on the LAT array

Latinos may have ancestry deriving from multiple continents, including Europe, Africa, Asia, and the Americas (Bonilla et al. 2004; Tang et al. 2006, 2007). Figure S7A provides the PCA results for all those who endorsed Latino or Native American as their sole race/ethnicity. PC1 represents the European vs. Native American axis of genetic variation, and PC2 represents the African axis of genetic variation. PC1 and PC2 account for 70.95% and 11.57% of the total variance of the first 10 PCs, respectively. Nearly all Latinos show evidence of European/West Asian genetic ancestry, and a substantial subset also show evidence of African genetic ancestry. Similarly, all individuals self-reporting Native American race/ethnicity show some degree of European/West Asian genetic ancestry. Latinos of different nationalities exhibit varying proportions of European, African, and Native American ancestries (Figure S7B). Those reporting Mexican and Central-South American nationality have genetic ancestry that is primarily European and Native American, with slight but varying amounts of African ancestry. Those reporting Cuban nationality have primarily European genetic ancestry, with a small number of individuals having primarily African genetic ancestry. Those reporting Puerto Rican nationality show some Native American genetic ancestry but are primarily admixed between European and African genetic ancestry. Individual ancestral admixture proportions were determined for these subjects and are provided in Table S5.

The LAT array also included a variety of individuals who self-reported more than one race/ethnicity. These individuals are represented in Figure S7C. Individuals who reported European as well as Latino race/ethnicity tend to have slightly more European genetic ancestry than those who did not; similarly, a number of individuals who reported African/African American race/ethnicity in addition to Latino race/ethnicity have substantial African genetic ancestry; however, many such individuals also appear to have the same modest degree of African genetic ancestry as those who reported only a Latino race/ethnicity. Those who reported Native American race/ethnicity in addition to Latino race/ethnicity also appear to have slightly increased Native American genetic ancestry. Those who reported European and Native American race/ethnicity appear to be similar to those who solely reported Native American race/ethnicity: all have European/West Asian genetic ancestry, and while some show evidence of Native American genetic ancestry, European/West Asian is the sole or primary genetic ancestry for the majority. For those with 100% European genetic ancestry and who self-reported only European and Native American race/ethnicity (n = 2155), we also calculated European PCs. Finally, those who reported East Asian in addition to Latino race/ethnicity generally have evidence of East Asian genetic ancestry (as observed in Figure S7C by proximity to the HGDP East Asians), ranging from 25% to 50% and 100%.

Global PCA for GERA subjects

Figure S8 shows that the first PC mainly separates Europeans from East Asians (and Native Americans), and PC2 separates Africans from all the other groups. PC3 seems to separate Native Americans from the other groups, and PC4 also separates Native Americans from the other groups but also shows some separation among the Europeans. PC5 separates the different East Asian groups (mainly north vs. south) and also East Asians from Oceania, and PC6 separates Central–South Asians from the other groups. PC7 again separates the various East Asian regions, and PC8 separates the European groups (mainly north to south). PC9 and PC10 separate East Asians from Oceania, but also the Russians (not labeled) are separated from the other European groups.

Relationship between self-reported race/ethnicity and genetic ancestry

Table S6 displays the full relationship of self-reported race/ethnicity to genetic ancestry for the six continental genetic ancestries of Europe/West Asia, Africa, East Asia, Pacific Islands, South Asia, and the Americas. A genetic continental ancestry was assigned to an individual if her/his estimate for that ancestry was at least 5%. A total of 91,502 individuals (93.9%) reported a single race/ethnicity, 5475 individuals reported two races/ethnicities (5.9%), and 512 individuals (0.5%) reported three races/ethnicities (Table 2). As expected, all individuals who self-identified as European/West Asian had evidence of European/West Asian genetic ancestry. The next largest genetic ancestry component in this group was South Asian (4.3%), primarily attributable to individuals of West Asian ethnicity. Because there is a continuum of genetic ancestry from Europe to West Asia, Central-South Asia to East Asia, genetic overlap exists for individuals whose national origins are geographically between these divisions (Li et al. 2008). Nearly 1% of this group also had evidence of Native American genetic ancestry, while a smaller fraction had evidence of African or East Asian genetic ancestry (0.3% and 0.4%, respectively). Nearly all individuals (99.7%) self-reporting African/African American race/ethnicity had evidence of African genetic ancestry; 91% also had evidence of European genetic ancestry, consistent with broad European admixture among African Americans. Native American and East Asian genetic ancestry occurred in this group at a similar low level as observed in the Europeans/West Asians (1.3% and 0.5%, respectively). Among self-reported East Asians, all had evidence of East Asian genetic ancestry; a sizable proportion (21.7%) also had evidence of Pacific Islander genetic ancestry, but this likely represents difficulty in differentiating East Asian and Pacific Islander genetic ancestry. A modest subgroup (3.4%) had evidence of European/West Asian genetic ancestry (the majority are self-reported Filipinos), while small proportions had evidence of African or Native American genetic ancestry (0.1% and 0.5%, respectively). Among the Latinos, nearly all had evidence of European/West Asian genetic ancestry; a similar high proportion (94.2%) had evidence of Native American genetic ancestry, and an additional 27.7% had evidence of African ancestry. A substantial number of self-reported Pacific Islanders had evidence of East Asian genetic ancestry (91.3%) in addition to Pacific Islander genetic ancestry (66.3%); these results are again likely due to close genetic similarity between East Asians and Pacific Islanders. There is also evidence of substantial European/West Asian and South Asian genetic ancestry in this group (57.6% and 26.1%, respectively). The former reflects a high rate of European admixture among some self-reported Pacific Islander groups, while the latter likely reflects Fijians of Indian origin. Most self-reported South Asians have evidence of South Asian genetic ancestry; a substantial proportion also has evidence of European or East Asian genetic ancestry, likely due to inability to cleanly separate South Asian genetic ancestry from West Asian or East Asian (Li et al. 2008). Among those reporting Native American race/ethnicity, 14.4% have evidence of Native American genetic ancestry, and all have evidence of European/West Asian genetic ancestry.
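The 5% assignment rule and the per-group proportions it produces can be sketched as follows. This is an illustration only: the data layout (one dictionary of estimated ancestry fractions per individual), the function names, and the toy values are assumptions, not the authors' pipeline.

```python
from typing import Dict, List, Set

# Genetic ancestry abbreviations as used in Table 2.
ANCESTRIES = ("EW", "AF", "EA", "NA", "PI", "SA")


def assign_ancestries(estimates: Dict[str, float],
                      threshold: float = 0.05) -> Set[str]:
    """Ancestries whose estimated fraction is at least the threshold."""
    return {a for a in ANCESTRIES if estimates.get(a, 0.0) >= threshold}


def proportion_with(group: List[Dict[str, float]], ancestry: str,
                    threshold: float = 0.05) -> float:
    """Fraction of a self-report group assigned the given genetic ancestry."""
    hits = sum(1 for est in group if ancestry in assign_ancestries(est, threshold))
    return hits / len(group)


# Toy self-report group of three individuals (hypothetical estimates).
toy_group = [
    {"EW": 0.80, "AF": 0.20},
    {"EW": 0.03, "AF": 0.97},  # EW falls below the 5% cutoff
    {"AF": 1.00},
]
assigned = assign_ancestries(toy_group[1])
prop_ew = proportion_with(toy_group, "EW")
prop_af = proportion_with(toy_group, "AF")
```

Each cell of Table 2 is a proportion of this kind, computed within one self-reported race/ethnicity group.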

For those with missing or mis-scanned self-reported race/ethnicity and whose race/ethnicity was derived from KP administrative databases (Table 3 and Table S7), results align closely with those in Table 2. For individuals self-reporting two or three races/ethnicities, the correspondence between the self-report and genetic ancestry is generally quite high (Table 2).

We also observed a decrease in average age and increasing proportion of females with the number of different race/ethnicity/ancestry groups reported (Table 2). While the different minority groups, and in particular the self-reported East Asians and Latinos, are younger on average, those reporting mixed race/ethnicity are even younger. These patterns likely reflect increasing exogamy over time. As expected, these patterns are also reflected in the genetic PC scores, where, for example, the proportion of mixed East Asian/European genetic ancestry increases with decreasing age. The excess of females among those reporting mixed race/ethnicity appears to reflect a reporting preference, as there was no significant difference in the proportion of individuals with mixed genetic ancestry by sex.

A more in-depth examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups is provided in Table S8.

Relatives

We were able to clearly identify first-degree relative (parent–child and full sib) and MZ twin pairs and categorized them based on self-reported race/ethnicity (Figure S9 and Table S9). We also observed thousands of likely second- and third-degree relatives (Figure S9); however, the figure also indicates substantial overlap between these groups based on kinship estimates.
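To make the relationship categories concrete, the sketch below classifies a pair from a kinship coefficient and the proportion of SNPs sharing no allele (IBS0). The kinship cutoffs are the conventional KING-style powers of two associated with Manichaikul et al. (2010); the text does not state which cutoffs were actually applied, so they are an assumption here, and the IBS0 cutoff separating parent–child from full-sib pairs is a purely hypothetical value.

```python
# Conventional KING-style kinship cutoffs (assumed, not taken from the text).
MZ_CUT = 2 ** -1.5      # about 0.354
FIRST_CUT = 2 ** -2.5   # about 0.177
SECOND_CUT = 2 ** -3.5  # about 0.0884
THIRD_CUT = 2 ** -4.5   # about 0.0442

# Hypothetical cutoff: parent-child pairs share at least one allele at
# essentially every SNP, so their IBS0 proportion is close to zero.
IBS0_PARENT_CHILD_MAX = 0.005


def classify_pair(kinship: float, ibs0: float) -> str:
    """Map a (kinship, IBS0) estimate to a coarse relationship category."""
    if kinship > MZ_CUT:
        return "MZ twin/duplicate"
    if kinship > FIRST_CUT:
        return "parent-child" if ibs0 <= IBS0_PARENT_CHILD_MAX else "full sib"
    if kinship > SECOND_CUT:
        return "second-degree"
    if kinship > THIRD_CUT:
        return "third-degree"
    return "unrelated"


labels = [
    classify_pair(0.49, 0.000),
    classify_pair(0.25, 0.000),
    classify_pair(0.25, 0.020),
    classify_pair(0.12, 0.050),
    classify_pair(0.05, 0.060),
    classify_pair(0.01, 0.080),
]
```

Estimates close to a cutoff can fall on either side of it, which is one way the overlap between second- and third-degree groups noted above can arise.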

The 34 MZ pairs, who are perfectly concordant for genetic ancestry, are also perfectly concordant for self-reported race/ethnicity. Sib pairs are also (virtually) identical for genetic ancestry. We identified a total of 2018 sib pairs, 1936 (96%) of whom are concordant for self-reported race/ethnicity. Among the 82 discordant pairs, the majority (n = 66) involve pairs where one self-reports Native American or Latino race/ethnicity (solely or in combination with European/West Asian race/ethnicity), while the other reports only European/West Asian race/ethnicity (Table S10); in most of these cases the genetic ancestry is solely European/West Asian, although in some there is also evidence of Native American genetic ancestry. A modest number of pairs are also discordant in their reports of East Asian race/ethnicity, and again for most of these the genetic ancestry is solely European/West Asian. Similarly, a few pairs with mixed genetic ancestry including African are discordant in terms of self-reporting of African American race/ethnicity.

We identified 3741 parent–child pairs, of which 3478 (93%) were concordant for self-identified race/ethnicity. The lower rate of concordance compared to the sib pairs is not surprising, as parent and child reports may differ if the child's parents are of different race/ethnicity. In 116 of 263 discordant pairs (Table S11), the child has genetic ancestry that her/his parent does not (Native American in 69 cases, East Asian in 41 cases, and African in 11 cases), and this difference is reflected in the self-report, where the child is self-reporting a race/ethnicity that the parent is not. By contrast, in only 9 cases did the parent have a genetic ancestry that the child did not, and in 8 of these 9 cases the parent has a low level of Native American ancestry (but >5%), whereas the child is below our 5% threshold. Interestingly, in 5 of these cases the parent self-reports as Latino race/ethnicity but the child does not, whereas the opposite is true in 3 of the 8 cases. In an additional 114 cases, the genetic information for parent and child matches, but the self-reports for race/ethnicity are different. The largest subgroup (49) of these cases reflects differences in the reporting of Native American or Latino race/ethnicity, and in 47 of these there is no evidence of Native American genetic ancestry in the parent or child; it is approximately equally split as to whether the parent or child reports the Native American race/ethnicity. Among 53 cases where parent and child are discordant for self-report of Latino race/ethnicity, in 23 it is the child who self-reports Latino race/ethnicity, whereas the parent does not. There are 11 cases of discordance for self-report of East Asian race/ethnicity, and in nearly all of them there is no evidence of continental East Asian genetic ancestry. In slightly more than half of these cases, it is the parent who self-reports East Asian race/ethnicity.
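The concordance percentages quoted for sib and parent–child pairs follow directly from the reported counts. The sketch below treats each self-report as a set of endorsed categories and counts a pair as concordant only when the two sets are identical; that exact-match definition and the toy pairs are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, Tuple

Report = FrozenSet[str]


def concordance_rate(pairs: Iterable[Tuple[Report, Report]]) -> float:
    """Fraction of relative pairs whose self-reported category sets match."""
    pair_list = list(pairs)
    if not pair_list:
        raise ValueError("at least one pair is required")
    same = sum(1 for a, b in pair_list if a == b)
    return same / len(pair_list)


# Toy pairs (hypothetical): the second pair is discordant.
toy_pairs = [
    (frozenset({"EW"}), frozenset({"EW"})),
    (frozenset({"EW"}), frozenset({"EW", "NA"})),
    (frozenset({"EA"}), frozenset({"EA"})),
    (frozenset({"LT"}), frozenset({"LT"})),
]
toy_rate = concordance_rate(toy_pairs)

# Arithmetic check of the counts reported in the text.
sib_pct = round(100 * 1936 / 2018)
parent_child_pct = round(100 * 3478 / 3741)
sib_discordant = 2018 - 1936
parent_child_discordant = 3741 - 3478
```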

Discussion

The RPGEH GERA cohort provides an excellent opportunity to characterize a large, representative northern California population from the perspectives of self-reported race/ethnicity/nationality and genetic ancestry. Overall, the cohort is 80.8% non-Hispanic white and 19.2% minority and includes a broad spectrum of races/ethnicities/nationalities. The results of our PC analyses to characterize genetic structure within each of the major race/ethnicity groups are largely consistent with prior reports.

For the non-Hispanic white individuals, we see a broad spectrum of genetic ancestry, ranging from northern Europe to southern Europe and the Middle East. Within that large group, with the exception of Ashkenazi Jews, we see little evidence of distinct clusters. This is consistent with considerable exogamy within this group. By comparison, we do see structure in the East Asian population, correlated with nationality, reflecting continuing endogamy for these nationalities and also recent immigration. On the other hand, we did observe a substantial number of individuals who are admixed between East Asian and European ancestry, reflecting 10% of all those reporting East Asian race/ethnicity. The majority of these reflected individuals with one East Asian and one European parent or one East Asian and three European grandparents. In addition, we noted that for self-reported Filipinos, a substantial proportion have modest levels of European genetic ancestry, reflecting older admixture.

Table 2 Proportion of individuals with genetic ancestry from each of six ancestral populations, by self-reported race/ethnicity

                               Genetic ancestry
Race/ethnicity      n     EW     AF     EA     NA     PI     SA     Female  Mean age (SE)
One group       91502                                               0.59    62.92 (0.04)
EW              76401   1.000  0.003  0.004  0.009  0.000  0.043    0.59    63.71 (0.05)
AA               2679   0.910  0.997  0.005  0.013  0.000  0.021    0.57    61.28 (0.25)
EA               6389   0.034  0.001  1.000  0.005  0.217  0.008    0.58    58.51 (0.18)
NA                674   0.999  0.022  0.022  0.144  0.000  0.037    0.55    64.34 (0.51)
LT               4807   0.999  0.277  0.008  0.942  0.000  0.024    0.58    57.92 (0.21)
PI                 92   0.576  0.000  0.913  0.000  0.663  0.261    0.48    56.89 (1.49)
SA                460   0.307  0.007  0.109  0.004  0.050  0.961    0.39    54.29 (0.67)
Two groups       5476                                               0.67    57.37 (0.19)
EWAA              123   1.000  0.976  0.024  0.033  0.000  0.081    0.67    52.76 (1.50)
EWEA              572   0.960  0.005  0.942  0.014  0.063  0.080    0.68    49.13 (0.65)
EWNA             2548   1.000  0.008  0.007  0.096  0.000  0.024    0.68    61.63 (0.26)
EWLT             1564   1.000  0.071  0.010  0.710  0.000  0.068    0.68    54.05 (0.38)
EWPI               48   1.000  0.000  0.813  0.042  0.625  0.021    0.79    59.64 (2.00)
EWSA               44   0.955  0.000  0.068  0.045  0.000  0.682    0.66    53.55 (2.26)
AAEA               29   0.655  0.931  0.828  0.034  0.000  0.069    0.56    50.06 (2.46)
AANA               99   1.000  0.99   0.000  0.051  0.000  0.030    0.68    59.67 (1.30)
AALT              114   0.991  0.596  0.018  0.754  0.000  0.026    0.34    55.09 (1.42)
AASA               13   0.167  0.167  0.167  0.083  0.250  0.833    0.17    54.33 (4.23)
EALT               95   0.789  0.042  0.926  0.642  0.063  0.000    0.67    56.07 (1.44)
EAPI               40   0.275  0.025  1.000  0.000  0.475  0.025    0.60    56.93 (2.37)
EASA               17   0.059  0.000  0.765  0.000  0.059  0.235    0.47    62.06 (2.88)
NALT              129   1.000  0.140  0.031  0.953  0.000  0.047    0.68    58.22 (1.19)
LTPI               12   1.000  0.417  0.250  0.917  0.000  0.167    0.64    53.93 (3.95)
LTSA               10   0.600  0.000  0.400  0.600  0.200  0.500    0.63    61.50 (4.56)
Three groups      512                                               0.70    53.52 (0.75)
EWAANA            115   0.991  0.991  0.000  0.043  0.000  0.017    0.74    59.71 (1.58)
EWAALT             23   0.957  0.696  0.043  0.522  0.000  0.087    0.52    50.09 (4.11)
EWEANA             32   0.969  0.000  0.875  0.250  0.000  0.125    0.69    46.06 (3.11)
EWEALT             48   1.000  0.041  0.857  0.490  0.000  0.061    0.72    45.98 (2.49)
EWEAPI             35   0.943  0.000  1.000  0.029  0.486  0.000    0.67    51.92 (3.02)
EWNALT            198   1.000  0.066  0.000  0.803  0.000  0.086    0.70    53.83 (0.99)

Only those with at most three self-reported race/ethnicities and three genetic ancestries are included; race/ethnicity categories with at least 10 members are shown. For individuals self-reporting two or three races/ethnicities, the correspondence between self-report and genetic ancestry is generally quite high. For example, for those reporting European/West Asian and East Asian race/ethnicity, 96% and 94% have evidence of European/West Asian and East Asian genetic ancestry, respectively; for those reporting African/African American and East Asian race/ethnicity, 93.1% and 82.8% have evidence of African and East Asian genetic ancestry, while 65.5% have evidence of European/West Asian genetic ancestry. Among those reporting European/West Asian and Native American race/ethnicity, 9.6% have evidence of Native American genetic ancestry; for those reporting African/African American and Native American race/ethnicity, 5.1% have evidence of Native American genetic ancestry. EW, European/West Asian; AA, African/African American/Afro-Caribbean; EA, East Asian; NA, Native American/Alaska Native; LT, Latino; PI, Pacific Islander; SA, South Asian. Genetic ancestry abbreviations are the same except for AF, which represents sub-Saharan African ancestry.

As expected, most self-reported African Americans show some degree of European genetic ancestry, with an overall average of 26%. Among individuals self-reporting as African American and East Asian, all showed evidence of genetic ancestry from three continents: Africa, Europe/West Asia, and East Asia.

Latinos are the most complex from a genetic perspective, as they can possess genetic ancestry from essentially any of the major continents. Most of the Latinos in our study derive from Mexico and Central/South America, with smaller proportions from Puerto Rico and Cuba. These individuals have varying proportions of Native American, European, and African genetic ancestry. We also found evidence of East Asian genetic ancestry in some individuals, but these were primarily individuals who self-reported both East Asian and Latino nationalities.

Of note, 17% of the cohort had evidence of genetic ancestry from more than one continent. However, this does not mean that all or even most of these individuals represent recent continental admixture. As has been true in other analyses (Li et al. 2008), genetic similarity between West Asians and South Asians (and, to some degree, South Asians and East Asians) did not allow for a clear distinction among these genetic ancestries. As such, while some individuals were estimated to have South Asian genetic ancestry, this more likely reflects the difficulty in demarking West Asian vs. South Asian genetic ancestry. A similar situation holds for Pacific Islanders and East Asians, where we and others have shown strong genetic similarity for some Pacific Islander groups with East Asians. Also, some individuals may have reported more than a single race/ethnicity that may reflect recent country of origin in addition to, or rather than, more distant ancestry, with Indo-Fijians as one example.

If we include only individuals with genetic admixture from nonadjacent continents, the proportion with continental admixture is 12%. However, we also note that this fraction depends on our cutoff of 5% for defining genetic admixture, as well as some imprecision in the admixture estimation. Of course, a lower threshold would increase the proportion of the cohort that is considered to be genetically admixed, while a higher threshold would do the opposite.
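The dependence of the admixed fraction on the cutoff can be sketched with a small threshold sweep. The set of "adjacent" ancestry pairs below is an assumption inferred from the similarities discussed in the text (West Asian with South Asian, South Asian with East Asian, East Asian with Pacific Islander); the function names and toy cohort are hypothetical.

```python
from itertools import combinations
from typing import Dict, List

# Assumed adjacency between genetic ancestries (not defined in the text).
ADJACENT = {
    frozenset(("EW", "SA")),
    frozenset(("SA", "EA")),
    frozenset(("EA", "PI")),
}


def is_admixed_nonadjacent(estimates: Dict[str, float],
                           threshold: float = 0.05) -> bool:
    """True if at least two ancestries at or above the threshold form a
    pair that is not in the assumed adjacency set."""
    present = sorted(a for a, v in estimates.items() if v >= threshold)
    return any(frozenset(pair) not in ADJACENT
               for pair in combinations(present, 2))


def fraction_admixed(cohort: List[Dict[str, float]], threshold: float) -> float:
    """Share of the cohort counted as admixed at a given threshold."""
    return sum(is_admixed_nonadjacent(e, threshold) for e in cohort) / len(cohort)


toy_cohort = [
    {"EW": 1.00},
    {"EW": 0.90, "SA": 0.10},   # adjacent pair: not counted
    {"EW": 0.50, "EA": 0.50},   # nonadjacent pair: counted
    {"EW": 0.96, "NA": 0.04},   # counted only below a 4% cutoff
]
sweep = {t: fraction_admixed(toy_cohort, t) for t in (0.03, 0.05, 0.20)}
```

In this toy cohort the admixed share is 0.5 at a 3% cutoff and 0.25 at 5% or 20%, mirroring the direction of the effect described above.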

As expected in a large cohort such as this, we were easily able to identify a substantial number of close relatives, specifically 34 identical twins, 2018 full sibs, and 3741 parent–child pairs. We also had clear evidence of a large number of likely second- and third-degree relatives, but these kinship groups did not separate clearly from each other. More refined methods may be able to provide more precise kinship estimates.

A major goal was to examine the relationship between self-reported race/ethnicity and genetic ancestry. By and large, there was very high correspondence between the two, allowing for the broad range of genetic ancestry that exists among African Americans and Latinos. We were also able to compare the self-report data of identical twins, parent–child pairs, and sib pairs. All MZ twin pairs were concordant, as were most of the sib pairs. However, we did note that for some sib pairs the self-report data differed. For the majority of these, the discordance related to reporting of Native American or Latino race/ethnicity.

The results obtained here are important for the study of complex genetic disease in this large population-based cohort through association studies, admixture analysis, and admixture mapping, and in particular for investigating observed ethnic variation in diseases and traits. As described previously (Risch et al. 2002; Tang et al. 2006), the strong correspondence also observed here between the social categories of race/ethnicity and genetic ancestry makes dissection of racial/ethnic differences challenging. The patterns that we observed reflect historical and recent mating practices and their impact on genetic variation. On a global level, geography continues to create strong local endogamy, which is also reflected among the recent U.S. migrant populations. However, the increasing frequency of interracial individuals that we observed in this cohort (a reflection of increasing exogamy) will enhance both the complexity of such analyses and the opportunities to investigate the genetic and environmental contributors to racial/ethnic differences. While the advent of myriad genetic markers can provide accurate estimates of individuals' genetic ancestry, the social aspects of race/ethnicity may be more challenging to characterize. For example, in our study, considering the various combinations of 7 race/ethnicity categories that an individual could endorse, we observed 50 different combinations, and this does not include individuals who endorsed more than 3 (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.

Table 3 Proportion of individuals with genetic ancestry from each of six ancestral populations, by race/ethnicity as determined by KP administrative databases

                               Genetic ancestry
Race/ethnicity        n     EW     AF     EA     NA     PI     SA
White              4575   1      0.007  0.009  0.017  0.001  0.030
African American    102   0.941  0.990  0.000  0.020  0.000  0.020
Asian               311   0.106  0.003  0.952  0.006  0.167  0.074
Latino              255   0.988  0.192  0.043  0.816  0.000  0.035
Other/uncertain      84   0.929  0.131  0.357  0.167  0.071  0.083

Abbreviations are the same as in Table 2.

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment, and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans' genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch, N., E. Burchard, E. Ziv, and H. Tang, 2002 Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 3: comment2007.

Seldin, M. F., R. Shigeta, P. Villoslada, C. Selmi, J. Tuomilehto et al., 2006 European population substructure: clustering of northern and southern populations. PLoS Genet. 2: e143.

Sokal, R. R., N. L. Oden, and C. Wilson, 1991 Genetic evidence for the spread of agriculture in Europe by demic diffusion. Nature 351: 143–145.

Su, B., J. Xiao, P. Underhill, R. Deka, W. Zhang et al., 1999 Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65: 1718–1724.

Tang H J Peng P Wang and N J Risch 2005 Estimation ofindividual admixture analytical and study design considera-tions Genet Epidemiol 28 289ndash301

Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray

GENETICS
Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest-southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also observed.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A–F) Individuals distributed according to continental differentiation. Admixed individuals are also observed.

Figure S9 Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
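The x-axis quantity in Figure S9 can be illustrated with a minimal sketch. It assumes genotypes coded as alternate-allele counts (0, 1, 2) with `None` for missing calls; the function name and toy genotype vectors are hypothetical, and this is not KING's kinship estimator, only the zero-IBS proportion. A parent-child pair shares at least one allele at every SNP, so its zero-IBS proportion is expected to be near 0 apart from genotyping error.

```python
def ibs0_proportion(genotypes_a, genotypes_b):
    """Fraction of jointly called SNPs at which two individuals share no allele.

    Genotypes are alternate-allele counts (0, 1, 2); None marks a missing call.
    Zero IBS occurs only for opposite homozygotes, e.g. TT vs. CC (0 vs. 2).
    """
    compared = 0
    zero_ibs = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue  # skip SNPs not called in both individuals
        compared += 1
        if abs(a - b) == 2:
            zero_ibs += 1
    if compared == 0:
        raise ValueError("no SNPs called in both individuals")
    return zero_ibs / compared


# Hypothetical toy genotypes (illustrative only).
parent = [0, 1, 2, 2, 0, 1, None, 2]
child = [1, 1, 1, 2, 0, 0, 2, 1]
unrelated = [2, 1, 0, 2, 2, 1, 0, 0]

print(ibs0_proportion(parent, child))
print(ibs0_proportion(parent, unrelated))
```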

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry and a smaller number (123) to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self-reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned: a black mark on the glass scanning plate overlaid the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error affected only individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. On review of the race/ethnicity self-report of these individuals, the large majority self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those with self-report race/ethnicity data that were not missing or adjudicated due to scanning errors. Because of the time required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals had been assigned to arrays based on raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was used only in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
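The pairwise pruning rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: r² is approximated by the squared Pearson correlation of allele counts, the later SNP of a correlated pair is the one dropped, and the function names and toy data are hypothetical. The text does not specify which member of a pair was removed or which software was used.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation of two allele-count vectors (0/1/2)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (sxy * sxy) / (sxx * syy)


def ld_prune(snps, r2_max=0.5, window_bp=5_000_000):
    """Greedy pruning: keep a SNP unless r^2 > r2_max with a kept SNP
    on the same chromosome within window_bp.

    snps: list of (snp_id, chrom, pos_bp, genotypes) tuples.
    Returns the ids of retained SNPs in genomic order.
    """
    kept = []
    for snp in sorted(snps, key=lambda s: (s[1], s[2])):
        _, chrom, pos, geno = snp
        in_ld = False
        # kept is in genomic order, so scan backward and stop at the window edge
        for other in reversed(kept):
            if other[1] != chrom or pos - other[2] > window_bp:
                break
            if r_squared(geno, other[3]) > r2_max:
                in_ld = True
                break
        if not in_ld:
            kept.append(snp)
    return [s[0] for s in kept]


# Hypothetical toy data: rs_b duplicates rs_a nearby; rs_d duplicates rs_a
# but lies more than 5 Mb away, so it is retained.
toy_snps = [
    ("rs_a", "2", 1_000_000, [0, 1, 2, 0, 1, 2]),
    ("rs_b", "2", 1_200_000, [0, 1, 2, 0, 1, 2]),
    ("rs_c", "2", 1_500_000, [2, 0, 1, 1, 0, 2]),
    ("rs_d", "2", 9_000_000, [0, 1, 2, 0, 1, 2]),
]
print(ld_prune(toy_snps))
```

Removal of SNPs in the inversion regions (8p23 and 17q21) would be a separate positional filter applied before or after this step.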

For the initial individual admixture analyses, a set of 43,988 SNPs that were common between the HGDP dataset and our set of 144,799 high-quality SNPs was used. A set of 38,301 SNPs (used for the global PCA run), remaining after LD pruning and removal of SNPs located in regions with structural variations, was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority the second continent was South Asia. Hence this likely does not reflect recent admixture, but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent; a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups, and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, for nearly all European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: for the large majority with a single continental genetic ancestry, that ancestry is European/West Asian. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.

Table S1 Magnitude of correlation of PC loadings for three supersets

PC    Set1-Set2 correlation    Set1-Set3 correlation    Set2-Set3 correlation
 1           0.981                    0.979                    0.980
 2           0.876                    0.853                    0.856
 3           0.850                    0.816                    0.809
 4           0.766                    0.776                    0.752
 5           0.658                    0.654                    0.639
 6           0.605                    0.604                    0.592
 7           0.001                    0.001                    0.210
 8           0.043                    0.179                    0.045
 9           0.312                    0.068                    0.089
10           0.007                    0.067                    0.201

Table S2 Self-reported race/ethnicity vs. genotyping array. Race/ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

                      Genotyping array
Race/ethnicity     EUR     EAS     AFR     LAT
EW               76403       0       0       1
AA                   0       0    2682       0
EA                   0    6394       0       0
NA                   0       0       0     674
LT                   0       0       0    4855
PI                   0      92       0       0
SA                 461       0       0       0
EW/AA                0       0     123       0
EW/EA               38     538       0       0
EW/NA               91       0       0    2463
EW/LT               12       0       0    1561
EW/PI                8      45       0       0
EW/SA               44       0       0       0
AA/EA                0       0      32       0
AA/NA                0       0     100       0
AA/LT                0       0       3     113
AA/PI                0       0       4       0
AA/SA                0       0      12       0
EA/NA                0       7       0       0
EA/LT                0       2       0     108
EA/PI                0      40       0       0
EA/SA                2      15       0       0
NA/LT                0       0       0     130
NA/PI                0       2       0       0
LT/PI                0       0       0      14
LT/SA                0       0       0       8
PI/SA                0      10       0       0
EW/AA/EA             0       0       6       0
EW/AA/NA             0       0     115       0
EW/AA/LT             0       0       0      23
EW/AA/PI             0       0       1       0
EW/AA/SA             0       0       2       0
EW/EA/NA             2       0       0      30
EW/EA/LT             0       1       0      52
EW/EA/PI             0      36       0       0
EW/NA/LT             0       0       0     199
EW/NA/PI             0       0       0       2
EW/LT/PI             2       0       0       6
EW/LT/SA             0       0       0       4
EW/PI/SA             0       1       0       0
AA/EA/NA             0       0       8       0
AA/EA/LT             0       0       0       8
AA/EA/PI             0       0       1       0
AA/EA/SA             0       0       3       0
AA/NA/LT             0       0       1       1
AA/LT/SA             0       0       1       6
EA/NA/LT             0       0       0       9
EA/LT/PI             0       0       0       2
EA/PI/SA             0       1       0       0
NA/LT/PI             0       0       0       3
Total            77063    7184    3094   10272

Table S3 Race/ethnicity derived from KP administrative databases vs. genotyping array used, for those with missing or mis-scanned self-report data

                        Genotyping array
Race/ethnicity       EUR     EAS     AFR     LAT
White               2115     267    2140      55
African American       0       0      99       3
Hispanic               5       0       0     255
Asian                  5     183     123       1
Other/Uncertain        6      11      43      24
Total               2131     461    2405     338

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                               Number of SNPs after pruning
European/West Asian and Ashkenazi     38050
European only                         37967
South Asian                           38823
East Asian                            38077
Pacific Islander                      38833
African American                      39849
Latino                                38365
All GERA                              38301

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

                           Ancestral admixture proportion (%)
Nationality                African        European       Native American
Mexican                    2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central-South American     5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto Rican               12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                      12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                   4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8

Table S6 Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations are as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Rows: race/ethnicity. Columns (genetic ancestry): EW, AF, EA, NA, SA, EW/AF, EW/EA, EW/NA, EW/SA, AF/EA, AF/SA, EA/NA, EA/PI, EA/SA, PI/SA, EW/AF/EA, EW/AF/NA, EW/AF/SA, EW/EA/NA, EW/EA/PI, EW/EA/SA, EW/NA/SA, EW/PI/SA, AF/EA/PI, AF/EA/SA, AF/PI/SA, EA/PI/SA, All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114

AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8

EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489

Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race/ethnicity         EW       AF       EA       SA    EW/AF    EA/NA    EA/PI    EW/EA    EW/NA    EW/SA    EA/SA    SA/PI EW/AF/EA EW/AF/NA EW/AF/SA EW/EA/NA EW/EA/PI EW/EA/SA EW/SA/NA EW/SA/PI EA/SA/PI    Total
White                4302        0        0        0       25        0        0       33       68      131        0        0        0        3        2        3        2        3        2        1        0     4575
Afr Am                  0        6        0        0       92        0        0        0        1        0        0        0        0        1        2        0        0        0        0        0        0      102
Asian                   4        0      218        8        0        1       41       18        0        2        3        1        1        0        0        1        4        3        0        0        6      311
Latino                 37        0        3        0        3        0        0        2      151        0        0        0        1       44        1        5        0        0        8        0        0      255
Other/uncertain        34        0        3        0        6        0        0       15        8        2        1        0        2        2        1        3        4        0        1        0        2       84

Table S8 Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

No. of race/    Race/        One continent    Two continents   Three continents   All      Continental distribution
ethnicities     ethnicity    N        %       N        %       N        %         N        1 continent   2 continents                          3 continents
One             EW           71992    94.2    4288     5.6     121      0.2       76401    100% EW       75% EW/SA; 15% EW/NA
                AA           230      8.6     2362     88.2    87       3.2       2679     100% AF       99% AF/EW
                LT           235      4.9     3135     65.2    1437     29.9      4807     99% EW        99% EW/NA                             90% EW/NA/AF
                EA           4837     75.7    1419     22.2    133      2.1       6389     100% EA       89% EA/PI; 8% EA/EW
                PI           7        7.6     40       43.5    45       48.9      92                     48% PI/EW; 33% PI/EA                  71% PI/EW/EA
                SA           272      59.1    175      38.1    13       2.8       460      94% SA        75% SA/EW; 14% SA/EA
                NA           555      82.3    87       12.9    32       4.8       674      100% EW       77% EW/NA
                All          78128    85.4    11506    12.6    1868     2.0       91502
                % of total                                                        93.9
Two             All          2791     51.0    2140     39.1    545      10.0      5476     97% EW        60% EW/NA; 22% EW/EA; 13% EW/AF       29% EW/NA/AF; 23% EW/SA/NA; 17% EW/EA/NA
                % of total                                                        5.6
Three           All          48       9.4     345      67.5    118      23.1      511      90% EW        45% EW/NA; 36% EW/AF; 19% EW/EA       31% EW/EA/NA; 20% EW/AF/NA; 16% EW/SA/NA
                % of total                                                        0.5
All             All          80967    83.1    13991    14.4    2531     2.6       97489

Table S9 First-degree relatives organized by self-reported race/ethnicity

MZ pairs
                    White   African American   Latino   Asian   Other/Uncertain
White                  28                  0        0       0                 0
African American        0                  0        0       0                 0
Latino                  0                  0        2       0                 0
Asian                   0                  0        0       4                 0
Other/Uncertain         0                  0        0       0                 0

Parent (column) – offspring (row)
                    White   African American   Latino   Asian   Other/Uncertain
White                3044                  1       23       6                21
African American        8                 54        0       1                 1
Latino                122                  6      175       3                 0
Asian                  35                  2        0     205                 1
Other/Uncertain        32                  0        1       0                 0

Full sibs (upper triangle only)
                    White   African American   Latino   Asian   Other/Uncertain
White                1531                  2       33       5                30
African American                          45        7       0                 0
Latino                                            155       2                 2
Asian                                                     205                 1
Other/Uncertain                                                               0
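As a worked check, the concordance rates quoted in the Abstract (93% for parent-child pairs, 96% for full-sib pairs) can be recomputed from the Table S9 counts. The sketch below is illustrative only; the variable and function names are ours, and the full-sib matrix is entered as an upper triangle because sib pairs are unordered.

```python
# Parent (column) - offspring (row) counts from Table S9; order of groups:
# White, African American, Latino, Asian, Other/Uncertain.
parent_child = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

# Full-sib counts from Table S9 (upper triangle; lower triangle left at 0).
full_sibs = [
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
]


def concordance(table):
    """Return (concordant pairs, total pairs, concordance rate)."""
    total = sum(sum(row) for row in table)
    same = sum(table[i][i] for i in range(len(table)))
    return same, total, same / total


pc_same, pc_total, pc_rate = concordance(parent_child)
fs_same, fs_total, fs_rate = concordance(full_sibs)
print(pc_same, pc_total, round(100 * pc_rate, 1))
print(fs_same, fs_total, round(100 * fs_rate, 1))
```

The totals recovered this way, 3741 parent-child pairs and 2018 full-sib pairs, match the pair counts given in the Abstract.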

Table S10 Race/ethnicity and genetic ancestry for sib pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Sib 1 race/ethnicity   Sib 2 race/ethnicity   Genetic ancestry   Number

Both sibs self-report
EW                     NA                     EW                 26
EW                     LT                     EW                 6
EW                     LT                     EW/NA              1
EW                     AA                     EW/AF              1
EW                     EW/LT                  EW                 15
EW                     EW/LT                  EW/NA              6
EW                     EW/LT                  EW/AF/NA           1
EW                     EW/AA/LT               EW/AF              1
EW                     EW/EA                  EW                 3
EW                     EW/EA                  EW/EA              1
EW                     EW/SA                  EW                 1
EW/NA                  NA                     EW                 1
EW/NA                  NA                     EW/NA              1
EW/NA                  NA                     EW/AF              1
EW/NA                  EW/NA/LT               EW                 1
EW/LT                  NA                     EW                 1
EW/LT                  AA/LT/SA               EW/AF              1
EW/LT                  EW/AA/NA/LT            EW/AF              1
EW/EA                  LT                     EW/NA              1
EW/AA/NA/LT/SA         LT                     EW/NA              1
EW/LT/PI               PI                     EW/EA/PI           1
LT                     AA/LT                  EW/NA              3
LT                     AA/LT                  EW/AF/NA           1

One sib self-report, one sib EHR
EW                     Latino                 EW                 1
EW/LT                  Other/Uncertain        EW                 1
LT                     White                  EW                 1
LT/EA                  Other/Uncertain        EW/EA/PI           1
EA/PI                  Other/Uncertain        EA                 1

Both sibs EHR
White                  Latino                 EW/NA              1

28 SI Y Banda et al

Table S11 RaceEthnicity and Genetic Ancestry for Parent‐Child pairs Discordant for RaceEthnicity Abbreviations as in Table S6

Parent RaceEthnicity Child RaceEthnicity

Parent Genetic Ancestry

Child Genetic Ancestry

Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1

Y Banda et al 29 SI

EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1

the first 10 PCs reflects African vs. European genetic ancestry, while the second PC denotes East Asian and/or Native American genetic ancestry. This is consistent with the array assignments, whereby individuals reporting both African/African American race/ethnicity and East Asian or Native American race/ethnicity were assigned to the AFR array. Individuals who self-reported African ancestry only were also subject to onomastics to determine likely countries of origin. We were able to identify subjects of Ethiopian, Eritrean, and Kenyan nationality. For the Kenyans, Figure S6A indicates a location consistent with 100% African genetic ancestry. By contrast, the Ethiopian/Eritrean subjects occupy an intermediate position on the PC1 axis, suggesting proximity to European/Middle Eastern populations. Also of note is the modest variation in their PC1 scores. This is likely due to ancient admixture with Middle Eastern populations (Hodgson et al. 2014). These results confirm that Ethiopians have a unique genetic structure among African populations.

Individuals self-reporting mixed African and East Asian race/ethnicity generally reflect that admixture from the genetic perspective as well (Figure S6B); however, a number of individuals who reported only African American ethnicity also appear to have similar levels of East Asian admixture (Figure S6A). Those reporting both African American and European ethnicity generally occupy a position on the PC1 axis closer to Europeans than those who do not (Figure S6B).

The mean African ancestry proportion in this sample is 73.6 ± 17.4%. There is a reasonably high level of variation in the African genetic ancestry proportion, ranging from 10.6 to 100%.

Structure of individuals run on the LAT array

Latinos may have ancestry deriving from multiple continents, including Europe, Africa, Asia, and the Americas (Bonilla et al. 2004; Tang et al. 2006, 2007). Figure S7A provides the PCA results for all those who endorsed Latino or Native American as their sole race/ethnicity. PC1 represents the European vs. Native American axis of genetic variation, and PC2 represents the African axis of genetic variation. PC1 and PC2 account for 70.95 and 11.57% of the total variance of the first 10 PCs, respectively. Nearly all Latinos show evidence of European/West Asian genetic ancestry, and a substantial subset also show evidence of African genetic ancestry. Similarly, all individuals self-reporting Native American race/ethnicity show some degree of European/West Asian genetic ancestry. Latinos of different nationalities exhibit varying proportions of European, African, and Native American ancestries (Figure S7B). Those reporting Mexican and Central-South American nationality have genetic ancestry that is primarily European and Native American, with slight but varying amounts of African ancestry. Those reporting Cuban nationality have primarily European genetic ancestry, with a small number of individuals having primarily African genetic ancestry. Those reporting Puerto Rican nationality show some Native American genetic ancestry but are primarily admixed between European and African genetic ancestry. Individual ancestral admixture proportions were determined for these subjects and are provided in Table S5.
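Note that the PC1 and PC2 percentages above are expressed relative to the variance captured by the first 10 PCs, not relative to the total genotypic variance. A minimal sketch of that normalization follows; the function name and the eigenvalues are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def variance_share(eigenvalues, top=10):
    """Percentage of the variance of the leading `top` PCs explained by each of them."""
    leading = sorted(eigenvalues, reverse=True)[:top]
    total = sum(leading)
    return [100.0 * ev / total for ev in leading]


# Hypothetical eigenvalues; the first 10 sum to 100, later ones are ignored.
eigs = [50.0, 20.0, 10.0, 5.0, 5.0, 4.0, 3.0, 1.5, 1.0, 0.5, 0.3, 0.2]
shares = variance_share(eigs)
print(shares[0], shares[1])
```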

The LAT array also included a variety of individuals who self-reported more than one race/ethnicity. These individuals are represented in Figure S7C. Individuals who reported European as well as Latino race/ethnicity tend to have slightly more European genetic ancestry than those who did not; similarly, a number of individuals who reported African/African American race/ethnicity in addition to Latino race/ethnicity have substantial African genetic ancestry; however, many such individuals also appear to have the same modest degree of African genetic ancestry as those who reported only a Latino race/ethnicity. Those who reported Native American race/ethnicity in addition to Latino race/ethnicity also appear to have slightly increased Native American genetic ancestry. Those who reported European and Native American race/ethnicity appear to be similar to those who solely reported Native American race/ethnicity: all have European/West Asian genetic ancestry, and while some show evidence of Native American genetic ancestry, European/West Asian is the sole or primary genetic ancestry for the majority. For those with 100% European genetic ancestry and who self-reported only European and Native American race/ethnicity (n = 2155), we also calculated European PCs. Finally, those who reported East Asian in addition to Latino race/ethnicity generally have evidence of East Asian genetic ancestry (as observed in Figure S7C by proximity to the HGDP East Asians), ranging from 25 to 50 and 100%.

Global PCA for GERA subjects

Figure S8 shows that the first PC mainly separates Europeans from East Asians (and Native Americans), and PC2 separates Africans from all the other groups. PC3 seems to separate Native Americans from the other groups, and PC4 also separates Native Americans from the other groups but also shows some separation among the Europeans. PC5 separates the different East Asian groups (mainly north vs. south) and also East Asians from Oceania, and PC6 separates Central–South Asians from the other groups. PC7 again separates the various East Asian regions, and PC8 separates the European groups (mainly north to south). PC9 and PC10 separate East Asians from Oceania, but also the Russians (not labeled) are separated from the other European groups.

Relationship between self-reported race/ethnicity and genetic ancestry

Table S6 displays the full relationship of self-reported race/ethnicity to genetic ancestry for the six continental genetic ancestries of Europe/West Asia, Africa, East Asia, Pacific Islands, South Asia, and the Americas. A genetic continental ancestry was assigned to an individual if her/his estimate for that ancestry was at least 5%. A total of 91,502 individuals (93.9%) reported a single race/ethnicity, 5475 individuals reported two races/ethnicities (5.9%), and 512 individuals (0.5%) reported three races/ethnicities (Table 2). As expected, all individuals who self-identified as European/West Asian had evidence of European/West Asian genetic ancestry. The next largest genetic ancestry component in this group was South Asian (4.3%), primarily attributable to individuals of West Asian ethnicity. Because there is a continuum of genetic ancestry from Europe to West Asia, Central-South Asia to East Asia, genetic overlap exists for individuals whose national origins are geographically between these divisions (Li et al. 2008). Nearly 1% of this group also had evidence of Native American genetic ancestry, while a smaller fraction had evidence of African or East Asian genetic ancestry (0.3 and 0.4%, respectively). Nearly all individuals (99.7%) self-reporting African/African American race/ethnicity had evidence of African genetic ancestry; 91% also had evidence of European genetic ancestry, consistent with broad European admixture among African Americans. Native American and East Asian genetic ancestry occurred in this group at a similar low level as observed in the Europeans/West Asians (1.3 and 0.5%, respectively). Among self-reported East Asians, all had evidence of East Asian genetic ancestry; a sizable proportion (21.7%) also had evidence of Pacific Islander genetic ancestry, but this likely represents difficulty in differentiating East Asian and Pacific Islander genetic ancestry. A modest subgroup (3.4%) had evidence of European/West Asian genetic ancestry (the majority are self-reported Filipinos), while small proportions had evidence of African or Native American genetic ancestry (0.1 and 0.5%, respectively). Among the Latinos, nearly all had evidence of European/West Asian genetic ancestry; a similar high proportion (94.2%) had evidence of Native American genetic ancestry, and an additional 27.7% had evidence of African ancestry. A substantial number of self-reported Pacific Islanders had evidence of East Asian genetic ancestry (91.3%) in addition to Pacific Islander genetic ancestry (66.3%); these results are again likely due to close genetic similarity between East Asians and Pacific Islanders. There is also evidence of substantial European/West Asian and South Asian genetic ancestry in this group (57.6 and 26.1%, respectively). The former reflects a high rate of European admixture among some self-reported Pacific Islander groups, while the latter likely reflects Fijians of Indian origin. Most self-reported South Asians have evidence of South Asian genetic ancestry; a substantial proportion also has evidence of European or East Asian genetic ancestry, likely due to inability to cleanly separate South Asian genetic ancestry from West Asian or East Asian (Li et al. 2008). Among those reporting Native American race/ethnicity, 14.4% have evidence of Native American genetic ancestry and all have evidence of European/West Asian genetic ancestry.

For those with missing or mis-scanned self-reported race/ethnicity and whose race/ethnicity was derived from KP administrative databases (Table 3 and Table S7), results align closely with those in Table 2. For individuals self-reporting two or three races/ethnicities, the correspondence between the self-report and genetic ancestry is generally quite high (Table 2).

We also observed a decrease in average age and an increasing proportion of females with the number of different race/ethnicity/ancestry groups reported (Table 2). While the different minority groups, and in particular the self-reported East Asians and Latinos, are younger on average, those reporting mixed race/ethnicity are even younger. These patterns likely reflect increasing exogamy over time. As expected, these patterns are also reflected in the genetic PC scores, where, for example, the proportion of mixed East Asian–European genetic ancestry increases with decreasing age. The excess of females among those reporting mixed race/ethnicity appears to reflect a reporting preference, as there was no significant difference in the proportion of individuals with mixed genetic ancestry by sex.

A more in-depth examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups is provided in Table S8.

Relatives

We were able to clearly identify first-degree relative (parent–child and full sib) and MZ twin pairs and categorized them based on self-reported race/ethnicity (Figure S9 and Table S9). We also observed thousands of likely second- and third-degree relatives (Figure S9); however, the figure also indicates substantial overlap between these groups based on kinship estimates.

The 34 MZ pairs, who are perfectly concordant for genetic ancestry, are also perfectly concordant for self-reported race/ethnicity. Sib pairs are also (virtually) identical for genetic ancestry. We identified a total of 2018 sib pairs, 1936 (96%) of whom are concordant for self-reported race/ethnicity. Among the 82 discordant pairs, the majority (n = 66) involve pairs where one self-reports Native American or Latino race/ethnicity (solely or in combination with European/West Asian race/ethnicity), while the other reports only European/West Asian race/ethnicity (Table S10); in most of these cases the genetic ancestry is solely European/West Asian, although in some there is also evidence of Native American genetic ancestry. A modest number of pairs are also discordant in their reports of East Asian race/ethnicity, and again, for most of these the genetic ancestry is solely European/West Asian. Similarly, a few pairs with mixed genetic ancestry, including African, are discordant in terms of self-reporting of African American race/ethnicity.

We identified 3741 parent–child pairs, of which 3478 (93%) were concordant for self-identified race/ethnicity. The lower rate of concordance compared to the sib pairs is not surprising, as parent and child reports may differ if the child's parents are of different race/ethnicity. In 116 of 263 discordant pairs (Table S11), the child has genetic ancestry that her/his parent does not (Native American in 69 cases, East Asian in 41 cases, and African in 11 cases), and this difference is reflected in the self-report, where the child is self-reporting a race/ethnicity that the parent is not. By contrast, in only 9 cases did the parent have a genetic ancestry that the child did not, and in 8 of these 9 cases the parent has a low level of Native American ancestry (but at or above 5%), whereas the child is below our 5% threshold. Interestingly, in 5 of these cases the parent self-reports as Latino race/ethnicity but the child does not, whereas the opposite is true in 3 of the 8 cases. In an additional 114 cases, the genetic information for parent and child matches, but the self-reports for race/ethnicity are different. The largest subgroup (49) of these cases reflects differences in the reporting of Native American or Latino race/ethnicity, and in 47 of these there is no evidence of Native American genetic ancestry in the parent or child; it is approximately equally split as to whether the parent or child reports the Native American race/ethnicity. Among 53 cases where parent and child are discordant for self-report of Latino race/ethnicity, in 23 it is the child who self-reports Latino race/ethnicity, whereas the parent does not. There are 11 cases of discordance for self-report of East Asian race/ethnicity, and in nearly all of them there is no evidence of continental East Asian genetic ancestry. In slightly more than half of these cases, it is the parent who self-reports East Asian race/ethnicity.
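The concordance percentages quoted for the relative pairs follow directly from the reported counts. The short check below is only a worked illustration of that arithmetic; the function name is ours and is not part of the study's pipeline.

```python
def concordance_rate(concordant: int, total: int) -> float:
    """Fraction of relative pairs whose self-reported race/ethnicity agrees."""
    if total <= 0:
        raise ValueError("total must be positive")
    return concordant / total


# Counts taken from the text above.
sib_total, sib_concordant = 2018, 1936
pc_total, pc_concordant = 3741, 3478

sib_rate = concordance_rate(sib_concordant, sib_total)
pc_rate = concordance_rate(pc_concordant, pc_total)

# Discordant pairs are the remainder in each group.
sib_discordant = sib_total - sib_concordant
pc_discordant = pc_total - pc_concordant

print(f"full sibs:    {sib_rate:.1%} concordant, {sib_discordant} discordant pairs")
print(f"parent-child: {pc_rate:.1%} concordant, {pc_discordant} discordant pairs")
```

The remainders reproduce the 82 discordant sib pairs and 263 discordant parent–child pairs discussed in the text.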

Discussion

The RPGEH GERA cohort provides an excellent opportunity to characterize a large, representative northern California population from the perspectives of self-reported race/ethnicity/nationality and genetic ancestry. Overall, the cohort is 80.8% non-Hispanic white and 19.2% minority and includes a broad spectrum of races/ethnicities/nationalities. The results of our PC analyses to characterize genetic structure within each of the major race/ethnicity groups are largely consistent with prior reports.

For the non-Hispanic white individuals, we see a broad spectrum of genetic ancestry ranging from northern Europe to southern Europe and the Middle East. Within that large group, with the exception of Ashkenazi Jews, we see little evidence of distinct clusters. This is consistent with considerable exogamy within this group. By comparison, we do see structure in the East Asian population correlated with nationality, reflecting continuing endogamy for these nationalities and also recent immigration. On the other hand, we did observe a substantial number of individuals who are admixed between East Asian and European ancestry, reflecting 10% of all those reporting East Asian race/ethnicity. The majority of these reflected individuals with one East Asian and one European parent or one East Asian and three European grandparents. In addition, we noted that for self-reported Filipinos, a substantial proportion have modest levels of European genetic ancestry, reflecting older admixture.

Table 2 Proportion of individuals with genetic ancestry from each of six ancestral populations, by self-reported race/ethnicity

                    Genetic ancestry
Race/ethnicity     n     EW     AF     EA     NA     PI     SA  Female  Mean age (SE)
One group      91502                                              0.59  62.92 (0.04)
EW             76401  1.000  0.003  0.004  0.009  0.000  0.043    0.59  63.71 (0.05)
AA              2679  0.910  0.997  0.005  0.013  0.000  0.021    0.57  61.28 (0.25)
EA              6389  0.034  0.001  1.000  0.005  0.217  0.008    0.58  58.51 (0.18)
NA               674  0.999  0.022  0.022  0.144  0.000  0.037    0.55  64.34 (0.51)
LT              4807  0.999  0.277  0.008  0.942  0.000  0.024    0.58  57.92 (0.21)
PI                92  0.576  0.000  0.913  0.000  0.663  0.261    0.48  56.89 (1.49)
SA               460  0.307  0.007  0.109  0.004  0.050  0.961    0.39  54.29 (0.67)

Two groups      5476                                              0.67  57.37 (0.19)
EW/AA            123  1.000  0.976  0.024  0.033  0.000  0.081    0.67  52.76 (1.50)
EW/EA            572  0.960  0.005  0.942  0.014  0.063  0.080    0.68  49.13 (0.65)
EW/NA           2548  1.000  0.008  0.007  0.096  0.000  0.024    0.68  61.63 (0.26)
EW/LT           1564  1.000  0.071  0.010  0.710  0.000  0.068    0.68  54.05 (0.38)
EW/PI             48  1.000  0.000  0.813  0.042  0.625  0.021    0.79  59.64 (2.00)
EW/SA             44  0.955  0.000  0.068  0.045  0.000  0.682    0.66  53.55 (2.26)
AA/EA             29  0.655  0.931  0.828  0.034  0.000  0.069    0.56  50.06 (2.46)
AA/NA             99  1.000   0.99  0.000  0.051  0.000  0.030    0.68  59.67 (1.30)
AA/LT            114  0.991  0.596  0.018  0.754  0.000  0.026    0.34  55.09 (1.42)
AA/SA             13  0.167  0.167  0.167  0.083  0.250  0.833    0.17  54.33 (4.23)
EA/LT             95  0.789  0.042  0.926  0.642  0.063  0.000    0.67  56.07 (1.44)
EA/PI             40  0.275  0.025  1.000  0.000  0.475  0.025    0.60  56.93 (2.37)
EA/SA             17  0.059  0.000  0.765  0.000  0.059  0.235    0.47  62.06 (2.88)
NA/LT            129  1.000  0.140  0.031  0.953  0.000  0.047    0.68  58.22 (1.19)
LT/PI             12  1.000  0.417  0.250  0.917  0.000  0.167    0.64  53.93 (3.95)
LT/SA             10  0.600  0.000  0.400  0.600  0.200  0.500    0.63  61.50 (4.56)

Three groups     512                                              0.70  53.52 (0.75)
EW/AA/NA         115  0.991  0.991  0.000  0.043  0.000  0.017    0.74  59.71 (1.58)
EW/AA/LT          23  0.957  0.696  0.043  0.522  0.000  0.087    0.52  50.09 (4.11)
EW/EA/NA          32  0.969  0.000  0.875  0.250  0.000  0.125    0.69  46.06 (3.11)
EW/EA/LT          48  1.000  0.041  0.857  0.490  0.000  0.061    0.72  45.98 (2.49)
EW/EA/PI          35  0.943  0.000  1.000  0.029  0.486  0.000    0.67  51.92 (3.02)
EW/NA/LT         198  1.000  0.066  0.000  0.803  0.000  0.086    0.70  53.83 (0.99)

Only those with at most three self-reported races/ethnicities and three genetic ancestries are included; race/ethnicity categories with at least 10 members are shown. For individuals self-reporting two or three races/ethnicities, the correspondence between self-report and genetic ancestry is generally quite high. For example, for those reporting European/West Asian and East Asian race/ethnicity, 96% and 94% have evidence of European/West Asian and East Asian genetic ancestry, respectively; for those reporting African/African American and East Asian race/ethnicity, 93.1% and 82.8% have evidence of African and East Asian genetic ancestry, while 65.5% have evidence of European/West Asian genetic ancestry. Among those reporting European/West Asian and Native American race/ethnicity, 9.6% have evidence of Native American genetic ancestry; for those reporting African/African American and Native American race/ethnicity, 5.1% have evidence of Native American genetic ancestry. EW, European/West Asian; AA, African/African American/Afro-Caribbean; EA, East Asian; NA, Native American/Alaska Native; LT, Latino; PI, Pacific Islander; SA, South Asian. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African ancestry.

As expected, most self-reported African Americans show some degree of European genetic ancestry, with an overall average of 26%. Among individuals self-reporting as African American and East Asian, all showed evidence of genetic ancestry from three continents: Africa, Europe/West Asia, and East Asia.

Latinos are the most complex from a genetic perspective, as they can possess genetic ancestry from essentially any of the major continents. Most of the Latinos in our study derive from Mexico and Central/South America, with smaller proportions from Puerto Rico and Cuba. These individuals have varying proportions of Native American, European, and African genetic ancestry. We also found evidence of East Asian genetic ancestry in some individuals, but these were primarily individuals who self-reported both East Asian and Latino nationalities.

Of note, 17% of the cohort had evidence of genetic ancestry from more than one continent. However, this does not mean that all or even most of these individuals represent recent continental admixture. As has been true in other analyses (Li et al. 2008), genetic similarity between West Asians and South Asians (and to some degree South Asians and East Asians) did not allow for a clear distinction among these genetic ancestries. As such, while some individuals were estimated to have South Asian genetic ancestry, this more likely reflects the difficulty in demarking West Asian vs. South Asian genetic ancestry. A similar situation holds for Pacific Islanders and East Asians, where we and others have shown strong genetic similarity for some Pacific Islander groups with East Asians. Also, some individuals may have reported more than a single race/ethnicity that may reflect recent country of origin in addition to, or rather than, more distant ancestry, with Indo-Fijians as one example.

If we include only individuals with genetic admixture from nonadjacent continents, the proportion with continental admixture is 12%. However, we also note that this fraction depends on our cutoff of 5% for defining genetic admixture, as well as some imprecision in the admixture estimation. Of course, a lower threshold would increase the proportion of the cohort that is considered to be genetically admixed, while a higher threshold would do the opposite.
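The dependence on the cutoff can be made concrete with a small sketch. The code below assumes that each individual's admixture estimate is available as a mapping from ancestry label to estimated proportion, and that an ancestry counts as present when its proportion is at or above the threshold; both the data layout and the inclusive comparison are our assumptions for illustration, not a description of the study's software. The example individual is hypothetical.

```python
from typing import Dict, List

ADMIXTURE_THRESHOLD = 0.05  # the 5% cutoff mentioned in the text


def ancestries_present(proportions: Dict[str, float],
                       threshold: float = ADMIXTURE_THRESHOLD) -> List[str]:
    """Return the ancestry labels whose estimated proportion meets the threshold."""
    return sorted(label for label, p in proportions.items() if p >= threshold)


def is_admixed(proportions: Dict[str, float],
               threshold: float = ADMIXTURE_THRESHOLD) -> bool:
    """An individual is called admixed here if two or more ancestries are present."""
    return len(ancestries_present(proportions, threshold)) >= 2


# Hypothetical individual (not study data): mostly EW with minor NA and AF components.
example = {"EW": 0.90, "NA": 0.06, "AF": 0.04}

for cutoff in (0.03, 0.05, 0.10):
    print(cutoff, ancestries_present(example, cutoff), is_admixed(example, cutoff))
```

With this hypothetical individual, lowering the cutoff to 3% adds a third ancestry, while raising it to 10% leaves a single ancestry, which is the direction of effect described above.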

As expected in a large cohort such as this, we were easily able to identify a substantial number of close relatives, specifically 34 identical twin pairs, 2018 full-sib pairs, and 3741 parent–child pairs. We also had clear evidence of a large number of likely second- and third-degree relatives, but these kinship groups did not separate clearly from each other. More refined methods may be able to provide more precise kinship estimates.

A major goal was to examine the relationship between self-reported race/ethnicity and genetic ancestry. By and large, there was very high correspondence between the two, allowing for the broad range of genetic ancestry that exists among African Americans and Latinos. We were also able to compare the self-report data of identical twins, parent–child pairs, and sib pairs. All MZ twin pairs were concordant, as were most of the sib pairs. However, we did note that for some sib pairs, the self-report data differed. For the majority of these, the discordance related to reporting of Native American or Latino race/ethnicity.

The results obtained here are important for the study of complex genetic disease in this large population-based cohort through association studies, admixture analysis, and admixture mapping, and in particular for investigating observed ethnic variation in diseases and traits. As described previously (Risch et al. 2002; Tang et al. 2006), the strong correspondence also observed here between the social categories of race/ethnicity and genetic ancestry makes dissection of racial/ethnic differences challenging. The patterns that we observed reflect historical and recent mating practices and their impact on genetic variation. On a global level, geography continues to create strong local endogamy, which is also reflected among the recent U.S. migrant populations. However, the increasing frequency of interracial individuals that we observed in this cohort, a reflection of increasing exogamy, will enhance both the complexity of such analyses and the opportunities to investigate the genetic and environmental contributors to racial/ethnic differences. While the advent of myriad genetic markers can provide accurate estimates of individuals' genetic ancestry, the social aspects of race/ethnicity may be more challenging to characterize. For example, in our study, considering the various combinations of 7 race/ethnicity categories that an individual could endorse, we observed 50 different combinations, and this does not include individuals who endorsed more than 3 (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.

Table 3 Proportion of individuals with genetic ancestry from each of six ancestral populations, by race/ethnicity as determined by KP administrative databases

                      Genetic ancestry
Race/ethnicity         n     EW     AF     EA     NA     PI     SA
White               4575      1  0.007  0.009  0.017  0.001  0.030
African American     102  0.941  0.990  0.000  0.020  0.000  0.020
Asian                311  0.106  0.003  0.952  0.006  0.167  0.074
Latino               255  0.988  0.192  0.043  0.816  0.000  0.035
Other/uncertain       84  0.929  0.131  0.357  0.167  0.071  0.083

Abbreviations are the same as in Table 2.

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment, and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans' genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch, N., E. Burchard, E. Ziv, and H. Tang, 2002 Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 3: comment2007.

Seldin, M. F., R. Shigeta, P. Villoslada, C. Selmi, J. Tuomilehto et al., 2006 European population substructure: clustering of northern and southern populations. PLoS Genet. 2: e143.

Sokal, R. R., N. L. Oden, and C. Wilson, 1991 Genetic evidence for the spread of agriculture in Europe by demic diffusion. Nature 351: 143–145.

Su, B., J. Xiao, P. Underhill, R. Deka, W. Zhang et al., 1999 Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65: 1718–1724.

Tang, H., J. Peng, P. Wang, and N. J. Risch, 2005 Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28: 289–301.

Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray


GENETICS
Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest-southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A–F) Individuals distributed according to continental differentiation. Admixed individuals are also observed.

Figure S9 Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
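The zero-IBS proportion on the x-axis of Figure S9 can be illustrated with a minimal sketch. The function below is not KING's implementation; it assumes genotypes are coded as 0, 1, or 2 copies of a reference allele, with `None` for a missing call, and simply counts SNPs at which two individuals are opposite homozygotes (e.g., TT and CC). The function name and toy genotypes are hypothetical.

```python
from typing import List, Optional

Genotype = Optional[int]  # 0, 1, or 2 allele copies; None marks a missing call


def ibs0_proportion(g1: List[Genotype], g2: List[Genotype]) -> float:
    """Proportion of jointly called SNPs at which two individuals share no allele."""
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must cover the same SNPs")
    compared = 0
    zero_ibs = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs missing in either individual
        compared += 1
        if abs(a - b) == 2:  # opposite homozygotes, e.g. TT vs. CC
            zero_ibs += 1
    if compared == 0:
        raise ValueError("no SNPs called in both individuals")
    return zero_ibs / compared


# Toy genotypes (hypothetical): a parent and child always share at least one allele.
parent = [0, 1, 2, 1]
child = [1, 1, 1, 2]
print(ibs0_proportion(parent, child))
```

Because a parent transmits one allele at every locus, parent–child pairs are expected to have a zero-IBS proportion near zero apart from genotyping error, as are MZ twins, whereas full sibs show a small positive proportion; this is why the statistic helps separate first-degree relationship types.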


File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self-reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned, due to a black mark on the glass scanning plate overlaying the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error only affected individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry that were run on the EAS array had no scanning errors and hence were not adjudicated. Reviewing the race/ethnicity self-report of these individuals, the large majority self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those with self-report race/ethnicity data that was not missing or adjudicated due to scanning errors. Because of the time element required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey prior to data cleaning. After data cleaning, we noticed that some individuals were assigned to arrays based on the raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was only used in the detection of the scanning problems for some individuals by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
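The pairwise pruning rule described above can be sketched as a simple greedy pass along one chromosome. This is an illustrative sketch, not the software used in the study: it assumes genotypes are coded as 0/1/2 allele counts, that SNP positions are sorted in base pairs, and that the later SNP of a high-LD pair is the one removed (the text does not say which member was dropped). The function names and toy data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two genotype vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a monomorphic SNP is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def ld_prune(snps: List[Sequence[float]], positions: Sequence[int],
             r2_max: float = 0.5, window_bp: int = 5_000_000) -> List[int]:
    """Return indices of SNPs kept after greedy pairwise LD pruning.

    snps[i] holds genotypes of all samples at SNP i; positions must be sorted.
    """
    removed = [False] * len(snps)
    for i in range(len(snps)):
        if removed[i]:
            continue
        for j in range(i + 1, len(snps)):
            if positions[j] - positions[i] > window_bp:
                break  # later SNPs are even farther away
            if removed[j]:
                continue
            if pearson_r(snps[i], snps[j]) ** 2 > r2_max:
                removed[j] = True  # assumption: drop the later member of the pair
    return [i for i, gone in enumerate(removed) if not gone]


# Toy data (hypothetical): SNP 1 duplicates SNP 0; SNP 2 is only weakly correlated.
snp0 = [0, 1, 2, 1, 0, 2]
snp1 = [0, 1, 2, 1, 0, 2]
snp2 = [2, 0, 1, 1, 0, 2]
print(ld_prune([snp0, snp1, snp2], [100, 200, 300]))
```

In the toy example the duplicated SNP is removed when it lies within the 5-Mb window, but it would be retained if the two SNPs were farther apart than the window, since such pairs are never compared.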

For the initial individual admixture analyses, a set of 43,988 SNPs, which were common between the HGDP dataset and our set of 144,799 high-quality SNPs, was used. A set of 38,301 SNPs (used for the global PCA run) remaining after LD pruning and removal of SNPs located in regions with structural variations was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority the second continent was South Asia. Hence, this likely does not reflect recent admixture but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent; a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian, or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for nearly all of those with genetic ancestry from more than one continent, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: for the large majority of those with a single continental genetic ancestry, that ancestry is European/West Asian. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.
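The percentages discussed in this section can be checked against the counts in Table S8. The following is a minimal sketch, not the authors' code: it copies the one-, two-, and three-continent counts for three single-group categories from Table S8 and recomputes the row percentages. The names `CONTINENT_COUNTS` and `row_percentages` are hypothetical, and other rows of the table may differ in the last digit because of rounding in the published values.

```python
# Counts copied from Table S8: number of individuals whose estimated genetic
# ancestry spans one, two, or three continents, by self-reported group.
CONTINENT_COUNTS = {
    "EW": (71992, 4288, 121),
    "AA": (230, 2362, 87),
    "LT": (235, 3135, 1437),
}


def row_percentages(counts):
    """Return each count as a percentage of the row total, to one decimal."""
    total = sum(counts)
    return tuple(round(100 * c / total, 1) for c in counts)


percentages = {group: row_percentages(c) for group, c in CONTINENT_COUNTS.items()}

for group, pct in percentages.items():
    print(group, pct)
```

Under these inputs the recomputed values agree with the figures quoted in the text for the two-continent category (5.6%, 88.2%, and 65.2%).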


Table S1 Magnitude of correlation of PC loadings for three supersets

PC    Set1-Set2 correlation    Set1-Set3 correlation    Set2-Set3 correlation
1     0.981                    0.979                    0.980
2     0.876                    0.853                    0.856
3     0.850                    0.816                    0.809
4     0.766                    0.776                    0.752
5     0.658                    0.654                    0.639
6     0.605                    0.604                    0.592
7     0.001                    0.001                    0.210
8     0.043                    0.179                    0.045
9     0.312                    0.068                    0.089
10    0.007                    0.067                    0.201
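A comparison like the one in Table S1 can be sketched as follows. This is an illustrative assumption about the computation, not the authors' pipeline: for each PC, the SNP loading vectors from two subject sets are compared by Pearson correlation, and the absolute value is taken because the sign of a principal component is arbitrary. The function names and the toy loading vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def loading_agreement(loadings_a, loadings_b):
    """Magnitude of correlation between matching PCs of two runs.

    Each argument is a list of loading vectors, one per PC, over the same SNPs.
    """
    return [abs(pearson(a, b)) for a, b in zip(loadings_a, loadings_b)]


# Toy loadings for two PCs over four SNPs (hypothetical values).
set1 = [[0.5, -0.2, 0.1, 0.7], [0.1, 0.4, -0.3, 0.2]]
set2 = [[-0.5, 0.2, -0.1, -0.7], [0.3, 0.1, 0.2, -0.4]]

agreement = loading_agreement(set1, set2)
print(agreement)
```

In the toy data the first PC of `set2` is the sign-flipped first PC of `set1`, so its agreement is 1 even though the raw correlation is negative. Values near zero, as for PCs 7 to 10 in Table S1, indicate that a component is not reproduced across sets.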


Table S2 Self-reported Race/Ethnicity versus Genotyping Array. Race/Ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

                Genotyping Array
Race/Ethnicity      EUR    EAS    AFR    LAT
EW                76403      0      0      1
AA                    0      0   2682      0
EA                    0   6394      0      0
NA                    0      0      0    674
LT                    0      0      0   4855
PI                    0     92      0      0
SA                  461      0      0      0
EW/AA                 0      0    123      0
EW/EA                38    538      0      0
EW/NA                91      0      0   2463
EW/LT                12      0      0   1561
EW/PI                 8     45      0      0
EW/SA                44      0      0      0
AA/EA                 0      0     32      0
AA/NA                 0      0    100      0
AA/LT                 0      0      3    113
AA/PI                 0      0      4      0
AA/SA                 0      0     12      0
EA/NA                 0      7      0      0
EA/LT                 0      2      0    108
EA/PI                 0     40      0      0
EA/SA                 2     15      0      0
NA/LT                 0      0      0    130
NA/PI                 0      2      0      0
LT/PI                 0      0      0     14
LT/SA                 0      0      0      8
PI/SA                 0     10      0      0
EW/AA/EA              0      0      6      0
EW/AA/NA              0      0    115      0
EW/AA/LT              0      0      0     23
EW/AA/PI              0      0      1      0
EW/AA/SA              0      0      2      0
EW/EA/NA              2      0      0     30
EW/EA/LT              0      1      0     52
EW/EA/PI              0     36      0      0
EW/NA/LT              0      0      0    199
EW/NA/PI              0      0      0      2
EW/LT/PI              2      0      0      6
EW/LT/SA              0      0      0      4
EW/PI/SA              0      1      0      0
AA/EA/NA              0      0      8      0
AA/EA/LT              0      0      0      8
AA/EA/PI              0      0      1      0
AA/EA/SA              0      0      3      0
AA/NA/LT              0      0      1      1
AA/LT/SA              0      0      1      6
EA/NA/LT              0      0      0      9
EA/LT/PI              0      0      0      2
EA/PI/SA              0      1      0      0
NA/LT/PI              0      0      0      3
Total             77063   7184   3094  10272


Table S3 Race/Ethnicity derived from KP administrative databases versus genotyping array, used for those with missing or mis-scanned self-report data

                    Genotyping Array
Race/Ethnicity        EUR    EAS    AFR    LAT
White                2115    267   2140     55
African American        0      0     99      3
Hispanic                5      0      0    255
Asian                   5    183    123      1
Other/Uncertain         6     11     43     24
Total                2131    461   2405    338


Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                               Number of SNPs after pruning
European/West Asian and Ashkenazi     38050
European only                         37967
South Asian                           38823
East Asian                            38077
Pacific Islander                      38833
African American                      39849
Latino                                38365
All GERA                              38301


Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African         European        Native American
Mexican                    2.3 ± 5.0      67.0 ± 14.5     30.7 ± 13.5
Central-South American     5.5 ± 10.8     65.4 ± 17.9     29.1 ± 16.3
Puerto-Rican              12.3 ± 14.1     75.5 ± 14.8     12.4 ± 7.6
Cuban                     12.7 ± 20.5     79.7 ± 20.0      7.6 ± 8.1
LAT mean                   4.4 ± 7.9      66.5 ± 15.8     28.6 ± 14.8


Table S6 Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Race-Ethnicity   Genetic Ancestry

EW  AF  EA  NA  SA  EW/AF  EW/EA  EW/NA  EW/SA  AF/EA  AF/SA  EA/NA  EA/PI  EA/SA  PI/SA  EW/AF/EA  EW/AF/NA  EW/AF/SA  EW/EA/NA  EW/EA/PI  EW/EA/SA  EW/NA/SA  EW/PI/SA  AF/EA/PI  AF/EA/SA  AF/PI/SA  EA/PI/SA  All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489


Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race-Ethnicity      EW    AF    EA    SA  EW/AF  EA/NA  EA/PI  EW/EA  EW/NA  EW/SA  EA/SA  SA/PI  EW/AF/EA  EW/AF/NA  EW/AF/SA  EW/EA/NA  EW/EA/PI  EW/EA/SA  EW/SA/NA  EW/SA/PI  EA/SA/PI  Total
White             4302     0     0     0     25      0      0     33     68    131      0      0         0         3         2         3         2         3         2         1         0   4575
Afr Am               0     6     0     0     92      0      0      0      1      0      0      0         0         1         2         0         0         0         0         0         0    102
Asian                4     0   218     8      0      1     41     18      0      2      3      1         1         0         0         1         4         3         0         0         6    311
Latino              37     0     3     0      3      0      0      2    151      0      0      0         1        44         1         5         0         0         8         0         0    255
Other-uncertain     34     0     3     0      6      0      0     15      8      2      1      0         2         2         1         3         4         0         1         0         2     84


Table S8 Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Race-Ethnicity      One Continent     Two Continents    Three Continents    All
Reported  Group     Number (%)        Number (%)        Number (%)          Number    % of Total
One       EW        71992 (94.2)       4288 (5.6)         121 (0.2)         76401
One       AA          230 (8.6)        2362 (88.2)         87 (3.2)          2679
One       LT          235 (4.9)        3135 (65.2)       1437 (29.9)         4807
One       EA         4837 (75.7)       1419 (22.2)        133 (2.1)          6389
One       PI            7 (7.6)          40 (43.5)         45 (48.9)           92
One       SA          272 (59.1)        175 (38.1)         13 (2.8)           460
One       NA          555 (82.3)         87 (12.9)         32 (4.8)           674
One       All       78128 (85.4)      11506 (12.6)       1868 (2.0)         91502     93.9
Two       All        2791 (51.0)       2140 (39.1)        545 (10.0)         5476      5.6
Three     All          48 (9.4)         345 (67.5)        118 (23.1)          511      0.5
All       All       80967 (83.1)      13991 (14.4)       2531 (2.6)         97489

Continental Distribution
Reported  Group     1 Continent    2 Continents                           3 Continents
One       EW        100% EW        75% EW/SA, 15% EW/NA
One       AA        100% AF        99% AF/EW
One       LT        99% EW         99% EW/NA                              90% EW/NA/AF
One       EA        100% EA        89% EA/PI, 8% EA/EW
One       PI                       48% PI/EW, 33% PI/EA                   71% PI/EW/EA
One       SA        94% SA         75% SA/EW, 14% SA/EA
One       NA        100% EW        77% EW/NA
Two       All       97% EW         60% EW/NA, 22% EW/EA, 13% EW/AF        29% EW/NA/AF, 23% EW/SA/NA, 17% EW/EA/NA
Three     All       90% EW         45% EW/NA, 36% EW/AF, 19% EW/EA        31% EW/EA/NA, 20% EW/AF/NA, 16% EW/SA/NA


Table S9 First-degree relatives organized by self-reported race/ethnicity

MZ pair
                    White   African American   Latino   Asian   Other/Uncertain
White                  28                  0        0       0                 0
African American        0                  0        0       0                 0
Latino                  0                  0        2       0                 0
Asian                   0                  0        0       4                 0
Other/Uncertain         0                  0        0       0                 0

Parent (column) – Offspring (row)
                    White   African American   Latino   Asian   Other/Uncertain
White                3044                  1       23       6                21
African American        8                 54        0       1                 1
Latino                122                  6      175       3                 0
Asian                  35                  2        0     205                 1
Other/Uncertain        32                  0        1       0                 0

Full sibs
                    White   African American   Latino   Asian   Other/Uncertain
White                1531                  2       33       5                30
African American                          45        7       0                 0
Latino                                            155       2                 2
Asian                                                     205                 1
Other/Uncertain                                                               0


Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity   Sib 2 Race/Ethnicity   Genetic Ancestry   Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1


Table S11 Race/Ethnicity and Genetic Ancestry for Parent-Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity   Child Race/Ethnicity   Parent Genetic Ancestry   Child Genetic Ancestry   Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


individuals (0.5%) reported three races/ethnicities (Table 2). As expected, all individuals who self-identified as European/West Asian had evidence of European/West Asian genetic ancestry. The next largest genetic ancestry component in this group was South Asian (4.3%), primarily attributable to individuals of West Asian ethnicity. Because there is a continuum of genetic ancestry from Europe to West Asia, Central-South Asia to East Asia, genetic overlap exists for individuals whose national origins are geographically between these divisions (Li et al. 2008). Nearly 1% of this group also had evidence of Native American genetic ancestry, while a smaller fraction had evidence of African or East Asian genetic ancestry (0.3% and 0.4%, respectively). Nearly all individuals (99.7%) self-reporting African/African American race/ethnicity had evidence of African genetic ancestry; 91% also had evidence of European genetic ancestry, consistent with broad European admixture among African Americans. Native American and East Asian genetic ancestry occurred in this group at a similar low level as observed in the Europeans/West Asians (1.3% and 0.5%, respectively). Among self-reported East Asians, all had evidence of East Asian genetic ancestry; a sizable proportion (21.7%) also had evidence of Pacific Islander genetic ancestry, but this likely represents difficulty in differentiating East Asian and Pacific Islander genetic ancestry. A modest subgroup (3.4%) had evidence of European/West Asian genetic ancestry (the majority are self-reported Filipinos), while small proportions had evidence of African or Native American genetic ancestry (0.1% and 0.5%, respectively). Among the Latinos, nearly all had evidence of European/West Asian genetic ancestry, a similar high proportion (94.2%) had evidence of Native American genetic ancestry, and an additional 27.7% had evidence of African ancestry. A substantial number of self-reported Pacific Islanders had evidence of East Asian genetic ancestry (91.3%) in addition to Pacific Islander genetic ancestry (66.3%); these results are again likely due to close genetic similarity between East Asians and Pacific Islanders. There is also evidence of substantial European/West Asian and South Asian genetic ancestry in this group (57.6% and 26.1%, respectively). The former reflects a high rate of European admixture among some self-reported Pacific Islander groups, while the latter likely reflects Fijians of Indian origin. Most self-reported South Asians have evidence of South Asian genetic ancestry; a substantial proportion also has evidence of European or East Asian genetic ancestry, likely due to inability to cleanly separate South Asian genetic ancestry from West Asian or East Asian (Li et al. 2008). Among those reporting Native American race/ethnicity, 14.4% have evidence of Native American genetic ancestry, and all have evidence of European/West Asian genetic ancestry.

For those with missing or mis-scanned self-reported race/ethnicity and whose race/ethnicity was derived from KP administrative databases (Table 3 and Table S7), results align closely with those in Table 2. For individuals self-reporting two or three races/ethnicities, the correspondence between the self-report and genetic ancestry is generally quite high (Table 2).

We also observed a decrease in average age and increasing proportion of females with the number of different race/ethnicity/ancestry groups reported (Table 2). While the different minority groups, and in particular the self-reported East Asians and Latinos, are younger on average, those reporting mixed race/ethnicity are even younger. These patterns likely reflect increasing exogamy over time. As expected, these patterns are also reflected in the genetic PC scores, where, for example, the proportion of mixed East Asian/European genetic ancestry increases with decreasing age. The excess of females among those reporting mixed race/ethnicity appears to reflect a reporting preference, as there was no significant difference in the proportion of individuals with mixed genetic ancestry by sex.

A more in-depth examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups is provided in Table S8.

Relatives

We were able to clearly identify first-degree relative (parent–child and full sib) and MZ twin pairs and categorized them based on self-reported race/ethnicity (Figure S9 and Table S9). We also observed thousands of likely second- and third-degree relatives (Figure S9); however, the figure also indicates substantial overlap between these groups based on kinship estimates.

The 34 MZ pairs, who are perfectly concordant for genetic ancestry, are also perfectly concordant for self-reported race/ethnicity. Sib pairs are also (virtually) identical for genetic ancestry. We identified a total of 2018 sib pairs, 1936 (96%) of whom are concordant for self-reported race/ethnicity. Among the 82 discordant pairs, the majority (n = 66) involve pairs where one self-reports Native American or Latino race/ethnicity (solely or in combination with European/West Asian race/ethnicity), while the other reports only European/West Asian race/ethnicity (Table S10); in most of these cases, the genetic ancestry is solely European/West Asian, although in some there is also evidence of Native American genetic ancestry. A modest number of pairs are also discordant in their reports of East Asian race/ethnicity, and again, for most of these, the genetic ancestry is solely European/West Asian. Similarly, a few pairs with mixed genetic ancestry including African are discordant in terms of self-reporting of African American race/ethnicity.

We identified 3741 parent–child pairs, of which 3478 (93%) were concordant for self-identified race/ethnicity. The lower rate of concordance compared to the sib pairs is not surprising, as parent and child reports may differ if the child's parents are of different race/ethnicity. In 116 of 263 discordant pairs (Table S11), the child has genetic ancestry that her/his parent does not (Native American in 69 cases, East Asian in 41 cases, and African in 11 cases), and this difference is reflected in the self-report, where the child is self-reporting a race/ethnicity that the parent is not. By contrast, in only 9 cases did the parent have a genetic ancestry that the child did not, and in 8 of these 9 cases, the parent has a low level of Native American ancestry (but >5%), whereas the child is below our 5% threshold. Interestingly, in 5 of these cases, the parent self-reports as Latino race/ethnicity but the child does not, whereas the opposite is true in 3 of the 8 cases. In an additional 114 cases, the genetic information for parent and child matches, but the self-reports for race/ethnicity are different. The largest subgroup (49) of these cases reflects differences in the reporting of Native American or Latino race/ethnicity, and in 47 of these, there is no evidence of Native American genetic ancestry in the parent or child; it is approximately equally split as to whether the parent or child reports the Native American race/ethnicity. Among 53 cases where parent and child are discordant for self-report of Latino race/ethnicity, in 23 it is the child who self-reports Latino race/ethnicity, whereas the parent does not. There are 11 cases of discordance for self-report of East Asian race/ethnicity, and in nearly all of them, there is no evidence of continental East Asian genetic ancestry. In slightly more than half of these cases, it is the parent who self-reports East Asian race/ethnicity.
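The concordance figures quoted for relatives can be reproduced from the pair counts in Table S9. The sketch below is illustrative only and assumes that the five collapsed categories used in Table S9 are the ones underlying the reported concordance; the variable and function names are hypothetical. A pair is counted as concordant when both members fall in the same category (the diagonal of each table).

```python
# Order of categories: White, African American, Latino, Asian, Other/Uncertain.
# Parent (column) by offspring (row) pair counts, copied from Table S9.
PARENT_CHILD = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

# Full-sib counts are unordered, so Table S9 gives only the upper triangle.
# Each row starts at its diagonal cell.
FULL_SIBS_UPPER = [
    [1531, 2, 33, 5, 30],
    [45, 7, 0, 0],
    [155, 2, 2],
    [205, 1],
    [0],
]


def concordance_square(matrix):
    """Return (concordant, total) pairs for a full square table."""
    total = sum(sum(row) for row in matrix)
    concordant = sum(matrix[i][i] for i in range(len(matrix)))
    return concordant, total


def concordance_upper(rows):
    """Return (concordant, total) pairs for an upper-triangular table."""
    total = sum(sum(row) for row in rows)
    concordant = sum(row[0] for row in rows)  # first cell is the diagonal
    return concordant, total


parent_child = concordance_square(PARENT_CHILD)
full_sibs = concordance_upper(FULL_SIBS_UPPER)

for label, (conc, total) in (("parent-child", parent_child), ("full sibs", full_sibs)):
    print(f"{label}: {conc}/{total} = {100 * conc / total:.0f}% concordant")
```

With these counts the diagonals sum to 3478 of 3741 parent–child pairs and 1936 of 2018 sib pairs, matching the totals given in the text.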

Discussion

The RPGEH GERA cohort provides an excellent opportunity to characterize a large, representative northern California population from the perspectives of self-reported race/ethnicity/nationality and genetic ancestry. Overall, the cohort is 80.8% non-Hispanic white and 19.2% minority and includes a broad spectrum of races/ethnicities/nationalities. The results of our PC analyses to characterize genetic structure within each of the major race/ethnicity groups are largely consistent with prior reports.

For the non-Hispanic white individuals, we see a broad spectrum of genetic ancestry, ranging from northern Europe

Table 2 Proportion of individuals with genetic ancestry from each of six ancestral populations, by self-reported race/ethnicity

                            Genetic ancestry
Race/ethnicity       n      EW      AF      EA      NA      PI      SA     Female   Mean age (SE)
One group        91502                                                      0.59    62.92 (0.04)
EW               76401   1.000   0.003   0.004   0.009   0.000   0.043     0.59    63.71 (0.05)
AA                2679   0.910   0.997   0.005   0.013   0.000   0.021     0.57    61.28 (0.25)
EA                6389   0.034   0.001   1.000   0.005   0.217   0.008     0.58    58.51 (0.18)
NA                 674   0.999   0.022   0.022   0.144   0.000   0.037     0.55    64.34 (0.51)
LT                4807   0.999   0.277   0.008   0.942   0.000   0.024     0.58    57.92 (0.21)
PI                  92   0.576   0.000   0.913   0.000   0.663   0.261     0.48    56.89 (1.49)
SA                 460   0.307   0.007   0.109   0.004   0.050   0.961     0.39    54.29 (0.67)
Two groups        5476                                                      0.67    57.37 (0.19)
EW/AA              123   1.000   0.976   0.024   0.033   0.000   0.081     0.67    52.76 (1.50)
EW/EA              572   0.960   0.005   0.942   0.014   0.063   0.080     0.68    49.13 (0.65)
EW/NA             2548   1.000   0.008   0.007   0.096   0.000   0.024     0.68    61.63 (0.26)
EW/LT             1564   1.000   0.071   0.010   0.710   0.000   0.068     0.68    54.05 (0.38)
EW/PI               48   1.000   0.000   0.813   0.042   0.625   0.021     0.79    59.64 (2.00)
EW/SA               44   0.955   0.000   0.068   0.045   0.000   0.682     0.66    53.55 (2.26)
AA/EA               29   0.655   0.931   0.828   0.034   0.000   0.069     0.56    50.06 (2.46)
AA/NA               99   1.000   0.99    0.000   0.051   0.000   0.030     0.68    59.67 (1.30)
AA/LT              114   0.991   0.596   0.018   0.754   0.000   0.026     0.34    55.09 (1.42)
AA/SA               13   0.167   0.167   0.167   0.083   0.250   0.833     0.17    54.33 (4.23)
EA/LT               95   0.789   0.042   0.926   0.642   0.063   0.000     0.67    56.07 (1.44)
EA/PI               40   0.275   0.025   1.000   0.000   0.475   0.025     0.60    56.93 (2.37)
EA/SA               17   0.059   0.000   0.765   0.000   0.059   0.235     0.47    62.06 (2.88)
NA/LT              129   1.000   0.140   0.031   0.953   0.000   0.047     0.68    58.22 (1.19)
LT/PI               12   1.000   0.417   0.250   0.917   0.000   0.167     0.64    53.93 (3.95)
LT/SA               10   0.600   0.000   0.400   0.600   0.200   0.500     0.63    61.50 (4.56)
Three groups       512                                                      0.70    53.52 (0.75)
EW/AA/NA           115   0.991   0.991   0.000   0.043   0.000   0.017     0.74    59.71 (1.58)
EW/AA/LT            23   0.957   0.696   0.043   0.522   0.000   0.087     0.52    50.09 (4.11)
EW/EA/NA            32   0.969   0.000   0.875   0.250   0.000   0.125     0.69    46.06 (3.11)
EW/EA/LT            48   1.000   0.041   0.857   0.490   0.000   0.061     0.72    45.98 (2.49)
EW/EA/PI            35   0.943   0.000   1.000   0.029   0.486   0.000     0.67    51.92 (3.02)
EW/NA/LT           198   1.000   0.066   0.000   0.803   0.000   0.086     0.70    53.83 (0.99)

Only those with at most three self-reported race/ethnicities and three genetic ancestries are included; race/ethnicity categories with at least 10 members are shown. For individuals self-reporting two or three races/ethnicities, the correspondence between self-report and genetic ancestry is generally quite high. For example, for those reporting European/West Asian and East Asian race/ethnicity, 96% and 94% have evidence of European/West Asian and East Asian genetic ancestry, respectively; for those reporting African/African American and East Asian race/ethnicity, 93.1% and 82.8% have evidence of African and East Asian genetic ancestry, while 65.5% have evidence of European/West Asian genetic ancestry. Among those reporting European/West Asian and Native American race/ethnicity, 9.6% have evidence of Native American genetic ancestry; for those reporting African/African American and Native American race/ethnicity, 5.1% have evidence of Native American genetic ancestry. EW, European/West Asian; AA, African/African American/Afro-Caribbean; EA, East Asian; NA, Native American/Alaska Native; LT, Latino; PI, Pacific Islander; SA, South Asian. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African ancestry.


to southern Europe and the Middle East. Within that large group, with the exception of Ashkenazi Jews, we see little evidence of distinct clusters. This is consistent with considerable exogamy within this group. By comparison, we do see structure in the East Asian population correlated with nationality, reflecting continuing endogamy for these nationalities and also recent immigration. On the other hand, we did observe a substantial number of individuals who are admixed between East Asian and European ancestry, reflecting 10% of all those reporting East Asian race/ethnicity. The majority of these reflected individuals with one East Asian and one European parent or one East Asian and three European grandparents. In addition, we noted that for self-reported Filipinos, a substantial proportion have modest levels of European genetic ancestry, reflecting older admixture.

As expected, most self-reported African Americans show some degree of European genetic ancestry, with an overall average of 26%. Among individuals self-reporting as African American and East Asian, all showed evidence of genetic ancestry from three continents: Africa, Europe/West Asia, and East Asia.

Latinos are the most complex from a genetic perspective, as they can possess genetic ancestry from essentially any of the major continents. Most of the Latinos in our study derive from Mexico and Central/South America, with smaller proportions from Puerto Rico and Cuba. These individuals have varying proportions of Native American, European, and African genetic ancestry. We also found evidence of East Asian genetic ancestry in some individuals, but these were primarily individuals who self-reported both East Asian and Latino nationalities.

Of note, 17% of the cohort had evidence of genetic ancestry from more than one continent. However, this does not mean that all or even most of these individuals represent recent continental admixture. As has been true in other analyses (Li et al. 2008), genetic similarity between West Asians and South Asians (and, to some degree, South Asians and East Asians) did not allow for a clear distinction among these genetic ancestries. As such, while some individuals were estimated to have South Asian genetic ancestry, this more likely reflects the difficulty in demarking West Asian vs. South Asian genetic ancestry. A similar situation holds for Pacific Islanders and East Asians, where we and others have shown strong genetic similarity for some Pacific Islander groups with East Asians. Also, some individuals may have reported more than a single race/ethnicity that may reflect recent country of origin in addition to, or rather than, more distant ancestry, with Indo-Fijians as one example.

If we include only individuals with genetic admixture from nonadjacent continents, the proportion with continental admixture is 12%. However, we also note that this fraction depends on our cutoff of 5% for defining genetic admixture, as well as some imprecision in the admixture estimation. Of course, a lower threshold would increase the proportion of the cohort that is considered to be genetically admixed, while a higher threshold would do the opposite.
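The dependence on the cutoff can be illustrated with a small sketch. This is not the authors' code: the toy cohort, the function names, and the use of a simple "two or more ancestries at or above the threshold" rule are assumptions made for illustration, following the rule that an ancestry is assigned when it accounts for at least 5% of an individual's estimated ancestry.

```python
def ancestries_present(proportions, threshold=0.05):
    """Ancestry labels whose estimated proportion meets the threshold."""
    return {name for name, p in proportions.items() if p >= threshold}


def fraction_admixed(cohort, threshold=0.05):
    """Fraction of individuals assigned more than one ancestry."""
    admixed = sum(
        1 for person in cohort if len(ancestries_present(person, threshold)) > 1
    )
    return admixed / len(cohort)


# Hypothetical admixture estimates for four individuals (proportions sum to 1).
toy_cohort = [
    {"EW": 0.97, "NA": 0.03},
    {"EW": 0.92, "SA": 0.08},
    {"AF": 0.78, "EW": 0.20, "NA": 0.02},
    {"EA": 1.00},
]

by_threshold = {t: fraction_admixed(toy_cohort, t) for t in (0.01, 0.05, 0.10)}

for threshold, fraction in by_threshold.items():
    print(f"threshold {threshold:.2f}: {fraction:.2f} admixed")
```

In this toy cohort the admixed fraction falls from 0.75 to 0.50 to 0.25 as the threshold rises from 1% to 5% to 10%, which is the direction of change described above.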

As expected in a large cohort such as this, we were easily able to identify a substantial number of close relatives, specifically 34 identical twins, 2018 full sibs, and 3741 parent–child pairs. We also had clear evidence of a large number of likely second- and third-degree relatives, but these kinship groups did not separate clearly from each other. More refined methods may be able to provide more precise kinship estimates.

A major goal was to examine the relationship between self-reported race/ethnicity and genetic ancestry. By and large, there was very high correspondence between the two, allowing for the broad range of genetic ancestry that exists among African Americans and Latinos. We were also able to compare the self-report data of identical twins, parent–child pairs, and sib pairs. All MZ twin pairs were concordant, as were most of the sib pairs. However, we did note that for some sib pairs, the self-report data differed. For the majority of these, the discordance related to reporting of Native American or Latino race/ethnicity.

The results obtained here are important for the study of complex genetic disease in this large population-based cohort through association studies, admixture analysis, and admixture mapping, and in particular for investigating observed ethnic variation in diseases and traits. As described previously (Risch et al. 2002; Tang et al. 2006), the strong correspondence also observed here between the social categories of race/ethnicity and genetic ancestry makes dissection of racial/ethnic differences challenging. The patterns that we observed reflect historical and recent mating practices and their impact on genetic variation. On a global level, geography continues to create strong local endogamy, which is also reflected among the recent US migrant populations. However, the increasing frequency of interracial individuals that we observed in this cohort, a reflection of increasing exogamy, will enhance both the complexity of such analyses and the opportunities to investigate the genetic and environmental contributors to racial/ethnic differences. While the advent of myriad genetic markers can provide accurate estimates of individuals' genetic ancestry, the social aspects of race/ethnicity may be more challenging to characterize. For example, in our study, considering the various combinations of 7 race/ethnicity categories that an individual could

Table 3 Proportion of individuals with genetic ancestry from each of six ancestral populations, by race/ethnicity as determined by KP administrative databases

                             Genetic ancestry
Race/ethnicity        n      EW      AF      EA      NA      PI      SA
White              4575   1       0.007   0.009   0.017   0.001   0.030
African American    102   0.941   0.990   0.000   0.020   0.000   0.020
Asian               311   0.106   0.003   0.952   0.006   0.167   0.074
Latino              255   0.988   0.192   0.043   0.816   0.000   0.035
Other/uncertain      84   0.929   0.131   0.357   0.167   0.071   0.083

Abbreviations are the same as in Table 2.


endorse, we observed 50 different combinations, and this does not include individuals who endorsed >3 (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health, and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment, and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans' genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch N E Burchard E Ziv and H Tang 2002 Categorizationof humans in biomedical research genes race and disease Ge-nome Biol 3 comment2007

Seldin M F R Shigeta P Villoslada C Selmi J Tuomilehtoet al 2006 European population substructure clustering ofnorthern and southern populations PLoS Genet 2 e143

Sokal R R N L Oden and C Wilson 1991 Genetic evidence forthe spread of agriculture in Europe by demic diffusion Nature351 143ndash145

Su B J Xiao P Underhill R Deka W Zhang et al 1999 Y-chromosome evidence for a northward migration of modernhumans into Eastern Asia during the last Ice Age Am JHum Genet 65 1718ndash1724

Tang H J Peng P Wang and N J Risch 2005 Estimation ofindividual admixture analytical and study design considera-tions Genet Epidemiol 28 289ndash301


Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray


GENETICS
Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest-southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A-F) Individuals distributed according to continental differentiation. Admixed individuals are also observed.

Figure S9 Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
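The zero-IBS proportion on the x-axis of Figure S9 can be illustrated with a minimal Python sketch. This is not KING's implementation; it assumes genotypes coded as allele counts (0, 1, 2) with -1 for missing calls, and the function name `ibs0_proportion` is hypothetical. A SNP counts as zero IBS when the two individuals are opposite homozygotes (e.g., TT and CC). Barring genotyping error, MZ twins and parent-child pairs are expected to have a zero-IBS proportion close to 0.

```python
def ibs0_proportion(g1, g2, missing=-1):
    """Proportion of jointly called SNPs at which two individuals share
    zero alleles identical by state (opposite homozygotes, e.g. TT vs CC).

    Genotypes are allele counts 0/1/2; `missing` marks a no-call.
    The coding scheme is an assumption made for this illustration.
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must have the same length")
    called = [(a, b) for a, b in zip(g1, g2) if a != missing and b != missing]
    if not called:
        raise ValueError("no SNPs are called in both individuals")
    # |a - b| == 2 only for the pairs (0, 2) and (2, 0): opposite homozygotes.
    opposite = sum(1 for a, b in called if abs(a - b) == 2)
    return opposite / len(called)


# Toy example (made-up genotypes): 5 jointly called SNPs, 2 opposite homozygotes.
person_a = [0, 2, 1, 2, 0, -1]
person_b = [2, 2, 1, 0, 0, 2]
print(ibs0_proportion(person_a, person_b))
```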

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry and a smaller number (123) to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self-reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned, due to a black mark on the glass scanning plate overlaying the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error only affected individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry that were run on the EAS array had no scanning errors and hence were not adjudicated. Reviewing the race/ethnicity self-report of these individuals, the large majority self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those with self-report race/ethnicity data that was not missing or adjudicated due to scanning errors. Because of the time element required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey prior to data cleaning. After data cleaning, we noticed that some individuals were assigned to arrays based on the raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was only used in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.

For the initial individual admixture analyses, a set of 43,988 SNPs that were common between the HGDP dataset and our set of 144,799 high-quality SNPs was used. A set of 38,301 SNPs (used for the global PCA run) remaining after LD pruning and removal of SNPs located in regions with structural variations was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.
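The pairwise LD pruning rule described in this section (r² greater than 0.5, within 5 Mb) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the text does not say which member of a correlated pair was removed, so the sketch simply keeps the earlier SNP in genomic order (a greedy choice), uses the squared Pearson correlation of 0/1/2 genotype counts as r², and assumes complete genotype data. The names `r_squared` and `ld_prune` are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(snps, r2_max=0.5, window_bp=5_000_000):
    """Greedy pairwise LD pruning.

    `snps` is a list of (chromosome, position_bp, genotypes) tuples sorted by
    chromosome and position. A SNP is dropped if it has r^2 > r2_max with an
    already-kept SNP on the same chromosome within `window_bp`.
    Returns the indices of the SNPs that are kept.
    """
    kept = []
    for i, (chrom, pos, geno) in enumerate(snps):
        in_ld = False
        # Walk back through kept SNPs, nearest first; stop once out of range.
        for j in reversed(kept):
            other_chrom, other_pos, other_geno = snps[j]
            if other_chrom != chrom or pos - other_pos > window_bp:
                break
            if r_squared(geno, other_geno) > r2_max:
                in_ld = True
                break
        if not in_ld:
            kept.append(i)
    return kept


# Toy data (made up): SNP 1 duplicates SNP 0 and is pruned; the rest are kept.
toy = [
    ("2", 100, [0, 1, 2, 1, 0, 2]),
    ("2", 200, [0, 1, 2, 1, 0, 2]),
    ("2", 300, [2, 0, 1, 0, 2, 1]),
    ("2", 6_000_000, [0, 1, 2, 1, 0, 2]),
    ("6", 150, [0, 1, 2, 1, 0, 2]),
]
print(ld_prune(toy))
```

In the toy data the fourth SNP is identical to the first but lies more than 5 Mb away, and the fifth is on a different chromosome, so neither is compared against it.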

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority the second continent was South Asia. Hence this likely does not reflect recent admixture, but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent; a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian, or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups, and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, for nearly all European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.

Table S1 Magnitude of correlation of PC loadings for three supersets

PC    Set1-Set2 correlation    Set1-Set3 correlation    Set2-Set3 correlation
1     0.981                    0.979                    0.980
2     0.876                    0.853                    0.856
3     0.850                    0.816                    0.809
4     0.766                    0.776                    0.752
5     0.658                    0.654                    0.639
6     0.605                    0.604                    0.592
7     0.001                    0.001                    0.210
8     0.043                    0.179                    0.045
9     0.312                    0.068                    0.089
10    0.007                    0.067                    0.201

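Table S2 Self-reported Race/Ethnicity versus Genotyping Array. Race/ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian. Combined codes (e.g., EWAA) denote individuals reporting more than one race/ethnicity.

Columns: Race/Ethnicity, followed by counts for each Genotyping Array

Race/Ethnicity EUR EAS AFR LAT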

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272


Table S3 Race/Ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis-scanned self-report data

                    Genotyping Array
Race/Ethnicity      EUR     EAS     AFR     LAT
White               2115    267     2140    55
African American    0       0       99      3
Hispanic            5       0       0       255
Asian               5       183     123     1
Other/Uncertain     6       11      43      24
Total               2131    461     2405    338

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                              Number of SNPs after pruning
European/West Asian and Ashkenazi    38,050
European only                        37,967
South Asian                          38,823
East Asian                           38,077
Pacific Islander                     38,833
African American                     39,849
Latino                               38,365
All GERA                             38,301

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African        European       Native American
Mexican                   2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central-South American    5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                  4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8

Table S6 Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same except for AF, which represents sub-Saharan African.

Rows: Race/Ethnicity. Columns (Genetic Ancestry), in order: EW, AF, EA, NA, SA, EW/AF, EW/EA, EW/NA, EW/SA, AF/EA, AF/SA, EA/NA, EA/PI, EA/SA, PI/SA, EW/AF/EA, EW/AF/NA, EW/AF/SA, EW/EA/NA, EW/EA/PI, EW/EA/SA, EW/NA/SA, EW/PI/SA, AF/EA/PI, AF/EA/SA, AF/PI/SA, EA/PI/SA, All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489


Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6. Rows are genetic ancestry; columns are race/ethnicity.

Genetic ancestry    White    Afr Am    Asian    Latino    Other/uncertain
EW                  4302     0         4        37        34
AF                  0        6         0        0         0
EA                  0        0         218      3         3
SA                  0        0         8        0         0
EW/AF               25       92        0        3         6
EA/NA               0        0         1        0         0
EA/PI               0        0         41       0         0
EW/EA               33       0         18       2         15
EW/NA               68       1         0        151       8
EW/SA               131      0         2        0         2
EA/SA               0        0         3        0         1
SA/PI               0        0         1        0         0
EW/AF/EA            0        0         1        1         2
EW/AF/NA            3        1         0        44        2
EW/AF/SA            2        2         0        1         1
EW/EA/NA            3        0         1        5         3
EW/EA/PI            2        0         4        0         4
EW/EA/SA            3        0         3        0         0
EW/SA/NA            2        0         0        8         1
EW/SA/PI            1        0         0        0         0
EA/SA/PI            0        0         6        0         2
Total               4575     102       311      255       84

Table S8 Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Groups    Race/      One continent     Two continents    Three continents   All       Continental distribution
reported  ethnicity  Number    %       Number    %       Number    %        Number    1 continent   2 continents                        3 continents
One       EW         71992     94.2    4288      5.6     121       0.2      76401     100% EW       75% EW/SA, 15% EW/NA
One       AA         230       8.6     2362      88.2    87        3.2      2679      100% AF       99% AF/EW
One       LT         235       4.9     3135      65.2    1437      29.9     4807      99% EW        99% EW/NA                           90% EW/NA/AF
One       EA         4837      75.7    1419      22.2    133       2.1      6389      100% EA       89% EA/PI, 8% EA/EW
One       PI         7         7.6     40        43.5    45        48.9     92                      48% PI/EW, 33% PI/EA                71% PI/EW/EA
One       SA         272       59.1    175       38.1    13        2.8      460       94% SA        75% SA/EW, 14% SA/EA
One       NA         555       82.3    87        12.9    32        4.8      674       100% EW       77% EW/NA
One       All        78128     85.4    11506     12.6    1868      2.0      91502
          % of total 93.9
Two       All        2791      51.0    2140      39.1    545       10.0     5476      97% EW        60% EW/NA, 22% EW/EA, 13% EW/AF     29% EW/NA/AF, 23% EW/SA/NA, 17% EW/EA/NA
          % of total 5.6
Three     All        48        9.4     345       67.5    118       23.1     511       90% EW        45% EW/NA, 36% EW/AF, 19% EW/EA     31% EW/EA/NA, 20% EW/AF/NA, 16% EW/SA/NA
          % of total 0.5
All       All        80967     83.1    13991     14.4    2531      2.6      97489

Table S9 First-degree relatives organized by self-reported race/ethnicity

MZ pair

                    White    African American    Latino    Asian    Other/Uncertain
White               28       0                   0         0        0
African American    0        0                   0         0        0
Latino              0        0                   2         0        0
Asian               0        0                   0         4        0
Other/Uncertain     0        0                   0         0        0

Parent (column) – Offspring (row)

                    White    African American    Latino    Asian    Other/Uncertain
White               3044     1                   23        6        21
African American    8        54                  0         1        1
Latino              122      6                   175       3        0
Asian               35       2                   0         205      1
Other/Uncertain     32       0                   1         0        0

Full sibs (upper triangle only)

                    White    African American    Latino    Asian    Other/Uncertain
White               1531     2                   33        5        30
African American             45                  7         0        0
Latino                                           155       2        2
Asian                                                      205      1
Other/Uncertain                                                     0
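As a worked check, the concordance rates for self-reported race/ethnicity can be recomputed from the counts in Table S9. The Python sketch below treats the diagonal cells of each sub-table as concordant pairs, which is an assumption about how concordance was scored; the variable and function names are illustrative only. Under that assumption the counts give about 93% for parent-child pairs and about 96% for full-sib pairs, in line with the figures stated in the abstract.

```python
# Row/column order: White, African American, Latino, Asian, Other/Uncertain.
# Counts transcribed from Table S9.
parent_offspring = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

# Full sibs are unordered pairs, so only the upper triangle is populated;
# the empty lower-triangle cells are entered here as 0.
full_sibs = [
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
]


def concordance(table):
    """Return (concordant pairs, total pairs, concordance rate).

    Diagonal cells are counted as concordant (an assumption of this sketch).
    """
    total = sum(sum(row) for row in table)
    concordant = sum(table[i][i] for i in range(len(table)))
    return concordant, total, concordant / total


for name, table in [("parent-child", parent_offspring), ("full sibs", full_sibs)]:
    concordant, total, rate = concordance(table)
    print(f"{name}: {concordant}/{total} = {rate:.1%}")
```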

Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Columns: Sib 1 Race/Ethnicity, Sib 2 Race/Ethnicity, Genetic Ancestry, Number

Both Sibs Self-Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1


Table S11 Race/Ethnicity and Genetic Ancestry for Parent-Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Columns: Parent Race/Ethnicity, Child Race/Ethnicity, Parent Genetic Ancestry, Child Genetic Ancestry, Number

Parent and Child Self-Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


a low level of Native American ancestry (but >5%), whereas the child is below our 5% threshold. Interestingly, in 5 of these cases the parent self-reports as Latino race/ethnicity but the child does not, whereas the opposite is true in 3 of the 8 cases. In an additional 114 cases, the genetic information for parent and child matches but the self-reports for race/ethnicity are different. The largest subgroup (49) of these cases reflects differences in the reporting of Native American or Latino race/ethnicity, and in 47 of these there is no evidence of Native American genetic ancestry in the parent or child; it is approximately equally split as to whether the parent or child reports the Native American race/ethnicity. Among 53 cases where parent and child are discordant for self-report of Latino race/ethnicity, in 23 it is the child who self-reports Latino race/ethnicity whereas the parent does not. There are 11 cases of discordance for self-report of East Asian race/ethnicity, and in nearly all of them there is no evidence of continental East Asian genetic ancestry. In slightly more than half of these cases it is the parent who self-reports East Asian race/ethnicity.

Discussion

The RPGEH GERA cohort provides an excellent opportunity to characterize a large, representative northern California population from the perspectives of self-reported race/ethnicity/nationality and genetic ancestry. Overall, the cohort is 80.8% non-Hispanic white and 19.2% minority and includes a broad spectrum of races/ethnicities/nationalities. The results of our PC analyses to characterize genetic structure within each of the major race/ethnicity groups are largely consistent with prior reports.

For the non-Hispanic white individuals, we see a broad spectrum of genetic ancestry ranging from northern Europe to southern Europe and the Middle East. Within that large group, with the exception of Ashkenazi Jews, we see little evidence of distinct clusters. This is consistent with considerable exogamy within this group. By comparison, we do see structure in the East Asian population correlated with nationality, reflecting continuing endogamy for these nationalities and also recent immigration. On the other hand, we did observe a substantial number of individuals who are admixed between East Asian and European ancestry, reflecting 10% of all those reporting East Asian race/ethnicity. The majority of these reflected individuals with one East Asian and one European parent or one East Asian and three European grandparents. In addition, we noted that for self-reported Filipinos, a substantial proportion have modest levels of European genetic ancestry, reflecting older admixture.

Table 2 Proportion of individuals with genetic ancestry from each of six ancestral populations by self-reported race/ethnicity

                           Genetic ancestry
Race/ethnicity   n         EW       AF       EA       NA       PI       SA       Female   Mean age (SE)
One group        91502                                                           0.59     62.92 (0.04)
EW               76401     1.000    0.003    0.004    0.009    0.000    0.043    0.59     63.71 (0.05)
AA               2679      0.910    0.997    0.005    0.013    0.000    0.021    0.57     61.28 (0.25)
EA               6389      0.034    0.001    1.000    0.005    0.217    0.008    0.58     58.51 (0.18)
NA               674       0.999    0.022    0.022    0.144    0.000    0.037    0.55     64.34 (0.51)
LT               4807      0.999    0.277    0.008    0.942    0.000    0.024    0.58     57.92 (0.21)
PI               92        0.576    0.000    0.913    0.000    0.663    0.261    0.48     56.89 (1.49)
SA               460       0.307    0.007    0.109    0.004    0.050    0.961    0.39     54.29 (0.67)
Two groups       5476                                                            0.67     57.37 (0.19)
EW/AA            123       1.000    0.976    0.024    0.033    0.000    0.081    0.67     52.76 (1.50)
EW/EA            572       0.960    0.005    0.942    0.014    0.063    0.080    0.68     49.13 (0.65)
EW/NA            2548      1.000    0.008    0.007    0.096    0.000    0.024    0.68     61.63 (0.26)
EW/LT            1564      1.000    0.071    0.010    0.710    0.000    0.068    0.68     54.05 (0.38)
EW/PI            48        1.000    0.000    0.813    0.042    0.625    0.021    0.79     59.64 (2.00)
EW/SA            44        0.955    0.000    0.068    0.045    0.000    0.682    0.66     53.55 (2.26)
AA/EA            29        0.655    0.931    0.828    0.034    0.000    0.069    0.56     50.06 (2.46)
AA/NA            99        1.000    0.99     0.000    0.051    0.000    0.030    0.68     59.67 (1.30)
AA/LT            114       0.991    0.596    0.018    0.754    0.000    0.026    0.34     55.09 (1.42)
AA/SA            13        0.167    0.167    0.167    0.083    0.250    0.833    0.17     54.33 (4.23)
EA/LT            95        0.789    0.042    0.926    0.642    0.063    0.000    0.67     56.07 (1.44)
EA/PI            40        0.275    0.025    1.000    0.000    0.475    0.025    0.60     56.93 (2.37)
EA/SA            17        0.059    0.000    0.765    0.000    0.059    0.235    0.47     62.06 (2.88)
NA/LT            129       1.000    0.140    0.031    0.953    0.000    0.047    0.68     58.22 (1.19)
LT/PI            12        1.000    0.417    0.250    0.917    0.000    0.167    0.64     53.93 (3.95)
LT/SA            10        0.600    0.000    0.400    0.600    0.200    0.500    0.63     61.50 (4.56)
Three groups     512                                                             0.70     53.52 (0.75)
EW/AA/NA         115       0.991    0.991    0.000    0.043    0.000    0.017    0.74     59.71 (1.58)
EW/AA/LT         23        0.957    0.696    0.043    0.522    0.000    0.087    0.52     50.09 (4.11)
EW/EA/NA         32        0.969    0.000    0.875    0.250    0.000    0.125    0.69     46.06 (3.11)
EW/EA/LT         48        1.000    0.041    0.857    0.490    0.000    0.061    0.72     45.98 (2.49)
EW/EA/PI         35        0.943    0.000    1.000    0.029    0.486    0.000    0.67     51.92 (3.02)
EW/NA/LT         198       1.000    0.066    0.000    0.803    0.000    0.086    0.70     53.83 (0.99)

Only those with at most three self-reported race/ethnicities and three genetic ancestries are included; race/ethnicity categories with at least 10 members are shown. For individuals self-reporting two or three races/ethnicities, the correspondence between self-report and genetic ancestry is generally quite high. For example, for those reporting European/West Asian and East Asian race/ethnicity, 96% and 94% have evidence of European/West Asian and East Asian genetic ancestry, respectively; for those reporting African/African American and East Asian race/ethnicity, 93.1% and 82.8% have evidence of African and East Asian genetic ancestry, while 65.5% have evidence of European/West Asian genetic ancestry. Among those reporting European/West Asian and Native American race/ethnicity, 9.6% have evidence of Native American genetic ancestry; for those reporting African/African American and Native American race/ethnicity, 5.1% have evidence of Native American genetic ancestry. EW, European/West Asian; AA, African/African American/Afro-Caribbean; EA, East Asian; NA, Native American/Alaska Native; LT, Latino; PI, Pacific Islander; SA, South Asian. Genetic ancestry abbreviations are the same except for AF, which represents sub-Saharan African ancestry.

As expected, most self-reported African Americans show some degree of European genetic ancestry, with an overall average of 26%. Among individuals self-reporting as African American and East Asian, all showed evidence of genetic ancestry from three continents: Africa, Europe/West Asia, and East Asia.

Latinos are the most complex from a genetic perspective, as they can possess genetic ancestry from essentially any of the major continents. Most of the Latinos in our study derive from Mexico and Central/South America, with smaller proportions from Puerto Rico and Cuba. These individuals have varying proportions of Native American, European, and African genetic ancestry. We also found evidence of East Asian genetic ancestry in some individuals, but these were primarily individuals who self-reported both East Asian and Latino nationalities.

Of note, 17% of the cohort had evidence of genetic ancestry from more than one continent. However, this does not mean that all or even most of these individuals represent recent continental admixture. As has been true in other analyses (Li et al. 2008), genetic similarity between West Asians and South Asians (and to some degree South Asians and East Asians) did not allow for a clear distinction among these genetic ancestries. As such, while some individuals were estimated to have South Asian genetic ancestry, this more likely reflects the difficulty in demarking West Asian vs. South Asian genetic ancestry. A similar situation holds for Pacific Islanders and East Asians, where we and others have shown strong genetic similarity for some Pacific Islander groups with East Asians. Also, some individuals may have reported more than a single race/ethnicity that may reflect recent country of origin in addition to, or rather than, more distant ancestry, with Indo-Fijians as one example.

If we include only individuals with genetic admixture from nonadjacent continents, the proportion with continental admixture is 12%. However, we also note that this fraction depends on our cutoff of 5% for defining genetic admixture, as well as some imprecision in the admixture estimation. Of course, a lower threshold would increase the proportion of the cohort that is considered to be genetically admixed, while a higher threshold would do the opposite.
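The dependence on the cutoff can be illustrated with a minimal Python sketch. It assumes that each individual's admixture estimate is available as a mapping from ancestral population to estimated proportion. The function names and the four toy individuals are hypothetical and are not GERA data, and the sketch does not implement the restriction to nonadjacent continents.

```python
def ancestries_above_cutoff(proportions, cutoff=0.05):
    """Return the populations whose estimated ancestry share is at least `cutoff`."""
    return {pop for pop, share in proportions.items() if share >= cutoff}


def fraction_multi_ancestry(cohort, cutoff=0.05):
    """Fraction of individuals assigned ancestry from more than one population."""
    n_multi = sum(
        1 for proportions in cohort
        if len(ancestries_above_cutoff(proportions, cutoff)) > 1
    )
    return n_multi / len(cohort)


# Hypothetical individuals (proportions invented for illustration only).
toy_cohort = [
    {"EW": 0.97, "NA": 0.03},
    {"EW": 0.92, "NA": 0.08},
    {"AF": 0.74, "EW": 0.26},
    {"EA": 0.50, "EW": 0.50},
]

for cutoff in (0.02, 0.05, 0.10):
    print(cutoff, fraction_multi_ancestry(toy_cohort, cutoff))
```

In this toy example the fraction counted as having more than one ancestry falls from 1.0 to 0.75 to 0.5 as the cutoff rises from 2% to 5% to 10%, which is the direction of the effect described above.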

As expected in a large cohort such as this, we were easily able to identify a substantial number of close relatives, specifically 34 identical twin pairs, 2018 full-sib pairs, and 3741 parent-child pairs. We also had clear evidence of a large number of likely second- and third-degree relatives, but these kinship groups did not separate clearly from each other. More refined methods may be able to provide more precise kinship estimates.
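One common way to separate relationship classes uses an estimated kinship coefficient together with the proportion of SNPs with zero alleles shared identical by state (IBS0). The sketch below is an illustration only: the kinship ranges are those given in the KING documentation (Manichaikul et al. 2010), the cutoffs actually applied to GERA are not stated here, and the IBS0 cutoff separating parent-child from full-sib pairs (`ibs0_parent_child_max`) is a hypothetical placeholder.

```python
def classify_pair(kinship, ibs0, ibs0_parent_child_max=0.001):
    """Assign a relationship class from a kinship estimate and an IBS0 proportion.

    Kinship ranges follow the KING documentation; the IBS0 cutoff is an
    assumed placeholder, since parent-child pairs are expected to show
    essentially no IBS0 sites while full sibs show some.
    """
    if kinship > 0.354:
        return "MZ twin or duplicate"
    if kinship > 0.177:
        return "parent-child" if ibs0 <= ibs0_parent_child_max else "full sib"
    if kinship > 0.0884:
        return "second degree"
    if kinship > 0.0442:
        return "third degree"
    return "unrelated or more distant"


print(classify_pair(0.25, 0.0))    # first-degree range, no IBS0 sites
print(classify_pair(0.25, 0.02))   # first-degree range, some IBS0 sites
```

Because the second- and third-degree ranges are adjacent and estimates are noisy, pairs near a boundary can fall on either side, which is consistent with these groups not separating clearly.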

A major goal was to examine the relationship between self-reported race/ethnicity and genetic ancestry. By and large, there was very high correspondence between the two, allowing for the broad range of genetic ancestry that exists among African Americans and Latinos. We were also able to compare the self-report data of identical twins, parent–child pairs, and sib pairs. All MZ twin pairs were concordant, as were most of the sib pairs. However, we did note that for some sib pairs, the self-report data differed. For the majority of these, the discordance related to reporting of Native American or Latino race/ethnicity.

The results obtained here are important for the study of complex genetic disease in this large population-based cohort through association studies, admixture analysis, and admixture mapping, and in particular for investigating observed ethnic variation in diseases and traits. As described previously (Risch et al. 2002; Tang et al. 2006), the strong correspondence also observed here between the social categories of race/ethnicity and genetic ancestry makes dissection of racial/ethnic differences challenging. The patterns that we observed reflect historical and recent mating practices and their impact on genetic variation. On a global level, geography continues to create strong local endogamy, which is also reflected among the recent U.S. migrant populations. However, the increasing frequency of interracial individuals that we observed in this cohort (a reflection of increasing exogamy) will enhance both the complexity of such analyses and the opportunities to investigate the genetic and environmental contributors to racial/ethnic differences. While the advent of myriad genetic markers can provide accurate estimates of individuals' genetic ancestry, the social aspects of race/ethnicity may be more challenging to characterize. For example, in our study, considering the various combinations of 7 race/ethnicity categories that an individual could endorse, we observed 50 different combinations, and this does not include individuals who endorsed more than 3 (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.

Table 3 Proportion of individuals with genetic ancestry from each of six ancestral populations, by race/ethnicity as determined by KP administrative databases

                           Genetic ancestry
Race/ethnicity     n       EW      AF      EA      NA      PI      SA
White              4575    1       0.007   0.009   0.017   0.001   0.030
African American   102     0.941   0.990   0.000   0.020   0.000   0.020
Asian              311     0.106   0.003   0.952   0.006   0.167   0.074
Latino             255     0.988   0.192   0.043   0.816   0.000   0.035
Other/uncertain    84      0.929   0.131   0.357   0.167   0.071   0.083

Abbreviations are the same as in Table 2.

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health, and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment, and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans' genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch, N., E. Burchard, E. Ziv, and H. Tang, 2002 Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 3: comment2007.

Seldin, M. F., R. Shigeta, P. Villoslada, C. Selmi, J. Tuomilehto et al., 2006 European population substructure: clustering of northern and southern populations. PLoS Genet. 2: e143.

Sokal, R. R., N. L. Oden, and C. Wilson, 1991 Genetic evidence for the spread of agriculture in Europe by demic diffusion. Nature 351: 143–145.

Su, B., J. Xiao, P. Underhill, R. Deka, W. Zhang et al., 1999 Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65: 1718–1724.

Tang, H., J. Peng, P. Wang, and N. J. Risch, 2005 Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28: 289–301.


Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray


GENETICS

Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest‐southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A‐F) Individuals distributed according to continental differentiation. Admixed individuals also observed.

Figure S9 Identification of MZ twin and first‐degree relative pairs by KING. The x‐axis represents the proportion of SNPs with zero IBS (Identity by State), e.g., TT and CC.

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry and a smaller number (123) to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self‐report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self‐reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned, due to a black mark on the glass scanning plate overlaying the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error only affected individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry that were run on the EAS array had no scanning errors and hence were not adjudicated. Reviewing the race/ethnicity self‐report of these individuals, the large majority self‐reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self‐reported race/ethnicity and genotyping array for those with self‐report race/ethnicity data that was not missing or adjudicated due to scanning errors. Because of the time element required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals were assigned to arrays based on the raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self‐report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re‐assignments were based on genetic information. The genetic information was only used in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self‐report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
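The pairwise pruning rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used for GERA: genotypes are taken to be 0/1/2 allele counts for a single chromosome, SNPs are sorted by base-pair position, and the later SNP of each offending pair is the one removed (the text does not say which member was dropped). The function names and the toy data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as no LD
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, positions, r2_max=0.5, window_bp=5_000_000):
    """Greedy pruning: return indices of SNPs kept.

    genotypes: list of per-SNP genotype lists (one value per individual).
    positions: base-pair positions, sorted ascending, same order as genotypes.
    """
    keep = [True] * len(genotypes)
    for i in range(len(genotypes)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(genotypes)):
            if positions[j] - positions[i] > window_bp:
                break  # later SNPs are even farther away
            if keep[j] and r_squared(genotypes[i], genotypes[j]) > r2_max:
                keep[j] = False
    return [i for i, kept in enumerate(keep) if kept]


# Toy data: SNP 1 duplicates SNP 0 nearby; SNP 2 duplicates SNP 0 but lies 6 Mb away.
a = [0, 1, 2, 0, 1, 2, 0, 1]
b = [2, 0, 1, 1, 0, 2, 1, 1]
print(ld_prune([a, a, a, b], [1_000_000, 1_001_000, 7_000_000, 7_500_000]))
```

In the toy example only the nearby duplicate is removed, because the distant copy falls outside the 5 Mb window and the fourth SNP is uncorrelated with its neighbor.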

For the initial individual admixture analyses, a set of 43,988 SNPs, which were common between the HGDP dataset and our set of 144,799 high quality SNPs, was used. A set of 38,301 SNPs (used for the global PCA run), remaining after LD pruning and removal of SNPs located in regions with structural variations, was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self‐reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self‐report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority, the second continent was South Asia. Hence this likely does not reflect recent admixture, but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self‐reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self‐reported Latinos have genetic ancestry from more than one continent; a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self‐reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self‐reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self‐reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self‐reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, for nearly all, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories; the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.

Table S1 Magnitude of correlation of PC loadings for three supersets

PC    Set1‐Set2 correlation    Set1‐Set3 correlation    Set2‐Set3 correlation
1     0.981                    0.979                    0.980
2     0.876                    0.853                    0.856
3     0.850                    0.816                    0.809
4     0.766                    0.776                    0.752
5     0.658                    0.654                    0.639
6     0.605                    0.604                    0.592
7     0.001                    0.001                    0.210
8     0.043                    0.179                    0.045
9     0.312                    0.068                    0.089
10    0.007                    0.067                    0.201

Table S2 Self‐reported Race/Ethnicity versus Genotyping Array. Race/Ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro‐Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

                  Genotyping Array
Race/Ethnicity    EUR      EAS     AFR     LAT
EW                76403    0       0       1
AA                0        0       2682    0
EA                0        6394    0       0
NA                0        0       0       674
LT                0        0       0       4855
PI                0        92      0       0
SA                461      0       0       0
EWAA              0        0       123     0
EWEA              38       538     0       0
EWNA              91       0       0       2463
EWLT              12       0       0       1561
EWPI              8        45      0       0
EWSA              44       0       0       0
AAEA              0        0       32      0
AANA              0        0       100     0
AALT              0        0       3       113
AAPI              0        0       4       0
AASA              0        0       12      0
EANA              0        7       0       0
EALT              0        2       0       108
EAPI              0        40      0       0
EASA              2        15      0       0
NALT              0        0       0       130
NAPI              0        2       0       0
LTPI              0        0       0       14
LTSA              0        0       0       8
PISA              0        10      0       0
EWAAEA            0        0       6       0
EWAANA            0        0       115     0
EWAALT            0        0       0       23
EWAAPI            0        0       1       0
EWAASA            0        0       2       0
EWEANA            2        0       0       30
EWEALT            0        1       0       52
EWEAPI            0        36      0       0
EWNALT            0        0       0       199
EWNAPI            0        0       0       2
EWLTPI            2        0       0       6
EWLTSA            0        0       0       4
EWPISA            0        1       0       0
AAEANA            0        0       8       0
AAEALT            0        0       0       8
AAEAPI            0        0       1       0
AAEASA            0        0       3       0
AANALT            0        0       1       1
AALTSA            0        0       1       6
EANALT            0        0       0       9
EALTPI            0        0       0       2
EAPISA            0        1       0       0
NALTPI            0        0       0       3
Total             77063    7184    3094    10272

Table S3 Race/Ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis‐scanned self‐report data

                    Genotyping Array
Race/Ethnicity      EUR     EAS    AFR     LAT
White               2115    267    2140    55
African American    0       0      99      3
Hispanic            5       0      0       255
Asian               5       183    123     1
Other/Uncertain     6       11     43      24
Total               2131    461    2405    338

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                              Number of SNPs after pruning
European/West Asian and Ashkenazi    38,050
European only                        37,967
South Asian                          38,823
East Asian                           38,077
Pacific Islander                     38,833
African American                     39,849
Latino                               38,365
All GERA                             38,301

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African        European       Native American
Mexican                   2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central‐South American    5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto‐Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                  4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8

Table S6 Distribution of genetic ancestry by self‐reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub‐Saharan African.

Rows: Race/Ethnicity. Columns (Genetic Ancestry): EW; AF; EA; NA; SA; EW AF; EW EA; EW NA; EW SA; AF EA; AF SA; EA NA; EA PI; EA SA; PI SA; EW AF EA; EW AF NA; EW AF SA; EW EA NA; EW EA PI; EW EA SA; EW NA SA; EW PI SA; AF EA PI; AF EA SA; AF PI SA; EA PI SA; All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489

Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis‐scanned self‐report race/ethnicity. Abbreviations as in Table S6.

                    Race/Ethnicity
Genetic ancestry    White    Afr Am    Asian    Latino    Other‐uncertain
EW                  4302     0         4        37        34
AF                  0        6         0        0         0
EA                  0        0         218      3         3
SA                  0        0         8        0         0
EW AF               25       92        0        3         6
EA NA               0        0         1        0         0
EA PI               0        0         41       0         0
EW EA               33       0         18       2         15
EW NA               68       1         0        151       8
EW SA               131      0         2        0         2
EA SA               0        0         3        0         1
SA PI               0        0         1        0         0
EW AF EA            0        0         1        1         2
EW AF NA            3        1         0        44        2
EW AF SA            2        2         0        1         1
EW EA NA            3        0         1        5         3
EW EA PI            2        0         4        0         4
EW EA SA            3        0         3        0         0
EW SA NA            2        0         0        8         1
EW SA PI            1        0         0        0         0
EA SA PI            0        0         6        0         2
Total               4575     102       311      255       84
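The proportions in Table 3 appear to follow directly from the counts in Table S7: the proportion for a population is the share of individuals in whom that population is among the assigned ancestries. The minimal Python sketch below checks this for the Afr Am row, using the nonzero cells of that row as we read them from the flattened table header; the function name is hypothetical.

```python
# Nonzero cells of the Afr Am row of Table S7, keyed by the set of assigned ancestries.
afr_am_counts = {
    ("AF",): 6,
    ("EW", "AF"): 92,
    ("EW", "NA"): 1,
    ("EW", "AF", "NA"): 1,
    ("EW", "AF", "SA"): 2,
}


def ancestry_proportions(counts, populations=("EW", "AF", "EA", "NA", "PI", "SA")):
    """Share of individuals whose assigned ancestries include each population."""
    total = sum(counts.values())
    return {
        pop: round(sum(n for combo, n in counts.items() if pop in combo) / total, 3)
        for pop in populations
    }


print(ancestry_proportions(afr_am_counts))
```

For this row the sketch gives 0.941 (EW), 0.990 (AF), 0.000 (EA), 0.020 (NA), 0.000 (PI), and 0.020 (SA), matching the African American row of Table 3.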

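Table S8 Distribution of continental genetic ancestry as a function of self‐reported race/ethnicity

The first column gives the number of race/ethnicity categories reported. Counts are numbers of individuals, with row percentages in parentheses. The continental distribution lists the most frequent genetic ancestries among those with 1, 2, and 3 continents.

Categories  Race/Ethnicity  One Continent   Two Continents   Three Continents  All     Continental Distribution
One         EW              71992 (94.2)    4288 (5.6)       121 (0.2)         76401   1 continent: 100% EW. 2 continents: 75% EWSA, 15% EWNA.
One         AA              230 (8.6)       2362 (88.2)      87 (3.2)          2679    1 continent: 100% AF. 2 continents: 99% AFEW.
One         LT              235 (4.9)       3135 (65.2)      1437 (29.9)       4807    1 continent: 99% EW. 2 continents: 99% EWNA. 3 continents: 90% EWNAAF.
One         EA              4837 (75.7)     1419 (22.2)      133 (2.1)         6389    1 continent: 100% EA. 2 continents: 89% EAPI, 8% EAEW.
One         PI              7 (7.6)         40 (43.5)        45 (48.9)         92      2 continents: 48% PIEW, 33% PIEA. 3 continents: 71% PIEWEA.
One         SA              272 (59.1)      175 (38.1)       13 (2.8)          460     1 continent: 94% SA. 2 continents: 75% SAEW, 14% SAEA.
One         NA              555 (82.3)      87 (12.9)        32 (4.8)          674     1 continent: 100% EW. 2 continents: 77% EWNA.
One         All             78128 (85.4)    11506 (12.6)     1868 (2.0)        91502   93.9% of total.
Two         All             2791 (51.0)     2140 (39.1)      545 (10.0)        5476    5.6% of total. 1 continent: 97% EW. 2 continents: 60% EWNA, 22% EWEA, 13% EWAF. 3 continents: 29% EWNAAF, 23% EWSANA, 17% EWEANA.
Three       All             48 (9.4)        345 (67.5)       118 (23.1)        511     0.5% of total. 1 continent: 90% EW. 2 continents: 45% EWNA, 36% EWAF, 19% EWEA. 3 continents: 31% EWEANA, 20% EWAFNA, 16% EWSANA.
All         All             80967 (83.1)    13991 (14.4)     2531 (2.6)        97489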

Table S9 First degree relatives organized by self‐reported race/ethnicity

MZ pair

                    White    African American    Latino    Asian    Other/Uncertain
White               28       0                   0         0        0
African American    0        0                   0         0        0
Latino              0        0                   2         0        0
Asian               0        0                   0         4        0
Other/Uncertain     0        0                   0         0        0

Parent (column) – Offspring (row)

                    White    African American    Latino    Asian    Other/Uncertain
White               3044     1                   23        6        21
African American    8        54                  0         1        1
Latino              122      6                   175       3        0
Asian               35       2                   0         205      1
Other/Uncertain     32       0                   1         0        0

Full sibs

                    White    African American    Latino    Asian    Other/Uncertain
White               1531     2                   33        5        30
African American             45                  7         0        0
Latino                                           155       2        2
Asian                                                      205      1
Other/Uncertain                                                     0
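The concordance rates quoted for relative pairs can be checked against the counts in Table S9. The following minimal Python sketch assumes that a pair is concordant when both members fall in the same category of the table, so that concordant pairs lie on the diagonal; the variable and function names are hypothetical.

```python
# Rows = offspring, columns = parent, in the order:
# White, African American, Latino, Asian, Other/Uncertain.
parent_offspring = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

# Full-sib pairs are unordered, so only the upper triangle is tabulated;
# the first entry of each row is the diagonal (concordant) cell.
full_sibs_upper = [
    [1531, 2, 33, 5, 30],
    [45, 7, 0, 0],
    [155, 2, 2],
    [205, 1],
    [0],
]


def concordance_square(matrix):
    """Concordant fraction for a full square table (diagonal over total)."""
    total = sum(sum(row) for row in matrix)
    concordant = sum(matrix[i][i] for i in range(len(matrix)))
    return concordant / total


def concordance_upper(rows):
    """Concordant fraction for an upper-triangular table."""
    total = sum(sum(row) for row in rows)
    concordant = sum(row[0] for row in rows)
    return concordant / total


print(round(concordance_square(parent_offspring), 3))
print(round(concordance_upper(full_sibs_upper), 3))
```

The table totals are 3741 parent-child pairs and 2018 full-sib pairs, and the concordant fractions round to 93% and 96%, respectively.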

Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity    Sib 2 Race/Ethnicity    Genetic Ancestry    Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

Table S11 Race/Ethnicity and Genetic Ancestry for Parent‐Child pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity    Child Race/Ethnicity    Parent Genetic Ancestry    Child Genetic Ancestry    Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


to southern Europe and the Middle East. Within that large group, with the exception of Ashkenazi Jews, we see little evidence of distinct clusters. This is consistent with considerable exogamy within this group. By comparison, we do see structure in the East Asian population correlated with nationality, reflecting continuing endogamy for these nationalities and also recent immigration. On the other hand, we did observe a substantial number of individuals who are admixed between East Asian and European ancestry, reflecting 10% of all those reporting East Asian race/ethnicity. The majority of these reflected individuals with one East Asian and one European parent or one East Asian and three European grandparents. In addition, we noted that for self-reported Filipinos, a substantial proportion have modest levels of European genetic ancestry, reflecting older admixture.

As expected, most self-reported African Americans show some degree of European genetic ancestry, with an overall average of 26%. Among individuals self-reporting as African American and East Asian, all showed evidence of genetic ancestry from three continents: Africa, Europe/West Asia, and East Asia.

Latinos are the most complex from a genetic perspective, as they can possess genetic ancestry from essentially any of the major continents. Most of the Latinos in our study derive from Mexico and Central/South America, with smaller proportions from Puerto Rico and Cuba. These individuals have varying proportions of Native American, European, and African genetic ancestry. We also found evidence of East Asian genetic ancestry in some individuals, but these were primarily individuals who self-reported both East Asian and Latino nationalities.

Of note, 17% of the cohort had evidence of genetic ancestry from more than one continent. However, this does not mean that all or even most of these individuals represent recent continental admixture. As has been true in other analyses (Li et al. 2008), genetic similarity between West Asians and South Asians (and to some degree South Asians and East Asians) did not allow for a clear distinction among these genetic ancestries. As such, while some individuals were estimated to have South Asian genetic ancestry, this more likely reflects the difficulty in demarking West Asian vs. South Asian genetic ancestry. A similar situation holds for Pacific Islanders and East Asians, where we and others have shown strong genetic similarity for some Pacific Islander groups with East Asians. Also, some individuals may have reported more than a single race/ethnicity that may reflect recent country of origin in addition to, or rather than, more distant ancestry, with Indo-Fijians as one example.

If we include only individuals with genetic admixture from nonadjacent continents, the proportion with continental admixture is 12%. However, we also note that this fraction depends on our cutoff of 5% for defining genetic admixture, as well as some imprecision in the admixture estimation. Of course, a lower threshold would increase the proportion of the cohort that is considered to be genetically admixed, while a higher threshold would do the opposite.
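The dependence of the admixed fraction on the cutoff can be illustrated with a small Python sketch. The function names and the toy admixture proportions are hypothetical and are not GERA data. The sketch assigns a genetic ancestry to an individual when its estimated proportion is at least the cutoff, then counts individuals with more than one assigned ancestry. It does not model the distinction between adjacent and nonadjacent continents.

```python
from typing import Dict, FrozenSet, List


def assign_ancestries(proportions: Dict[str, float], cutoff: float = 0.05) -> FrozenSet[str]:
    """Return the ancestries whose estimated proportion is at least `cutoff`."""
    return frozenset(pop for pop, p in proportions.items() if p >= cutoff)


def fraction_admixed(cohort: List[Dict[str, float]], cutoff: float = 0.05) -> float:
    """Fraction of individuals assigned more than one ancestry at this cutoff."""
    admixed = sum(1 for ind in cohort if len(assign_ancestries(ind, cutoff)) > 1)
    return admixed / len(cohort)


# Toy admixture estimates (hypothetical values, for illustration only).
toy_cohort = [
    {"EW": 0.97, "AF": 0.00, "NA": 0.03},
    {"EW": 0.22, "AF": 0.78, "NA": 0.00},
    {"EW": 0.62, "AF": 0.04, "NA": 0.34},
    {"EW": 1.00, "AF": 0.00, "NA": 0.00},
]

for cutoff in (0.02, 0.05, 0.25):
    print(cutoff, fraction_admixed(toy_cohort, cutoff))
```

In this toy cohort the admixed fraction falls as the cutoff rises, which is the direction of the effect described in the text.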

As expected in a large cohort such as this, we were easily able to identify a substantial number of close relatives, specifically 34 identical twins, 2018 full sibs, and 3741 parent-child pairs. We also had clear evidence of a large number of likely second- and third-degree relatives, but these kinship groups did not separate clearly from each other. More refined methods may be able to provide more precise kinship estimates.

A major goal was to examine the relationship between self-reported race/ethnicity and genetic ancestry. By and large, there was very high correspondence between the two, allowing for the broad range of genetic ancestry that exists among African Americans and Latinos. We were also able to compare the self-report data of identical twins, parent–child pairs, and sib pairs. All MZ twin pairs were concordant, as were most of the sib pairs. However, we did note that for some sib pairs the self-report data differed. For the majority of these, the discordance related to reporting of Native American or Latino race/ethnicity.

The results obtained here are important for the study of complex genetic disease in this large population-based cohort through association studies, admixture analysis, and admixture mapping, and in particular for investigating observed ethnic variation in diseases and traits. As described previously (Risch et al. 2002; Tang et al. 2006), the strong correspondence also observed here between the social categories of race/ethnicity and genetic ancestry makes dissection of racial/ethnic differences challenging. The patterns that we observed reflect historical and recent mating practices and their impact on genetic variation. On a global level, geography continues to create strong local endogamy, which is also reflected among the recent U.S. migrant populations. However, the increasing frequency of interracial individuals that we observed in this cohort (a reflection of increasing exogamy) will enhance both the complexity of such analyses and the opportunities to investigate the genetic and environmental contributors to racial/ethnic differences. While the advent of myriad genetic markers can provide accurate estimates of individuals' genetic ancestry, the social aspects of race/ethnicity may be more challenging to characterize. For example, in our study, considering the various combinations of 7 race/ethnicity categories that an individual could

Table 3 Proportion of individuals with genetic ancestry from each of six ancestral populations by race/ethnicity, as determined by KP administrative databases

Genetic ancestry

Race/ethnicity n EW AF EA NA PI SA
White 4575 1 0.007 0.009 0.017 0.001 0.030
African American 102 0.941 0.990 0.000 0.020 0.000 0.020
Asian 311 0.106 0.003 0.952 0.006 0.167 0.074
Latino 255 0.988 0.192 0.043 0.816 0.000 0.035
Other/uncertain 84 0.929 0.131 0.357 0.167 0.071 0.083

Abbreviations are the same as in Table 2.


endorse, we observed 50 different combinations, and this does not include individuals who endorsed >3 (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health, and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment, and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans' genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch, N., E. Burchard, E. Ziv, and H. Tang, 2002 Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 3: comment2007.

Seldin, M. F., R. Shigeta, P. Villoslada, C. Selmi, J. Tuomilehto et al., 2006 European population substructure: clustering of northern and southern populations. PLoS Genet. 2: e143.

Sokal, R. R., N. L. Oden, and C. Wilson, 1991 Genetic evidence for the spread of agriculture in Europe by demic diffusion. Nature 351: 143–145.

Su, B., J. Xiao, P. Underhill, R. Deka, W. Zhang et al., 1999 Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65: 1718–1724.

Tang, H., J. Peng, P. Wang, and N. J. Risch, 2005 Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28: 289–301.


Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray


GENETICS

Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest‐southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A‐F) Individuals distributed according to continental differentiation. Admixed individuals also observed.

Figure S9 Identification of MZ twin and first‐degree relative pairs by KING. The x‐axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
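The x‐axis quantity in Figure S9 can be sketched in Python as follows. This is an illustration only and not the KING implementation: the function name is hypothetical, genotypes are assumed to be coded as 0/1/2 copies of one allele with `None` for missing calls, and a SNP counts as zero IBS when the two individuals are opposite homozygotes (e.g., TT and CC).

```python
from typing import List, Optional


def ibs0_proportion(g1: List[Optional[int]], g2: List[Optional[int]]) -> float:
    """Proportion of jointly called SNPs at which two individuals share no allele.

    Genotypes are allele counts (0, 1, or 2); None marks a missing call.
    """
    called = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not called:
        raise ValueError("no SNPs are called in both individuals")
    # Opposite homozygotes (0 vs 2) are the only zero-IBS configuration.
    zero_ibs = sum(1 for a, b in called if abs(a - b) == 2)
    return zero_ibs / len(called)


# Toy genotypes (hypothetical): a parent, a child, and an unrelated individual.
parent = [0, 1, 2, 2, 0, None]
child = [0, 1, 1, 2, 1, 2]
unrelated = [2, 1, 0, 2, 2, 0]

print(ibs0_proportion(parent, child))
print(ibs0_proportion(parent, unrelated))
```

Barring genotyping error, MZ twins and parent‐child pairs share at least one allele at every SNP, so their zero‐IBS proportion is expected to be close to zero, which is what lets such pairs separate along this axis.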


File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry and a smaller number (123) to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self‐report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self‐reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned, due to a black mark on the glass scanning plate overlaying the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error only affected individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry that were run on the EAS array had no scanning errors and hence were not adjudicated. Reviewing the race/ethnicity self‐report of these individuals, the large majority self‐reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self‐reported race/ethnicity and genotyping array for those with self‐report race/ethnicity data that was not missing or adjudicated due to scanning errors. Because of the time element required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey prior to data cleaning. After data cleaning, we noticed that some individuals were assigned to arrays based on the raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self‐report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re‐assignments were based on genetic information. The genetic information was only used in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self‐report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.

For the initial individual admixture analyses, a set of 43,988 SNPs, which were common between the HGDP dataset and our set of 144,799 high quality SNPs, was used. A set of 38,301 SNPs (used for the global PCA run) remaining after LD pruning and removal of SNPs located in regions with structural variations was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self‐reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self‐report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority, the second continent was South Asia. Hence, this likely does not reflect recent admixture but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self‐reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self‐reported Latinos have genetic ancestry from more than one continent: a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self‐reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self‐reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self‐reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self‐reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, for nearly all, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.


Table S1 Magnitude of correlation of PC loadings for three supersets

PC Set1‐Set2 correlation Set1‐Set3 correlation Set2‐Set3 correlation
1 0.981 0.979 0.980
2 0.876 0.853 0.856
3 0.850 0.816 0.809
4 0.766 0.776 0.752
5 0.658 0.654 0.639
6 0.605 0.604 0.592
7 0.001 0.001 0.210
8 0.043 0.179 0.045
9 0.312 0.068 0.089
10 0.007 0.067 0.201


Table S2 Self‐reported Race/Ethnicity versus Genotyping Array. Race/Ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro‐Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Race/Ethnicity Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272


Table S3 Race/Ethnicity derived from KP administrative databases versus genotyping array used for those with missing or mis‐scanned self‐report data

Race/Ethnicity Genotyping Array

EUR EAS AFR LAT

White 2115 267 2140 55

African American 0 0 99 3

Hispanic 5 0 0 255

Asian 5 183 123 1

OtherUncertain 6 11 43 24

Total 2131 461 2405 338


Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run Number of SNPs after pruning

European/West Asian and Ashkenazi 38050

European only 37967

South Asian 38823

East Asian 38077

Pacific Islander 38833

African American 39849

Latino 38365

All GERA 38301


Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

Nationality Ancestral admixture proportion (%)

Nationality African European Native American
Mexican 2.3 ± 5.0 67.0 ± 14.5 30.7 ± 13.5
Central‐South American 5.5 ± 10.8 65.4 ± 17.9 29.1 ± 16.3
Puerto‐Rican 12.3 ± 14.1 75.5 ± 14.8 12.4 ± 7.6
Cuban 12.7 ± 20.5 79.7 ± 20.0 7.6 ± 8.1
LAT mean 4.4 ± 7.9 66.5 ± 15.8 28.6 ± 14.8


Table S6 Distribution of genetic ancestry by self‐reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub‐Saharan African.

Race/Ethnicity Genetic Ancestry

EW AF EA NA SA EW/AF EW/EA EW/NA EW/SA AF/EA AF/SA EA/NA EA/PI EA/SA PI/SA EW/AF/EA EW/AF/NA EW/AF/SA EW/EA/NA EW/EA/PI EW/EA/SA EW/NA/SA EW/PI/SA AF/EA/PI AF/EA/SA AF/PI/SA EA/PI/SA All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489


Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records for those with missing or mis‐scanned self‐report race/ethnicity. Abbreviations as in Table S6.

Race/Ethnicity EW AF EA SA EW/AF EA/NA EA/PI EW/EA EW/NA EW/SA EA/SA SA/PI EW/AF/EA EW/AF/NA EW/AF/SA EW/EA/NA EW/EA/PI EW/EA/SA EW/SA/NA EW/SA/PI EA/SA/PI Total
White 4302 0 0 0 25 0 0 33 68 131 0 0 0 3 2 3 2 3 2 1 0 4575
Afr Am 0 6 0 0 92 0 0 0 1 0 0 0 0 1 2 0 0 0 0 0 0 102
Asian 4 0 218 8 0 1 41 18 0 2 3 1 1 0 0 1 4 3 0 0 6 311
Latino 37 0 3 0 3 0 0 2 151 0 0 0 1 44 1 5 0 0 8 0 0 255
Other‐uncertain 34 0 3 0 6 0 0 15 8 2 1 0 2 2 1 3 4 0 1 0 2 84


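Table S8 Distribution of continental genetic ancestry as a function of self‐reported race/ethnicity

Each row gives the number of race/ethnicity categories reported, the race/ethnicity, the number (%) of individuals with genetic ancestry from one, two, and three continents, the total, and the continental distribution among those with 1, 2, and 3 continents.

One EW: one continent 71992 (94.2%); two continents 4288 (5.6%); three continents 121 (0.2%); all 76401. 1 continent: 100% EW. 2 continents: 75% EW/SA, 15% EW/NA.
One AA: one continent 230 (8.6%); two continents 2362 (88.2%); three continents 87 (3.2%); all 2679. 1 continent: 100% AF. 2 continents: 99% AF/EW.
One LT: one continent 235 (4.9%); two continents 3135 (65.2%); three continents 1437 (29.9%); all 4807. 1 continent: 99% EW. 2 continents: 99% EW/NA. 3 continents: 90% EW/NA/AF.
One EA: one continent 4837 (75.7%); two continents 1419 (22.2%); three continents 133 (2.1%); all 6389. 1 continent: 100% EA. 2 continents: 89% EA/PI, 8% EA/EW.
One PI: one continent 7 (7.6%); two continents 40 (43.5%); three continents 45 (48.9%); all 92. 2 continents: 48% PI/EW, 33% PI/EA. 3 continents: 71% PI/EW/EA.
One SA: one continent 272 (59.1%); two continents 175 (38.1%); three continents 13 (2.8%); all 460. 1 continent: 94% SA. 2 continents: 75% SA/EW, 14% SA/EA.
One NA: one continent 555 (82.3%); two continents 87 (12.9%); three continents 32 (4.8%); all 674. 1 continent: 100% EW. 2 continents: 77% EW/NA.
One All: one continent 78128 (85.4%); two continents 11506 (12.6%); three continents 1868 (2.0%); all 91502 (93.9% of total).
Two All: one continent 2791 (51.0%); two continents 2140 (39.1%); three continents 545 (10.0%); all 5476 (5.6% of total). 1 continent: 97% EW. 2 continents: 60% EW/NA, 22% EW/EA, 13% EW/AF. 3 continents: 29% EW/NA/AF, 23% EW/SA/NA, 17% EW/EA/NA.
Three All: one continent 48 (9.4%); two continents 345 (67.5%); three continents 118 (23.1%); all 511 (0.5% of total). 1 continent: 90% EW. 2 continents: 45% EW/NA, 36% EW/AF, 19% EW/EA. 3 continents: 31% EW/EA/NA, 20% EW/AF/NA, 16% EW/SA/NA.
All All: one continent 80967 (83.1%); two continents 13991 (14.4%); three continents 2531 (2.6%); all 97489.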


Table S9 First‐degree relatives organized by self‐reported race/ethnicity

MZ pair

White African American Latino Asian OtherUncertain

White 28 0 0 0 0

African American 0 0 0 0 0

Latino 0 0 2 0 0

Asian 0 0 0 4 0

OtherUncertain 0 0 0 0 0

Parent (column) – Offspring (row)

White African American Latino Asian OtherUncertain

White 3044 1 23 6 21

African American 8 54 0 1 1

Latino 122 6 175 3 0

Asian 35 2 0 205 1

OtherUncertain 32 0 1 0 0

Full sibs

White African American Latino Asian OtherUncertain

White 1531 2 33 5 30

African American - 45 7 0 0

Latino - - 155 2 2

Asian - - - 205 1

OtherUncertain - - - - 0


Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity Sib 2 Race/Ethnicity Genetic Ancestry Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

Table S11 Race/Ethnicity and Genetic Ancestry for Parent‐Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity   Child Race/Ethnicity   Parent Genetic Ancestry   Child Genetic Ancestry   Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


endorse, we observed 50 different combinations, and this does not include individuals who endorsed more than 3 (although they were few in number). While overall 6% of the cohort endorsed more than a single category, that number is likely to grow as mating patterns continue to evolve.

Acknowledgments

We thank the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health, and Judith Millar for her assistance in preparing the manuscript for publication. This work was supported by grants RC2 AG036607 and R01 GM073059 from the National Institutes of Health and by a postdoctoral fellowship from the Lamond Family Foundation. The development of the Research Program on Genes, Environment, and Health, including enrollment and consent of participants and collection of surveys and saliva samples, was supported by grants from the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and Kaiser Permanente Community Benefit Programs. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Kvale et al. 2015 (pp. 1051–1060) and Lapham et al. 2015 (pp. 1061–1072) in this issue for related works.

Literature Cited

Barbujani, G., and G. Bertorelle, 2001 Genetics and the population history of Europe. Proc. Natl. Acad. Sci. USA 98: 22–25.

Bauchet, M., B. McEvoy, L. N. Pearson, E. E. Quillen, T. Sarkisian et al., 2007 Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80: 948–956.

Belle, E. M., P. A. Landry, and G. Barbujani, 2006 Origins and evolution of the Europeans' genome: evidence from multiple microsatellite loci. Proc. Biol. Sci. 273: 1595–1602.

Bonilla, C., E. J. Parra, C. L. Pfaff, S. Dios, J. A. Marshall et al., 2004 Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann. Hum. Genet. 68: 139–153.

Burchard, E. G., E. Ziv, N. Coyle, S. L. Gomez, H. Tang et al., 2003 The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.

Cavalli-Sforza, L. L., 2005 The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6: 333–340.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1993 Demic expansions and human evolution. Science 259: 639–646.

Cavalli-Sforza, L. L., P. Menozzi, and A. Piazza, 1996 The History and Geography of Human Genes. Princeton University Press, Princeton, NJ, pp. xiii and 413.

Cooper, R. S., J. S. Kaufman, and R. Ward, 2003 Race and genomics. N. Engl. J. Med. 348: 1166–1170.

Fernandez, J. R., M. D. Shriver, T. M. Beasley, N. Rafla-Demetrious, E. Parra et al., 2003 Association of African genetic admixture with resting metabolic rate and obesity among women. Obes. Res. 11: 904–911.

Hodgson, J. A., C. J. Mulligan, A. Al-Meeri, and R. L. Raaum, 2014 Early back-to-Africa migration into the Horn of Africa. PLoS Genet. 10: e1004393.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

HUGO Pan-Asian SNP Consortium, M. A. Abdulla, I. Ahmed, A. Assawamakin, J. Bhak, S. K. Brahmachari et al., 2009 Mapping human genetic diversity in Asia. Science 326: 1541–1545.

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere et al., 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.

Krieger, N., D. L. Rowley, A. A. Herman, B. Avery, and M. T. Phillips, 1993 Racism, sexism, and social class: implications for studies of health, disease, and well-being. Am. J. Prev. Med. 9: 82–122.

Kvale, M. N., S. E. Hesselson, T. J. Hoffmann, Y. Cao, D. Chan et al., 2015 Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. DOI: 10.1534/genetics.115.178905.

Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.

Manichaikul, A., J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale et al., 2010 Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Menozzi, P., A. Piazza, and L. Cavalli-Sforza, 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792.

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko et al., 2008 Genes mirror geography within Europe. Nature 456: 98–101.

Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Price, A. L., J. Butler, N. Patterson, C. Capelli, V. L. Pascali et al., 2008 Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4: e236.

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 2009 Reconstructing Indian population history. Nature 461: 489–494.

Risch, N., E. Burchard, E. Ziv, and H. Tang, 2002 Categorization of humans in biomedical research: genes, race and disease. Genome Biol. 3: comment2007.

Seldin, M. F., R. Shigeta, P. Villoslada, C. Selmi, J. Tuomilehto et al., 2006 European population substructure: clustering of northern and southern populations. PLoS Genet. 2: e143.

Sokal, R. R., N. L. Oden, and C. Wilson, 1991 Genetic evidence for the spread of agriculture in Europe by demic diffusion. Nature 351: 143–145.

Su, B., J. Xiao, P. Underhill, R. Deka, W. Zhang et al., 1999 Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65: 1718–1724.

Tang, H., J. Peng, P. Wang, and N. J. Risch, 2005 Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28: 289–301.

Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624–633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626–633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143–R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray

GENETICS

Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest‐southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of Indian subgroups from different Indian regions, identified by onomastics, are also observed.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A–F) Individuals distributed according to continental differentiation. Admixed individuals are also observed.

Figure S9 Identification of MZ twin and first‐degree relative pairs by KING. The x‐axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
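As a rough illustration of the x‐axis quantity in Figure S9, the Python sketch below computes the proportion of SNPs at which two individuals share zero alleles identical by state. It is not the KING implementation; the genotype coding (allele counts 0, 1, 2, with None for a missing call) and the function name are assumptions made for the example.

```python
def ibs0_proportion(geno_a, geno_b):
    """Fraction of SNPs at which two individuals share zero alleles IBS.

    Genotypes are coded as counts of one allele (0, 1, or 2); None marks
    a missing call. Zero IBS means opposite homozygotes (0 versus 2),
    for example TT and CC.
    """
    compared = 0
    zero_ibs = 0
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue  # skip SNPs not called in both individuals
        compared += 1
        if abs(a - b) == 2:
            zero_ibs += 1
    if compared == 0:
        raise ValueError("no SNPs with calls in both individuals")
    return zero_ibs / compared


# Toy genotypes (hypothetical data, six SNPs).
parent = [0, 1, 2, 1, 0, 2]
child = [1, 1, 1, 0, 0, 2]          # always shares at least one allele
other = [2, 1, 0, None, 0, 2]       # opposite homozygote at two SNPs

print(ibs0_proportion(parent, child))  # 0.0
print(ibs0_proportion(parent, other))  # 0.4
```

Barring genotyping error, MZ twins and parent‐child pairs share at least one allele at every SNP, so their zero‐IBS proportion is expected to be close to zero.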

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry and a smaller number (123) to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self‐report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self‐reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned, due to a black mark on the glass scanning plate overlaying the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error only affected individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry that were run on the EAS array had no scanning errors and hence were not adjudicated. Reviewing the race/ethnicity self‐report of these individuals, the large majority self‐reported both East Asian and European race/ethnicity.
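The kind of screen that surfaces such cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used for GERA: the array‐to‐ancestry mapping, the 1% cutoff, the record layout, and the function name are all hypothetical. A flag here only prompts a look at the original paper form; it does not reassign anyone's race/ethnicity.

```python
# Assumed mapping from genotyping array to the continental ancestry that
# assignment to that array presumes (hypothetical, for illustration only).
ARRAY_TARGET = {"AFR": "AF", "EAS": "EA"}


def flag_for_form_review(subjects, max_target_fraction=0.01):
    """Return IDs of subjects whose estimated admixture shows essentially
    none of the ancestry presumed by their genotyping array."""
    flagged = []
    for subject in subjects:
        target = ARRAY_TARGET.get(subject["array"])
        if target is None:
            continue  # EUR and LAT arrays are not screened in this sketch
        if subject["admixture"].get(target, 0.0) <= max_target_fraction:
            flagged.append(subject["id"])
    return flagged


# Toy records (hypothetical): admixture maps ancestry code to proportion.
subjects = [
    {"id": "s1", "array": "AFR", "admixture": {"EW": 1.00}},
    {"id": "s2", "array": "AFR", "admixture": {"AF": 0.80, "EW": 0.20}},
    {"id": "s3", "array": "AFR", "admixture": {"EA": 1.00}},
    {"id": "s4", "array": "EAS", "admixture": {"EW": 1.00}},
    {"id": "s5", "array": "EUR", "admixture": {"EW": 1.00}},
]
print(flag_for_form_review(subjects))  # ['s1', 's3', 's4']
```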

Table S2 displays the relationship between self‐reported race/ethnicity and genotyping array for those with self‐report race/ethnicity data that was not missing or adjudicated due to scanning errors. Because of the time element required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals were assigned to arrays based on the raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self‐report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re‐assignments were based on genetic information. The genetic information was only used in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self‐report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
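The pairwise pruning rule above can be sketched as a greedy pass over SNPs ordered by position. The Python below is a minimal illustration, not the software used in the study: it assumes r² is computed from genotype allele counts (0, 1, 2), that the later SNP of an offending pair is the one dropped, and that each SNP is a hypothetical tuple (snp_id, chrom, pos, genotypes).

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype-count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(snps, r2_max=0.5, window_bp=5_000_000):
    """Greedy pruning: keep a SNP unless it has r^2 > r2_max with an
    already kept SNP on the same chromosome within window_bp."""
    kept = []
    for snp in sorted(snps, key=lambda s: (s[1], s[2])):
        _, chrom, pos, geno = snp
        in_ld = any(
            k[1] == chrom
            and pos - k[2] <= window_bp
            and r_squared(geno, k[3]) > r2_max
            for k in kept
        )
        if not in_ld:
            kept.append(snp)
    return [s[0] for s in kept]


# Toy data (hypothetical): snpB duplicates snpA 0.5 Mb away and is pruned;
# snpD also duplicates snpA but lies 8 Mb away, outside the window.
g = [0, 1, 2, 0, 1, 2]
snps = [
    ("snpA", 2, 1_000_000, g),
    ("snpB", 2, 1_500_000, list(g)),
    ("snpC", 2, 2_000_000, [0, 0, 1, 1, 2, 2]),
    ("snpD", 2, 9_000_000, list(g)),
]
print(ld_prune(snps))  # ['snpA', 'snpC', 'snpD']
```

Removing SNPs in the inversion regions (8p23 and 17q21) would be a separate filter on chromosome and position, applied before or after this step.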

For the initial individual admixture analyses, a set of 43,988 SNPs, which were common between the HGDP dataset and our set of 144,799 high quality SNPs, was used. A set of 38,301 SNPs (used for the global PCA run) remaining after LD pruning and removal of SNPs located in regions with structural variations was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self‐reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self‐report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority the second continent was South Asia. Hence, this likely does not reflect recent admixture, but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self‐reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self‐reported Latinos have genetic ancestry from more than one continent; a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self‐reported East Asians have genetic ancestry that is solely East Asian, or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self‐reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self‐reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self‐reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, for nearly all European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories; the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.
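The continent counts discussed above follow from the assignment rule stated in the caption of Table S6: a genetic ancestry is assigned to an individual if at least 5% of that individual's ancestry is estimated from that group. A minimal Python sketch of that rule is given below; the function name, the input layout, and the ordering of codes within a combined label are assumptions made for the example.

```python
# Ancestry codes as used in the tables; the ordering inside a combined
# label is an assumption for display purposes.
GROUP_ORDER = ["EW", "AF", "EA", "NA", "PI", "SA"]


def ancestry_label(proportions, threshold=0.05):
    """Return (label, number of continental ancestries) for one individual.

    proportions maps an ancestry code to its estimated admixture fraction.
    An ancestry counts as present when its fraction is at least threshold.
    """
    present = [g for g in GROUP_ORDER if proportions.get(g, 0.0) >= threshold]
    return "".join(present), len(present)


# Toy admixture estimates (hypothetical individuals).
print(ancestry_label({"EW": 0.97, "SA": 0.03}))              # ('EW', 1)
print(ancestry_label({"EW": 0.62, "NA": 0.38}))              # ('EWNA', 2)
print(ancestry_label({"EW": 0.55, "NA": 0.35, "AF": 0.10}))  # ('EWAFNA', 3)
```

Tallying the second element of the result within each self‐reported group gives the one‐, two‐, and three‐continent counts of the kind reported in Table S8.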

Table S1 Magnitude of correlation of PC loadings for three supersets

PC    Set1‐Set2 correlation   Set1‐Set3 correlation   Set2‐Set3 correlation
1     0.981                   0.979                   0.980
2     0.876                   0.853                   0.856
3     0.850                   0.816                   0.809
4     0.766                   0.776                   0.752
5     0.658                   0.654                   0.639
6     0.605                   0.604                   0.592
7     0.001                   0.001                   0.210
8     0.043                   0.179                   0.045
9     0.312                   0.068                   0.089
10    0.007                   0.067                   0.201
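Table S1 reports magnitudes rather than signed correlations, presumably because the sign of a principal component is arbitrary: a PC and its negation describe the same axis. The Python sketch below shows one way such a magnitude could be computed for two loading vectors defined over the same items; how the loadings from different supersets were matched up is not described in this section, so the inputs here are purely hypothetical.

```python
import math


def abs_correlation(u, v):
    """Magnitude of the Pearson correlation between two loading vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return abs(suv) / math.sqrt(suu * svv)


# Hypothetical loadings: the second vector is the first with its sign
# flipped, so the signed correlation is -1 but the magnitude is 1.
loadings_set1 = [0.10, -0.20, 0.30, 0.40]
loadings_set2 = [-x for x in loadings_set1]
print(round(abs_correlation(loadings_set1, loadings_set2), 6))  # 1.0
```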

Table S2 Self‐reported Race/Ethnicity versus Genotyping Array. Race/Ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro‐Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Genotyping Array

Race/Ethnicity EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

Table S3 Race/Ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis‐scanned self‐report data

                   Genotyping Array
Race/Ethnicity     EUR     EAS     AFR     LAT
White              2115    267     2140    55
African American   0       0       99      3
Hispanic           5       0       0       255
Asian              5       183     123     1
Other/Uncertain    6       11      43      24
Total              2131    461     2405    338

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                             Number of SNPs after pruning
European/West Asian and Ashkenazi   38,050
European only                       37,967
South Asian                         38,823
East Asian                          38,077
Pacific Islander                    38,833
African American                    39,849
Latino                              38,365
All GERA                            38,301

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

                         Ancestral admixture proportion (%)
Nationality              African        European       Native American
Mexican                  2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central‐South American   5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto‐Rican             12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                    12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                 4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8

Table S6 Distribution of genetic ancestry by self‐reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub‐Saharan African.

Rows: Race/Ethnicity. Columns: Genetic Ancestry, in the following order:

EW, AF, EA, NA, SA, EWAF, EWEA, EWNA, EWSA, AFEA, AFSA, EANA, EAPI, EASA, PISA, EWAFEA, EWAFNA, EWAFSA, EWEANA, EWEAPI, EWEASA, EWNASA, EWPISA, AFEAPI, AFEASA, AFPISA, EAPISA, All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489

Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis‐scanned self‐report race/ethnicity. Abbreviations as in Table S6.

Rows: Genetic Ancestry. Columns: Race‐Ethnicity.

Genetic Ancestry   White   Afr Am   Asian   Latino   Other‐uncertain
EW                 4302    0        4       37       34
AF                 0       6        0       0        0
EA                 0       0        218     3        3
SA                 0       0        8       0        0
EWAF               25      92       0       3        6
EANA               0       0        1       0        0
EAPI               0       0        41      0        0
EWEA               33      0        18      2        15
EWNA               68      1        0       151      8
EWSA               131     0        2       0        2
EASA               0       0        3       0        1
SAPI               0       0        1       0        0
EWAFEA             0       0        1       1        2
EWAFNA             3       1        0       44       2
EWAFSA             2       2        0       1        1
EWEANA             3       0        1       5        3
EWEAPI             2       0        4       0        4
EWEASA             3       0        3       0        0
EWSANA             2       0        0       8        1
EWSAPI             1       0        0       0        0
EASAPI             0       0        6       0        2
Total              4575    102      311     255      84


Page 11: Characterizing Race/Ethnicity and Genetic Ancestry … › ... › genetics › 200 › 4 › 1285.full.pdfSelf-reported African Americans and Latinos show ed extensive European and

Tang, H., E. Jorgenson, M. Gadde, S. L. Kardia, D. C. Rao et al., 2006 Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum. Genet. 119: 624-633.

Tang, H., S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron et al., 2007 Recent genetic selection in the ancestral admixture of Puerto Ricans. Am. J. Hum. Genet. 81: 626-633.

Tian, C., P. K. Gregersen, and M. F. Seldin, 2008a Accounting for ancestry: population substructure and genome-wide association studies. Hum. Mol. Genet. 17: R143-R150.

Tian, C., R. Kosoy, A. Lee, M. Ransom, J. W. Belmont et al., 2008b Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862.

Tian, C., R. M. Plenge, M. Ransom, A. Lee, P. Villoslada et al., 2008c Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4: e4.

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035-1044.

Wellcome Trust Case Consortium, 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-678.

Zakharia, F., A. Basu, D. Absher, T. L. Assimes, A. S. Go et al., 2009 Characterizing the admixed African ancestry of African Americans. Genome Biol. 10: R141.

Communicating editor: N. R. Wray

GENETICS Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178616/-/DC1

Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Yambazi Banda, Mark N. Kvale, Thomas J. Hoffmann, Stephanie E. Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A. Croen, Brad P. Dispensa, Mary Henderson, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Dana Ludwig, Diane Olberg, Charles P. Quesenberry Jr., Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Stanley Sciortino, Ling Shen, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Rachel A. Whitmer, Pui-Yan Kwok, Catherine Schaefer, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178616

Figure S1 PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest-southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3 PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of Indian subgroups from different regions of India, identified by onomastics, are also apparent.

Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A-F) Individuals distributed according to continental differentiation. Admixed individuals are also observed.

Figure S9 Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
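The zero-IBS proportion plotted on the x-axis of Figure S9 can be illustrated with a short sketch. It assumes genotypes coded as minor-allele counts (0, 1, 2) with `None` for a missing call; the function name `ibs0_proportion` and the toy genotypes are hypothetical. This is not the KING implementation, which also estimates a kinship coefficient.

```python
def ibs0_proportion(g1, g2):
    """Proportion of jointly called SNPs at which two people share no allele.

    Genotypes are minor-allele counts (0, 1, 2); None marks a missing call.
    A SNP is IBS0 when one person is 0 and the other is 2 (e.g. TT vs CC).
    """
    called = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not called:
        raise ValueError("no SNPs called in both individuals")
    ibs0 = sum(1 for a, b in called if abs(a - b) == 2)
    return ibs0 / len(called)


# Toy data (hypothetical): a parent and child share an allele at every SNP.
parent = [0, 1, 2, 1, 0, 2, 1, None]
child = [1, 1, 1, 2, 0, 2, 0, 1]
unrelated = [2, 1, 0, 1, 2, 2, 1, 0]

print(ibs0_proportion(parent, child))      # 0.0
print(ibs0_proportion(parent, unrelated))  # 3 of the 7 jointly called SNPs
```

Barring genotyping error, MZ twins and parent-child pairs should show essentially no zero-IBS SNPs, whereas full sibs can show some, which is what makes this axis useful for separating relationship types.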

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) estimated to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that, for some individuals, the self-reported race/ethnicity information on the form was not consistent with the computerized information. Further investigation revealed that an artifact had occurred when the forms were originally scanned: a black mark on the glass scanning plate overlay the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when, according to the original form, they did not. Because these individuals were assumed to have some African ancestry, our algorithm for array assignment placed them on the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented as necessary by other data on race/ethnicity in the KP administrative databases, and principal component scores were recalculated. This error affected only individuals whose computerized scan originally recorded African American race/ethnicity. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. On review, the large majority of these individuals had self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those whose self-report race/ethnicity data were neither missing nor adjudicated because of scanning errors. Because of the time required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals had been assigned to arrays based on raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. This group includes the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was used only to detect the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were identified, and one member of each pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight PCA runs are shown in Table S4.
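The pairwise pruning rule described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline. It assumes genotypes coded 0/1/2 with no missing values, and it keeps the earlier SNP of each offending pair, since the text says only that one member is removed. The names `r_squared` and `prune_ld` and the toy data are hypothetical. The defaults follow the thresholds stated in the text (r² above 0.5, within 5 Mb).

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic SNP: correlation undefined
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def prune_ld(snps, r2_max=0.5, window_bp=5_000_000):
    """Greedy pruning: drop the later SNP of any pair on the same chromosome,
    within window_bp, whose r^2 exceeds r2_max.

    snps: list of (name, chrom, position_bp, genotypes), sorted by chrom and position.
    """
    kept = []
    for name, chrom, pos, geno in snps:
        in_ld = any(
            k_chrom == chrom
            and pos - k_pos <= window_bp
            and r_squared(geno, k_geno) > r2_max
            for _, k_chrom, k_pos, k_geno in kept
        )
        if not in_ld:
            kept.append((name, chrom, pos, geno))
    return [name for name, _, _, _ in kept]


# Toy data (hypothetical): snpB is a near copy of snpA; snpD repeats snpA
# but lies more than 5 Mb from every kept SNP, so it is retained.
snps = [
    ("snpA", 2, 1_000_000, [0, 1, 2, 1, 0, 2, 1, 0]),
    ("snpB", 2, 1_200_000, [0, 1, 2, 1, 0, 2, 1, 1]),
    ("snpC", 2, 1_500_000, [2, 0, 1, 1, 0, 2, 0, 1]),
    ("snpD", 2, 9_000_000, [0, 1, 2, 1, 0, 2, 1, 0]),
]
print(prune_ld(snps))  # ['snpA', 'snpC', 'snpD']
```

SNPs in the inversion regions would be excluded by position before this step; that filter is omitted here.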

For the initial individual admixture analyses, we used a set of 43,988 SNPs that were common to the HGDP dataset and our set of 144,799 high-quality SNPs. A set of 38,301 SNPs (those used for the global PCA run), remaining after LD pruning and removal of SNPs located in regions with structural variations, was also used for admixture analyses. We found minimal difference between the admixture estimates obtained from the two analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority the second continent was South Asia. Hence this likely reflects not recent admixture but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent: a substantial proportion (29.9%) have genetic ancestry from three continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for nearly all of those with genetic ancestry from more than one continent, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting three race/ethnicity categories: for the large majority with a single continental genetic ancestry, that ancestry is European/West Asian. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.
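The continental counts discussed in this section rest on a threshold rule. According to the Table S6 legend, a genetic ancestry is assigned to an individual when at least 5% of that individual's ancestry is estimated from that group. A minimal sketch of the rule follows. The function name, the dictionary layout, and the example proportions are hypothetical, and the admixture estimation itself is not shown.

```python
def assign_ancestries(proportions, threshold=0.05):
    """Return the sorted ancestry labels whose estimated share meets the threshold."""
    return sorted(label for label, share in proportions.items() if share >= threshold)


# Hypothetical admixture estimates (fractions summing to 1).
individuals = {
    "id1": {"EW": 0.97, "NA": 0.02, "AF": 0.01},
    "id2": {"EW": 0.62, "NA": 0.31, "AF": 0.07},
    "id3": {"AF": 0.80, "EW": 0.20},
}

for ident, props in individuals.items():
    labels = assign_ancestries(props)
    # Print the combined ancestry label and the number of continents it spans.
    print(ident, "/".join(labels), len(labels))
```

Under this rule the first example individual counts as having a single continental ancestry, even though small Native American and African components were estimated, because both fall below the threshold.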

Table S1 Magnitude of correlation of PC loadings for three supersets

PC Set1-Set2 correlation Set1-Set3 correlation Set2-Set3 correlation
1 0.981 0.979 0.980
2 0.876 0.853 0.856
3 0.850 0.816 0.809
4 0.766 0.776 0.752
5 0.658 0.654 0.639
6 0.605 0.604 0.592
7 0.001 0.001 0.210
8 0.043 0.179 0.045
9 0.312 0.068 0.089
10 0.007 0.067 0.201
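A comparison of the kind summarized in Table S1 can be sketched as below. The sign of a principal component is arbitrary, so the same axis estimated in two subject sets can come out with opposite signs; taking the absolute value of the correlation between loading vectors removes that ambiguity, which is presumably why the table reports magnitudes. The function name and the loading values are hypothetical, and how the supersets were constructed is not shown.

```python
from math import sqrt


def abs_correlation(u, v):
    """Absolute Pearson correlation between two loading vectors of equal length."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return abs(cov / (su * sv))


# Hypothetical SNP loadings for the same PC estimated in two subject sets.
set1 = [0.12, -0.30, 0.05, 0.22, -0.18, 0.09]
set2 = [-0.10, 0.28, -0.07, -0.25, 0.15, -0.11]  # same axis, opposite sign

print(round(abs_correlation(set1, set2), 3))
```

A magnitude near 1 indicates that the two sets recover the same axis; values near 0, as for PCs 7 to 10 in Table S1, indicate that the component is not reproduced across sets.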

Table S2 Self-reported Race/Ethnicity versus Genotyping Array. Race/Ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Race/Ethnicity EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

Table S3 Race/Ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis-scanned self-report data

Race/Ethnicity EUR EAS AFR LAT
White 2115 267 2140 55
African American 0 0 99 3
Hispanic 5 0 0 255
Asian 5 183 123 1
Other/Uncertain 6 11 43 24
Total 2131 461 2405 338

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run Number of SNPs after pruning

European/West Asian and Ashkenazi 38050

European only 37967

South Asian 38823

East Asian 38077

Pacific Islander 38833

African American 39849

Latino 38365

All GERA 38301

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

Nationality African European Native American (ancestral admixture proportion, %)
Mexican 2.3 ± 5.0 67.0 ± 14.5 30.7 ± 13.5
Central-South American 5.5 ± 10.8 65.4 ± 17.9 29.1 ± 16.3
Puerto-Rican 12.3 ± 14.1 75.5 ± 14.8 12.4 ± 7.6
Cuban 12.7 ± 20.5 79.7 ± 20.0 7.6 ± 8.1
LAT mean 4.4 ± 7.9 66.5 ± 15.8 28.6 ± 14.8

Table S6 Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Race/Ethnicity, followed by genetic ancestry columns: EW, AF, EA, NA, SA, EW/AF, EW/EA, EW/NA, EW/SA, AF/EA, AF/SA, EA/NA, EA/PI, EA/SA, PI/SA, EW/AF/EA, EW/AF/NA, EW/AF/SA, EW/EA/NA, EW/EA/PI, EW/EA/SA, EW/NA/SA, EW/PI/SA, AF/EA/PI, AF/EA/SA, AF/PI/SA, EA/PI/SA, All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114

AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8

EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489

Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race/Ethnicity EW AF EA SA EW/AF EA/NA EA/PI EW/EA EW/NA EW/SA EA/SA SA/PI EW/AF/EA EW/AF/NA EW/AF/SA EW/EA/NA EW/EA/PI EW/EA/SA EW/SA/NA EW/SA/PI EA/SA/PI Total
White 4302 0 0 0 25 0 0 33 68 131 0 0 0 3 2 3 2 3 2 1 0 4575
Afr Am 0 6 0 0 92 0 0 0 1 0 0 0 0 1 2 0 0 0 0 0 0 102
Asian 4 0 218 8 0 1 41 18 0 2 3 1 1 0 0 1 4 3 0 0 6 311
Latino 37 0 3 0 3 0 0 2 151 0 0 0 1 44 1 5 0 0 8 0 0 255
Other-uncertain 34 0 3 0 6 0 0 15 8 2 1 0 2 2 1 3 4 0 1 0 2 84

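Table S8 Distribution of continental genetic ancestry as a function of self-reported race/ethnicity. Entries give the number (%) of individuals with genetic ancestry from one, two, and three continents, the row total, and the continental distribution within each of those classes.

One race/ethnicity reported
EW: one continent 71992 (94.2%); two continents 4288 (5.6%); three continents 121 (0.2%); all 76401. Continental distribution: 1 continent, 100% EW; 2 continents, 75% EW/SA, 15% EW/NA.
AA: one continent 230 (8.6%); two continents 2362 (88.2%); three continents 87 (3.2%); all 2679. Continental distribution: 1 continent, 100% AF; 2 continents, 99% AF/EW.
LT: one continent 235 (4.9%); two continents 3135 (65.2%); three continents 1437 (29.9%); all 4807. Continental distribution: 1 continent, 99% EW; 2 continents, 99% EW/NA; 3 continents, 90% EW/NA/AF.
EA: one continent 4837 (75.7%); two continents 1419 (22.2%); three continents 133 (2.1%); all 6389. Continental distribution: 1 continent, 100% EA; 2 continents, 89% EA/PI, 8% EA/EW.
PI: one continent 7 (7.6%); two continents 40 (43.5%); three continents 45 (48.9%); all 92. Continental distribution: 2 continents, 48% PI/EW, 33% PI/EA; 3 continents, 71% PI/EW/EA.
SA: one continent 272 (59.1%); two continents 175 (38.1%); three continents 13 (2.8%); all 460. Continental distribution: 1 continent, 94% SA; 2 continents, 75% SA/EW, 14% SA/EA.
NA: one continent 555 (82.3%); two continents 87 (12.9%); three continents 32 (4.8%); all 674. Continental distribution: 1 continent, 100% EW; 2 continents, 77% EW/NA.
All: one continent 78128 (85.4%); two continents 11506 (12.6%); three continents 1868 (2.0%); all 91502 (93.9% of total).

Two race/ethnicities reported
All: one continent 2791 (51.0%); two continents 2140 (39.1%); three continents 545 (10.0%); all 5476 (5.6% of total). Continental distribution: 1 continent, 97% EW; 2 continents, 60% EW/NA, 22% EW/EA, 13% EW/AF; 3 continents, 29% EW/NA/AF, 23% EW/SA/NA, 17% EW/EA/NA.

Three race/ethnicities reported
All: one continent 48 (9.4%); two continents 345 (67.5%); three continents 118 (23.1%); all 511 (0.5% of total). Continental distribution: 1 continent, 90% EW; 2 continents, 45% EW/NA, 36% EW/AF, 19% EW/EA; 3 continents, 31% EW/EA/NA, 20% EW/AF/NA, 16% EW/SA/NA.

All
All: one continent 80967 (83.1%); two continents 13991 (14.4%); three continents 2531 (2.6%); all 97489.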

Table S9 First-degree relatives organized by self-reported race/ethnicity

MZ pairs

White African American Latino Asian Other/Uncertain
White 28 0 0 0 0
African American 0 0 0 0 0
Latino 0 0 2 0 0
Asian 0 0 0 4 0
Other/Uncertain 0 0 0 0 0

Parent (column) - Offspring (row)

White African American Latino Asian Other/Uncertain
White 3044 1 23 6 21
African American 8 54 0 1 1
Latino 122 6 175 3 0
Asian 35 2 0 205 1
Other/Uncertain 32 0 1 0 0

Full sibs (cells below the diagonal are not reported and are shown as "-")

White African American Latino Asian Other/Uncertain
White 1531 2 33 5 30
African American - 45 7 0 0
Latino - - 155 2 2
Asian - - - 205 1
Other/Uncertain - - - - 0
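As a consistency check, the counts in Table S9 can be totaled and the share of pairs on the diagonal computed. The sketch below is illustrative only. It treats the full-sib table as upper-triangular (an assumed reading of the layout), and it measures concordance on the five collapsed categories, which may not be exactly the definition behind the figures quoted in the abstract. The function name `concordance` is hypothetical.

```python
# Category order: White, African American, Latino, Asian, Other/Uncertain.
# Counts transcribed from Table S9.
# Parent-offspring: rows are offspring, columns are parents.
parent_offspring = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]
# Full sibs: unordered pairs, so only the upper triangle is filled (assumed reading).
full_sibs = [
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
]


def concordance(table):
    """Return (total pairs, fraction of pairs on the diagonal)."""
    total = sum(sum(row) for row in table)
    same = sum(table[i][i] for i in range(len(table)))
    return total, same / total


for name, table in [("parent-offspring", parent_offspring), ("full sibs", full_sibs)]:
    total, rate = concordance(table)
    print(f"{name}: {total} pairs, {rate:.1%} concordant")
# parent-offspring: 3741 pairs, 93.0% concordant
# full sibs: 2018 pairs, 95.9% concordant
```

The totals match the 3741 parent-child and 2018 full-sib pairs reported in the abstract, and the diagonal shares are in line with the 93% and 96% concordance quoted there.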

Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity, Sib 2 Race/Ethnicity, Genetic Ancestry, Number

Both Sibs Self-Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self-Report, One Sib EHR

EW Latino EW 1

EWLT Other/Uncertain EW 1

LT White EW 1

LTEA Other/Uncertain EWEAPI 1

EAPI Other/Uncertain EA 1

Both Sibs EHR

White Latino EWNA 1

Table S11 Race/Ethnicity and Genetic Ancestry for Parent-Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity, Child Race/Ethnicity, Parent Genetic Ancestry, Child Genetic Ancestry, Number

Parent and Child Self-Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1

EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW Other/Uncertain EW EW 1

EW Other/Uncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT Other/Uncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

Other/Uncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

Other/Uncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1

Page 12: Characterizing Race/Ethnicity and Genetic Ancestry … › ... › genetics › 200 › 4 › 1285.full.pdfSelf-reported African Americans and Latinos show ed extensive European and

GENETICSSupporting Information

wwwgeneticsorglookupsuppldoi101534genetics115178616-DC1

Characterizing RaceEthnicity and Genetic Ancestry for100000 Subjects in the Genetic Epidemiology Research on

Adult Health and Aging (GERA) CohortYambazi Banda Mark N Kvale Thomas J Hoffmann Stephanie E Hesselson Dilrini Ranatunga Hua Tang Chiara Sabatti

Lisa A Croen Brad P Dispensa Mary Henderson Carlos Iribarren Eric Jorgenson Lawrence H Kushi Dana LudwigDiane Olberg Charles P QuesenberryJr Sarah Rowell Marianne Sadler Lori C Sakoda Stanley Sciortino Ling Shen

David Smethurst Carol P Somkin Stephen K Van Den Eeden Lawrence Walter Rachel A Whitmer Pui-Yan KwokCatherine Schaefer and Neil Risch

Copyright copy 2015 by the Genetics Society of AmericaDOI 101534genetics115178616

2 SI Y Banda et al

Figure S1 PCA of European and West Asian subjects on the EUR array A clear Ashkenazi cluster is observed The largest cluster depicts the northwest‐southeast cline within Europe A Those reporting a single ethnicity B Those reporting multiple ethnicities

Y Banda et al 3 SI

Figure S2 PCA of subjects who are neither South Asian nor Ashkenazi The Ashkenazi subjects were later projected onto the PCs obtained A Those reporting a single ethnicity B Those reporting more than one ethnicity

4 SI Y Banda et al

Figure S3 PCA of individuals reporting South Asian ethnicity either alone or in combination with European ethnicity Separate clusters of the Indian subgroups from different Indian regions identified by onomastics are also identified

Y Banda et al 5 SI

Figure S4 PCA of subjects on the EAS array A Individuals reporting only East Asian ancestry B Individuals reporting East Asian and European ancestry

6 SI Y Banda et al

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity including those reporting another ethnicity

Y Banda et al 7 SI

Figure S6 PCA of individuals run on the AFR array A Individuals reporting only African or African American ethnicity individuals identified by onomastics as Ethiopian Eritrean and Kenyan depicted separately B Individuals reporting AfricanAfrican American ethnicity plus at least one additional ethnicity

8 SI Y Banda et al

Figure S7 PCA of individuals run on the LAT array A Individuals reporting Latino ethnicity only and Native American ethnicity only B Individuals reporting Latino ethnicity by nationality C Individuals reporting Latino ethnicity and at least one additional ethnicity and also individuals reporting Native American and European ethnicities only

Y Banda et al 9 SI

Figure S8 Global PCA of GERA subjects A‐F Individuals distributed according to continental differentiation Admixed individuals also observed

10 SI Y Banda et al

Figure S9 Identification of MZ twin and first‐degree relative pairs by KING The x‐axis represents the proportion of SNPs with zero IBS (Identity by State) eg TT and CC

Y Banda et al 11 SI

File S1

Supplementary Methods and Results

Adjudication of RaceEthnicity Information

As described in Methods PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry

of the GERA cohort In so doing we discovered a large number of individuals (2140) run on the AFR array who were estimated

to have 100 EuropeanWest Asian genetic ancestry and a smaller number (123) to have 100 East Asian genetic ancestry A

small number (276) of individuals run on the EAS array were estimated to have 100 European ancestry This led to further

investigation of the self‐report assignments for these individuals

Direct examination of the original survey forms for these subjects revealed that the self‐reported raceethnicity information on

the form was not consistent with the computerized information for some individuals Further investigation revealed that an

artifact had occurred when the forms were originally scanned due to a black mark on the glass scanning plate overlaying the

box for African American which led to a number of subjects being recorded as having African American raceethnicity (in

addition to another raceethnicity) when in fact they did not (according to the original form) By our algorithm for array

assignment because these individuals were assumed to have some African ancestry they were assigned to the AFR array

These individuals were then adjudicated to raceethnicity categories based on the actual survey information supplemented by

other data on raceethnicity in the KP administrative databases as necessary and principal component scores were

recalculated This error only affected individuals with originally recorded African American raceethnicity from the

computerized scanning The individuals with 100 EuropeanWest Asian genetic ancestry that were run on the EAS array had

no scanning errors and hence were not adjudicated Reviewing the raceethnicity self‐report of these individuals the large

majority self‐reported both East Asian and European raceethnicity

Table S2 displays the relationship between self‐reported raceethnicity and genotyping array for those with self‐report

raceethnicity data that was not missing or adjudicated due to scanning errors Because of the time element required for

processing samples the array assignments were made based on raw data from the raceethnicity questions on the survey prior

to data cleaning After data cleaning we noticed that some individuals were assigned to arrays based on the raw information

that was not consistent with their final raceethnicity categorization and the assignment algorithm As can be seen in Table S2

this primarily affected a modest number of individuals with a final assignment of mixed raceethnicity who were run on the EUR

array Table S3 shows array assignments for individuals with scanning errors or missing self‐report raceethnicity data whose

12 SI Y Banda et al

final raceethnicity was determined from KPNC administrative databases In this group are the 2140 Whites and 123 Asians

with scanning errors who were run on the AFR array We also note a moderate number of individuals classified as White (267)

who were run on the EAS array

Finally we emphasize that no raceethnicity assignments or re‐assignments were based on genetic information The genetic

information was only used in the detection of the scanning problems for some individuals by comparing their computerized

responses to those on the original forms All self‐report raceethnicity data reported in Results are based on the final cleaned

and adjudicated categories

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (eg those in the lactase and MHC regions on chromosomes 2 and

6 respectively) pairs of SNPs that had an r2 greater than 05 and within 5 MB of each other were considered and one member

of the pair removed Also removed were SNPs located in regions with inversions such as chromosomes 8p23 and 17q21 These

structural variations have previously been shown to influence PCA results in European ancestry samples As reported in Results

various PCA runs were performed separately for individuals genotyped on different arrays The numbers of SNPs remaining

after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4

For the initial individual admixture analyses a set of 43988 SNPs which were common between the HGDP dataset and our set

of 144799 high quality SNPs was used A set of 38301 SNPs (used for the global PCA run) remaining after LD pruning and

removal of SNPs located in regions with structural variations was also used for admixture analyses We found very minimal

difference in admixture estimates obtained from the two types of analyses

Distribution of continental genetic ancestry as a function of self‐reported raceethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self‐report

raceethnicity groups Among those reporting EuropeanWest Asian raceethnicity 56 had evidence of genetic ancestry from

two continents however for the large majority the second continent was South Asia Hence this likely does not reflect recent

admixture but rather the genetic similarity of West Asians and South Asians For a moderate proportion the second genetic

ancestry is Native American By contrast among the self‐reported AfricansAfrican Americans 882 have evidence of genetic

Y Banda et al 13 SI

ancestry from two continents This represents the EuropeanWest Asian genetic admixture present in most African Americans

that has occurred over 5 centuries For those with a single genetic ancestry that ancestry is African As expected nearly all

self‐reported Latinos have genetic ancestry from more than one continent a substantial proportion (299) have genetic

ancestry from 3 continents ndash EuropeanWest Asian Native American and African while the majority (652) have genetic

ancestry from two continents EuropeanWest Asian and Native American The large majority of self‐reported East Asians have

genetic ancestry that is solely East Asian or East Asian and Pacific Islander The latter combination primarily reflects the close

genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture although that likely

applies to some We do note that for a modest number of individuals the second continental ancestry is EuropeanWest Asian

Similarly for self‐reported South Asians the large percentage (381) corresponding to two continents primarily reflects

genetic similarity of West Asians and South Asians in the admixture analysis Among those self‐reporting only Native American

raceethnicity 823 have a single genetic ancestry which is EuropeanWest Asian although 177 have genetic ancestry from

two or more continents which are EuropeanWest Asian and Native American

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, European/West Asian is one of them for nearly all, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.


Table S1. Magnitude of correlation of PC loadings for three supersets

PC    Set1-Set2 correlation    Set1-Set3 correlation    Set2-Set3 correlation
1     0.981                    0.979                    0.980
2     0.876                    0.853                    0.856
3     0.850                    0.816                    0.809
4     0.766                    0.776                    0.752
5     0.658                    0.654                    0.639
6     0.605                    0.604                    0.592
7     0.001                    0.001                    0.210
8     0.043                    0.179                    0.045
9     0.312                    0.068                    0.089
10    0.007                    0.067                    0.201
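As an illustration of what a "magnitude of correlation of PC loadings" can mean, the sketch below computes the absolute Pearson correlation between two loading vectors. It assumes, hypothetically, that each superset yields one loading per SNP for a given PC and that the same SNPs are compared in the same order; the function name is illustrative and this is not the authors' code. The absolute value is taken because the sign of a principal component is arbitrary.

```python
import math


def loading_correlation_magnitude(loadings_a, loadings_b):
    """Absolute Pearson correlation between two equal-length loading vectors."""
    if len(loadings_a) != len(loadings_b):
        raise ValueError("loading vectors must cover the same SNPs")
    n = len(loadings_a)
    mean_a = sum(loadings_a) / n
    mean_b = sum(loadings_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(loadings_a, loadings_b))
    var_a = sum((a - mean_a) ** 2 for a in loadings_a)
    var_b = sum((b - mean_b) ** 2 for b in loadings_b)
    return abs(cov / math.sqrt(var_a * var_b))


# Toy loadings: the second set is the first with its sign flipped,
# so the magnitude of the correlation is 1 even though the raw value is -1.
set1_pc = [0.2, -0.1, 0.4, 0.0]
set2_pc = [-0.2, 0.1, -0.4, 0.0]
print(loading_correlation_magnitude(set1_pc, set2_pc))
```

Read this way, the values in Table S1 near 1 for the first PCs and near 0 from PC 7 onward indicate which components are reproduced across supersets.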

Table S2. Self-reported Race/Ethnicity versus Genotyping Array. Race/Ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Race/Ethnicity    Genotyping Array: EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

Table S3. Race/Ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis-scanned self-report data

Race/Ethnicity      Genotyping Array
                    EUR     EAS     AFR     LAT
White               2115    267     2140    55
African American    0       0       99      3
Hispanic            5       0       0       255
Asian               5       183     123     1
Other/Uncertain     6       11      43      24
Total               2131    461     2405    338

Table S4. Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                              Number of SNPs after pruning
European/West Asian and Ashkenazi    38050
European only                        37967
South Asian                          38823
East Asian                           38077
Pacific Islander                     38833
African American                     39849
Latino                               38365
All GERA                             38301

Table S5. Individual ancestral admixture proportions for subjects run on the LAT array

Nationality               Ancestral admixture proportion (%)
                          African        European       Native American
Mexican                   2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central-South American    5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                  4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8

Table S6. Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Rows: Race-Ethnicity. Columns: Genetic Ancestry.

Genetic ancestry columns: EW; AF; EA; NA; SA; EW/AF; EW/EA; EW/NA; EW/SA; AF/EA; AF/SA; EA/NA; EA/PI; EA/SA; PI/SA; EW/AF/EA; EW/AF/NA; EW/AF/SA; EW/EA/NA; EW/EA/PI; EW/EA/SA; EW/NA/SA; EW/PI/SA; AF/EA/PI; AF/EA/SA; AF/PI/SA; EA/PI/SA; All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489
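The assignment rule stated in the Table S6 caption (an ancestry is counted for an individual when at least 5% of that individual's estimated ancestry comes from that group) can be sketched as follows. This is only an illustration under stated assumptions: the function name and the example proportions are hypothetical, and the admixture proportions are assumed to be given as fractions that sum to 1.

```python
def assign_ancestries(proportions, threshold=0.05):
    """Return the sorted ancestry labels whose estimated share is at least `threshold`."""
    return sorted(label for label, share in proportions.items() if share >= threshold)


# Hypothetical individual: 62% EW, 34% NA, 4% AF.
# AF falls below the 5% threshold, so the individual is counted under EW/NA
# (two continental ancestries) rather than EW/AF/NA.
example = {"EW": 0.62, "NA": 0.34, "AF": 0.04}
labels = assign_ancestries(example)
print("/".join(labels), len(labels))
```

Under this rule an individual contributes to exactly one genetic ancestry column of Table S6, determined by the set of groups that reach the threshold.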

Table S7. Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6. Rows are genetic ancestries; columns are race/ethnicity categories.

Genetic Ancestry    White    Afr Am    Asian    Latino    Other-uncertain
EW                  4302     0         4        37        34
AF                  0        6         0        0         0
EA                  0        0         218      3         3
SA                  0        0         8        0         0
EW/AF               25       92        0        3         6
EA/NA               0        0         1        0         0
EA/PI               0        0         41       0         0
EW/EA               33       0         18       2         15
EW/NA               68       1         0        151       8
EW/SA               131      0         2        0         2
EA/SA               0        0         3        0         1
SA/PI               0        0         1        0         0
EW/AF/EA            0        0         1        1         2
EW/AF/NA            3        1         0        44        2
EW/AF/SA            2        2         0        1         1
EW/EA/NA            3        0         1        5         3
EW/EA/PI            2        0         4        0         4
EW/EA/SA            3        0         3        0         0
EW/SA/NA            2        0         0        8         1
EW/SA/PI            1        0         0        0         0
EA/SA/PI            0        0         6        0         2
Total               4575     102       311      255       84

Table S8. Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Race-Ethnicity                      One Continent     Two Continents    Three Continents  All      Continental Distribution
Categories  Group                   Number  %         Number  %         Number  %         Number   1 Continent  2 Continents                       3 Continents
One         EW                      71992   94.2      4288    5.6       121     0.2       76401    100% EW      75% EW/SA, 15% EW/NA
One         AA                      230     8.6       2362    88.2      87      3.2       2679     100% AF      99% AF/EW
One         LT                      235     4.9       3135    65.2      1437    29.9      4807     99% EW       99% EW/NA                          90% EW/NA/AF
One         EA                      4837    75.7      1419    22.2      133     2.1       6389     100% EA      89% EA/PI, 8% EA/EW
One         PI                      7       7.6       40      43.5      45      48.9      92       -            48% PI/EW, 33% PI/EA               71% PI/EW/EA
One         SA                      272     59.1      175     38.1      13      2.8       460      94% SA       75% SA/EW, 14% SA/EA
One         NA                      555     82.3      87      12.9      32      4.8       674      100% EW      77% EW/NA
One         All (93.9% of total)    78128   85.4      11506   12.6      1868    2.0       91502
Two         All (5.6% of total)     2791    51.0      2140    39.1      545     10.0      5476     97% EW       60% EW/NA, 22% EW/EA, 13% EW/AF    29% EW/NA/AF, 23% EW/SA/NA, 17% EW/EA/NA
Three       All (0.5% of total)     48      9.4       345     67.5      118     23.1      511      90% EW       45% EW/NA, 36% EW/AF, 19% EW/EA    31% EW/EA/NA, 20% EW/AF/NA, 16% EW/SA/NA
All         All                     80967   83.1      13991   14.4      2531    2.6       97489

Table S9. First-degree relatives organized by self-reported race/ethnicity

MZ pair
                    White    African American    Latino    Asian    Other/Uncertain
White               28       0                   0         0        0
African American    0        0                   0         0        0
Latino              0        0                   2         0        0
Asian               0        0                   0         4        0
Other/Uncertain     0        0                   0         0        0

Parent (column) – Offspring (row)
                    White    African American    Latino    Asian    Other/Uncertain
White               3044     1                   23        6        21
African American    8        54                  0         1        1
Latino              122      6                   175       3        0
Asian               35       2                   0         205      1
Other/Uncertain     32       0                   1         0        0

Full sibs (upper triangle)
                    White    African American    Latino    Asian    Other/Uncertain
White               1531     2                   33        5        30
African American             45                  7         0        0
Latino                                           155       2        2
Asian                                                      205      1
Other/Uncertain                                                     0
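The concordance rates quoted in the abstract (93% for parent-child pairs, 96% for full-sib pairs) can be reproduced from Table S9 if concordance is taken to be the share of pairs on the diagonal of the collapsed five-category tables. That definition is an assumption made for this sketch, as are the variable and function names; the counts are copied from Table S9.

```python
# Parent (column) by offspring (row) counts from Table S9.
# Category order: White, African American, Latino, Asian, Other/Uncertain.
PARENT_OFFSPRING = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

# Full-sib counts from Table S9, given as an upper triangle:
# each row starts at its own diagonal cell.
FULL_SIBS_UPPER = [
    [1531, 2, 33, 5, 30],
    [45, 7, 0, 0],
    [155, 2, 2],
    [205, 1],
    [0],
]


def concordance_square(matrix):
    """Return (concordant pairs, all pairs) for a full square table."""
    concordant = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return concordant, total


def concordance_upper(rows):
    """Return (concordant pairs, all pairs) for an upper-triangular table."""
    concordant = sum(row[0] for row in rows)
    total = sum(sum(row) for row in rows)
    return concordant, total


for name, (concordant, total) in [
    ("parent-child", concordance_square(PARENT_OFFSPRING)),
    ("full sibs", concordance_upper(FULL_SIBS_UPPER)),
]:
    print(f"{name}: {concordant}/{total} = {100.0 * concordant / total:.1f}% concordant")
```

The totals recovered this way, 3741 parent-child pairs and 2018 full-sib pairs, match the pair counts given in the abstract.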

Table S10. Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity    Sib 2 Race/Ethnicity    Genetic Ancestry    Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

Table S11. Race/Ethnicity and Genetic Ancestry for Parent-Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity    Child Race/Ethnicity    Parent Genetic Ancestry    Child Genetic Ancestry    Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


Figure S1. PCA of European and West Asian subjects on the EUR array. A clear Ashkenazi cluster is observed. The largest cluster depicts the northwest-southeast cline within Europe. (A) Those reporting a single ethnicity. (B) Those reporting multiple ethnicities.

Figure S2. PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3. PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4. PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5. PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6. PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7. PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8. Global PCA of GERA subjects. (A-F) Individuals distributed according to continental differentiation. Admixed individuals are also observed.

Figure S9. Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
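The x-axis quantity of Figure S9, the proportion of SNPs with zero IBS, can be illustrated with a short sketch. This is not the KING implementation; it assumes genotypes are coded as 0, 1, or 2 copies of one allele (so TT versus CC is 0 versus 2), with `None` marking a missing call, and the function name and toy genotypes are hypothetical.

```python
def ibs0_proportion(genotypes_a, genotypes_b):
    """Proportion of jointly called SNPs at which two individuals share no allele.

    Genotypes are allele counts (0, 1 or 2); None marks a missing call.
    Two opposite homozygotes (0 vs 2) share zero alleles identical by state.
    """
    compared = 0
    ibs0 = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue
        compared += 1
        if abs(a - b) == 2:
            ibs0 += 1
    if compared == 0:
        raise ValueError("no SNPs with calls in both individuals")
    return ibs0 / compared


# Toy genotypes: a parent and child always share at least one allele,
# so their IBS0 proportion is zero apart from genotyping error.
parent = [0, 1, 2, 2, 0, 1, None]
child = [1, 1, 1, 2, 0, 0, 2]
unrelated = [2, 1, 0, 2, 2, 1, 0]
print(ibs0_proportion(parent, child))
print(ibs0_proportion(parent, unrelated))
```

MZ twin and parent-offspring pairs are therefore expected near zero on this axis, while full sibs and unrelated pairs show progressively larger values.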

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) estimated to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that the self-reported race/ethnicity information on the form was not consistent with the computerized information for some individuals. Further investigation revealed that an artifact had occurred when the forms were originally scanned, due to a black mark on the glass scanning plate overlaying the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). By our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error only affected individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. On review of the race/ethnicity self-report of these individuals, the large majority self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those with self-report race/ethnicity data that were not missing or adjudicated due to scanning errors. Because of the time element required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals were assigned to arrays based on raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was only used in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of each pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
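The pairwise pruning rule described above (r² greater than 0.5 within 5 Mb, one member of the pair removed) can be sketched as a greedy pass over SNPs in position order. This is a minimal illustration, not the authors' pipeline: the text does not say which member of a pair was removed, so the sketch arbitrarily keeps the earlier SNP; r² is computed here as the squared Pearson correlation of genotype dosages; and the field names, function names, and toy data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_prune(snps, r2_max=0.5, window_bp=5_000_000):
    """Greedy pruning: drop a SNP if a kept SNP within the window has r2 > r2_max."""
    kept = []
    for snp in sorted(snps, key=lambda s: (s["chrom"], s["pos"])):
        in_ld = any(
            k["chrom"] == snp["chrom"]
            and snp["pos"] - k["pos"] <= window_bp
            and r_squared(k["dosage"], snp["dosage"]) > r2_max
            for k in kept
        )
        if not in_ld:
            kept.append(snp)
    return kept


# Toy data: rs_b duplicates rs_a nearby (pruned); rs_c has the same dosages
# but lies more than 5 Mb away (kept); rs_d is on another chromosome (kept).
snps = [
    {"id": "rs_a", "chrom": 2, "pos": 1_000, "dosage": [0, 1, 2, 1, 0, 2]},
    {"id": "rs_b", "chrom": 2, "pos": 2_000, "dosage": [0, 1, 2, 1, 0, 2]},
    {"id": "rs_c", "chrom": 2, "pos": 9_000_000, "dosage": [0, 1, 2, 1, 0, 2]},
    {"id": "rs_d", "chrom": 6, "pos": 1_500, "dosage": [2, 0, 1, 1, 2, 0]},
]
print([s["id"] for s in ld_prune(snps)])
```

Removal of SNPs in inversion regions such as 8p23 and 17q21 would be a separate filter on chromosome and position, applied before or after this step.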

For the initial individual admixture analyses, a set of 43988 SNPs that were common between the HGDP dataset and our set of 144799 high-quality SNPs was used. A set of 38301 SNPs (used for the global PCA run), remaining after LD pruning and removal of SNPs located in regions with structural variations, was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.


Page 14: Characterizing Race/Ethnicity and Genetic Ancestry … › ... › genetics › 200 › 4 › 1285.full.pdfSelf-reported African Americans and Latinos show ed extensive European and

Y Banda et al 3 SI

Figure S2. PCA of subjects who are neither South Asian nor Ashkenazi. The Ashkenazi subjects were later projected onto the PCs obtained. (A) Those reporting a single ethnicity. (B) Those reporting more than one ethnicity.

Figure S3. PCA of individuals reporting South Asian ethnicity, either alone or in combination with European ethnicity. Separate clusters of the Indian subgroups from different Indian regions, identified by onomastics, are also identified.

Figure S4. PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5. PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6. PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7. PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8. Global PCA of GERA subjects. (A-F) Individuals distributed according to continental differentiation. Admixed individuals are also observed.

Figure S9. Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
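To make the x-axis of Figure S9 concrete, the following is a minimal sketch of the zero-IBS proportion for a single pair of individuals. It assumes genotypes are coded as allele counts (0, 1, or 2, with None for a missing call); the function name `ibs0_proportion` and this coding are illustrative assumptions, not KING's implementation.

```python
def ibs0_proportion(g1, g2):
    """Proportion of SNPs at which two individuals share no allele by state.

    g1, g2: equal-length sequences of allele counts (0, 1, 2) or None for a
    missing call. A SNP has zero IBS when the two individuals are opposite
    homozygotes (e.g. TT versus CC), i.e. the counts differ by 2.
    """
    # Keep only SNPs called in both individuals.
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no SNPs called in both individuals")
    zero_ibs = sum(1 for a, b in pairs if abs(a - b) == 2)
    return zero_ibs / len(pairs)


# Toy genotypes (hypothetical data, six SNPs, one missing call).
person_a = [0, 2, 1, 2, None, 0]
person_b = [2, 2, 1, 0, 1, 0]
share = ibs0_proportion(person_a, person_b)
```

Barring genotyping error, MZ twins and parent-child pairs are never opposite homozygotes at a SNP, so their zero-IBS proportion is expected to be close to 0.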

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) estimated to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that, for some individuals, the self-reported race/ethnicity information on the form was not consistent with the computerized information. Further investigation revealed that an artifact had occurred when the forms were originally scanned: a black mark on the glass scanning plate overlay the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when, according to the original form, they did not. Because these individuals were assumed to have some African ancestry, our algorithm for array assignment placed them on the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented as necessary by other data on race/ethnicity in the KP administrative databases, and principal component scores were recalculated. This error affected only individuals whose African American race/ethnicity was originally recorded from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. On review of their race/ethnicity self-reports, the large majority had reported both East Asian and European race/ethnicity.
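The screening step described above, in which subjects whose estimated genetic ancestry is at odds with their genotyping array are singled out for inspection of the original form, can be sketched as follows. This is an illustration only: the record layout, the function name `flag_for_form_review`, and the 5% cutoff are assumptions made for the example, and a flag here only prompts a look at the survey form; it does not change anyone's race/ethnicity category.

```python
def flag_for_form_review(records, array, expected, min_prop=0.05):
    """Return IDs of subjects run on `array` whose estimated proportion of the
    `expected` genetic ancestry is below `min_prop` (a hypothetical cutoff)."""
    return [
        r["id"]
        for r in records
        if r["array"] == array and r["ancestry"].get(expected, 0.0) < min_prop
    ]


# Hypothetical records: genotyping array plus estimated ancestry proportions.
records = [
    {"id": "s1", "array": "AFR", "ancestry": {"EW": 1.00}},
    {"id": "s2", "array": "AFR", "ancestry": {"AF": 0.80, "EW": 0.20}},
    {"id": "s3", "array": "AFR", "ancestry": {"EA": 1.00}},
    {"id": "s4", "array": "EAS", "ancestry": {"EW": 1.00}},
]
flagged_afr = flag_for_form_review(records, array="AFR", expected="AF")
flagged_eas = flag_for_form_review(records, array="EAS", expected="EA")
```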

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those whose self-report race/ethnicity data were neither missing nor adjudicated due to scanning errors. Because of the time required for processing samples, the array assignments were made from the raw responses to the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals had been assigned to arrays based on raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. This group includes the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or reassignments were based on genetic information. The genetic information was used only in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
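The pairwise LD pruning rule described above can be sketched as a greedy pass over SNPs ordered by position. This is a minimal illustration under stated assumptions rather than the pipeline actually used: genotypes are coded as allele counts (0, 1, 2), r² is taken as the squared Pearson correlation of those counts, and the later SNP of each offending pair is the one removed (the text does not say which member was dropped). The names `r_squared` and `ld_prune` are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two allele-count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: treat as carrying no LD signal
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(snps, r2_max=0.5, window_bp=5_000_000):
    """Greedy pruning: snps is a list of (name, chrom, pos, genotypes).

    For every pair on the same chromosome within `window_bp` whose r² exceeds
    `r2_max`, the later SNP is removed (an assumption of this sketch).
    Returns the names of the SNPs kept, in positional order.
    """
    ordered = sorted(snps, key=lambda s: (s[1], s[2]))
    removed = set()
    for i, (name_i, chrom_i, pos_i, g_i) in enumerate(ordered):
        if name_i in removed:
            continue
        for name_j, chrom_j, pos_j, g_j in ordered[i + 1:]:
            if chrom_j != chrom_i or pos_j - pos_i > window_bp:
                break  # sorted order: nothing further can be in the window
            if name_j not in removed and r_squared(g_i, g_j) > r2_max:
                removed.add(name_j)
    return [s[0] for s in ordered if s[0] not in removed]


# Toy data (hypothetical SNP names, positions, and genotypes).
toy = [
    ("rs1", 2, 100, [0, 1, 2, 0, 1, 2]),
    ("rs2", 2, 200, [0, 1, 2, 0, 1, 2]),         # in perfect LD with rs1
    ("rs3", 2, 10_000_000, [0, 1, 2, 0, 1, 2]),  # same pattern, but > 5 Mb away
    ("rs4", 6, 150, [2, 0, 1, 1, 0, 2]),
]
kept = ld_prune(toy)
```

In this toy example rs2 is dropped because it is in perfect LD with rs1, while rs3 is kept because it lies outside the 5 Mb window.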

For the initial individual admixture analyses, a set of 43,988 SNPs that were common to the HGDP dataset and our set of 144,799 high-quality SNPs was used. A set of 38,301 SNPs (used for the global PCA run) remaining after LD pruning and removal of SNPs located in regions with structural variations was also used for admixture analyses. We found minimal difference in the admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority the second continent was South Asia. Hence, this likely does not reflect recent admixture but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans, which has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent: a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for nearly all of those with genetic ancestry from more than one continent, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: for the large majority with a single continental genetic ancestry, that ancestry is European/West Asian. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.

Table S1. Magnitude of correlation of PC loadings for three supersets

PC    Set1-Set2 correlation    Set1-Set3 correlation    Set2-Set3 correlation
1     0.981                    0.979                    0.980
2     0.876                    0.853                    0.856
3     0.850                    0.816                    0.809
4     0.766                    0.776                    0.752
5     0.658                    0.654                    0.639
6     0.605                    0.604                    0.592
7     0.001                    0.001                    0.210
8     0.043                    0.179                    0.045
9     0.312                    0.068                    0.089
10    0.007                    0.067                    0.201
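Table S1 reports the magnitude of each correlation, presumably because the sign of a principal component is arbitrary: the same axis can come out flipped in two different runs. A minimal sketch of that comparison for one PC follows; the function name `loading_correlation_magnitude` and the toy loading vectors are illustrative assumptions, not the authors' code.

```python
import math


def loading_correlation_magnitude(loadings_a, loadings_b):
    """Absolute Pearson correlation between two vectors of SNP loadings
    for the same PC, computed over the same SNPs in the same order."""
    n = len(loadings_a)
    ma, mb = sum(loadings_a) / n, sum(loadings_b) / n
    saa = sum((a - ma) ** 2 for a in loadings_a)
    sbb = sum((b - mb) ** 2 for b in loadings_b)
    sab = sum((a - ma) * (b - mb) for a, b in zip(loadings_a, loadings_b))
    return abs(sab / math.sqrt(saa * sbb))


# Toy loadings (hypothetical): the second run recovers the same axis with the
# opposite sign and a different scale, so the magnitude is still 1.
set1_pc = [1.0, 2.0, 3.0, 4.0]
set2_pc = [-2.0, -4.0, -6.0, -8.0]
magnitude = loading_correlation_magnitude(set1_pc, set2_pc)
```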

Table S2. Self-reported race/ethnicity versus genotyping array. Race/ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Genotyping Array

Race/Ethnicity EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

Table S3. Race/ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis-scanned self-report data

                    Genotyping Array
Race/Ethnicity      EUR     EAS     AFR     LAT
White               2115    267     2140    55
African American    0       0       99      3
Hispanic            5       0       0       255
Asian               5       183     123     1
Other/Uncertain     6       11      43      24
Total               2131    461     2405    338

Table S4. Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                              Number of SNPs after pruning
European/West Asian and Ashkenazi    38050
European only                        37967
South Asian                          38823
East Asian                           38077
Pacific Islander                     38833
African American                     39849
Latino                               38365
All GERA                             38301

Table S5. Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African        European       Native American
Mexican                   2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central-South American    5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto-Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                  4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8

Table S6. Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Genetic Ancestry

Race/Ethnicity  EW  AF  EA  NA  SA  EW/AF  EW/EA  EW/NA  EW/SA  AF/EA  AF/SA  EA/NA  EA/PI  EA/SA  PI/SA  EW/AF/EA  EW/AF/NA  EW/AF/SA  EW/EA/NA  EW/EA/PI  EW/EA/SA  EW/NA/SA  EW/PI/SA  AF/EA/PI  AF/EA/SA  AF/PI/SA  EA/PI/SA  All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489
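The assignment rule stated in the Table S6 caption (a genetic ancestry is assigned to an individual when at least 5% of that individual's ancestry is estimated from that group) can be sketched as below. The function name `ancestry_label`, the input format (a mapping from ancestry abbreviation to estimated proportion), and the fixed ordering of abbreviations are assumptions made for illustration.

```python
# Ordering used to build combined labels such as "EW/NA/SA"
# (an assumption that follows the column order of Table S6).
ANCESTRY_ORDER = ["EW", "AF", "EA", "NA", "PI", "SA"]


def ancestry_label(proportions, threshold=0.05):
    """Combine every ancestry whose estimated proportion is at least
    `threshold` into a single label, e.g. {"EW": 0.62, "NA": 0.34} -> "EW/NA"."""
    present = [a for a in ANCESTRY_ORDER if proportions.get(a, 0.0) >= threshold]
    return "/".join(present)


# Hypothetical admixture estimates for three individuals.
label_single = ancestry_label({"EW": 0.97, "SA": 0.03})
label_two = ancestry_label({"EW": 0.62, "NA": 0.34, "AF": 0.04})
label_three = ancestry_label({"EW": 0.55, "NA": 0.30, "AF": 0.15})
```

With these toy inputs, components estimated at 3% or 4% fall below the threshold and do not contribute to the label.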

Table S7. Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race/Ethnicity   EW    AF  EA   SA  EW/AF  EA/NA  EA/PI  EW/EA  EW/NA  EW/SA  EA/SA  SA/PI  EW/AF/EA  EW/AF/NA  EW/AF/SA  EW/EA/NA  EW/EA/PI  EW/EA/SA  EW/SA/NA  EW/SA/PI  EA/SA/PI  Total
White            4302  0   0    0   25     0      0      33     68     131    0      0      0         3         2         3         2         3         2         1         0         4575
Afr Am           0     6   0    0   92     0      0      0      1      0      0      0      0         1         2         0         0         0         0         0         0         102
Asian            4     0   218  8   0      1      41     18     0      2      3      1      1         0         0         1         4         3         0         0         6         311
Latino           37    0   3    0   3      0      0      2      151    0      0      0      1         44        1         5         0         0         8         0         0         255
Other/Uncertain  34    0   3    0   6      0      0      15     8      2      1      0      2         2         1         3         4         0         1         0         2         84

Table S8. Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

                 Genetic Ancestry
Race/Ethnicity   One Continent    Two Continents   Three Continents  All                       Continental Distribution
                 Number   %       Number   %       Number   %        Number                    1 Continent   2 Continents                        3 Continents
One    EW        71992    94.2    4288     5.6     121      0.2      76401                     100% EW       75% EW/SA; 15% EW/NA
One    AA        230      8.6     2362     88.2    87       3.2      2679                      100% AF       99% AF/EW
One    LT        235      4.9     3135     65.2    1437     29.9     4807                      99% EW        99% EW/NA                           90% EW/NA/AF
One    EA        4837     75.7    1419     22.2    133      2.1      6389                      100% EA       89% EA/PI; 8% EA/EW
One    PI        7        7.6     40       43.5    45       48.9     92                                      48% PI/EW; 33% PI/EA                71% PI/EW/EA
One    SA        272      59.1    175      38.1    13       2.8      460                       94% SA        75% SA/EW; 14% SA/EA
One    NA        555      82.3    87       12.9    32       4.8      674                       100% EW       77% EW/NA
One    All       78128    85.4    11506    12.6    1868     2.0      91502 (93.9% of total)
Two    All       2791     51.0    2140     39.1    545      10.0     5476 (5.6% of total)      97% EW        60% EW/NA; 22% EW/EA; 13% EW/AF     29% EW/NA/AF; 23% EW/SA/NA; 17% EW/EA/NA
Three  All       48       9.4     345      67.5    118      23.1     511 (0.5% of total)       90% EW        45% EW/NA; 36% EW/AF; 19% EW/EA     31% EW/EA/NA; 20% EW/AF/NA; 16% EW/SA/NA
All    All       80967    83.1    13991    14.4    2531     2.6      97489

Table S9. First-degree relatives organized by self-reported race/ethnicity

MZ pair
                    White   African American   Latino   Asian   Other/Uncertain
White               28      0                  0        0       0
African American    0       0                  0        0       0
Latino              0       0                  2        0       0
Asian               0       0                  0        4       0
Other/Uncertain     0       0                  0        0       0

Parent (column) - Offspring (row)
                    White   African American   Latino   Asian   Other/Uncertain
White               3044    1                  23       6       21
African American    8       54                 0        1       1
Latino              122     6                  175      3       0
Asian               35      2                  0        205     1
Other/Uncertain     32      0                  1        0       0

Full sibs
                    White   African American   Latino   Asian   Other/Uncertain
White               1531    2                  33       5       30
African American            45                 7        0       0
Latino                                         155      2       2
Asian                                                   205     1
Other/Uncertain                                                 0
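The concordance rates quoted in the abstract (93% for parent-child pairs, 96% for full-sib pairs) can be recovered from the counts in Table S9 as the share of pairs on the diagonal. The sketch below simply redoes that arithmetic; the variable and function names are illustrative, and the lower triangle of the full-sib matrix is filled with zeros because the table lists each unordered pair once.

```python
GROUPS = ["White", "African American", "Latino", "Asian", "Other/Uncertain"]

# Parent (column) by offspring (row) counts from Table S9.
PARENT_OFFSPRING = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

# Full-sib counts from Table S9 (upper triangle; lower triangle set to 0).
FULL_SIBS = [
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
]


def concordance(matrix):
    """Return (concordant pairs, total pairs, concordance rate)."""
    total = sum(sum(row) for row in matrix)
    concordant = sum(matrix[i][i] for i in range(len(matrix)))
    return concordant, total, concordant / total


po_concordant, po_total, po_rate = concordance(PARENT_OFFSPRING)
sib_concordant, sib_total, sib_rate = concordance(FULL_SIBS)
```

The totals also match the pair counts given in the abstract: 3741 parent-child pairs and 2018 full-sib pairs.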

Table S10. Race/ethnicity and genetic ancestry for sib pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity | Sib 2 Race/Ethnicity | Genetic Ancestry | Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

Table S11. Race/ethnicity and genetic ancestry for parent-child pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity | Child Race/Ethnicity | Parent Genetic Ancestry | Child Genetic Ancestry | Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1

Page 15: Characterizing Race/Ethnicity and Genetic Ancestry … › ... › genetics › 200 › 4 › 1285.full.pdfSelf-reported African Americans and Latinos show ed extensive European and

4 SI Y Banda et al

Figure S3 PCA of individuals reporting South Asian ethnicity either alone or in combination with European ethnicity Separate clusters of the Indian subgroups from different Indian regions identified by onomastics are also identified

Y Banda et al 5 SI

Figure S4 PCA of subjects on the EAS array A Individuals reporting only East Asian ancestry B Individuals reporting East Asian and European ancestry

6 SI Y Banda et al

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity including those reporting another ethnicity

Y Banda et al 7 SI

Figure S6 PCA of individuals run on the AFR array A Individuals reporting only African or African American ethnicity individuals identified by onomastics as Ethiopian Eritrean and Kenyan depicted separately B Individuals reporting AfricanAfrican American ethnicity plus at least one additional ethnicity

8 SI Y Banda et al

Figure S7 PCA of individuals run on the LAT array A Individuals reporting Latino ethnicity only and Native American ethnicity only B Individuals reporting Latino ethnicity by nationality C Individuals reporting Latino ethnicity and at least one additional ethnicity and also individuals reporting Native American and European ethnicities only

Y Banda et al 9 SI

Figure S8 Global PCA of GERA subjects A‐F Individuals distributed according to continental differentiation Admixed individuals also observed

10 SI Y Banda et al

Figure S9 Identification of MZ twin and first‐degree relative pairs by KING The x‐axis represents the proportion of SNPs with zero IBS (Identity by State) eg TT and CC

Y Banda et al 11 SI

File S1

Supplementary Methods and Results

Adjudication of RaceEthnicity Information

As described in Methods PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry

of the GERA cohort In so doing we discovered a large number of individuals (2140) run on the AFR array who were estimated

to have 100 EuropeanWest Asian genetic ancestry and a smaller number (123) to have 100 East Asian genetic ancestry A

small number (276) of individuals run on the EAS array were estimated to have 100 European ancestry This led to further

investigation of the self‐report assignments for these individuals

Direct examination of the original survey forms for these subjects revealed that the self‐reported raceethnicity information on

the form was not consistent with the computerized information for some individuals Further investigation revealed that an

artifact had occurred when the forms were originally scanned due to a black mark on the glass scanning plate overlaying the

box for African American which led to a number of subjects being recorded as having African American raceethnicity (in

addition to another raceethnicity) when in fact they did not (according to the original form) By our algorithm for array

assignment because these individuals were assumed to have some African ancestry they were assigned to the AFR array

These individuals were then adjudicated to raceethnicity categories based on the actual survey information supplemented by

other data on raceethnicity in the KP administrative databases as necessary and principal component scores were

recalculated This error only affected individuals with originally recorded African American raceethnicity from the

computerized scanning The individuals with 100 EuropeanWest Asian genetic ancestry that were run on the EAS array had

no scanning errors and hence were not adjudicated Reviewing the raceethnicity self‐report of these individuals the large

majority self‐reported both East Asian and European raceethnicity

Table S2 displays the relationship between self‐reported raceethnicity and genotyping array for those with self‐report

raceethnicity data that was not missing or adjudicated due to scanning errors Because of the time element required for

processing samples the array assignments were made based on raw data from the raceethnicity questions on the survey prior

to data cleaning After data cleaning we noticed that some individuals were assigned to arrays based on the raw information

that was not consistent with their final raceethnicity categorization and the assignment algorithm As can be seen in Table S2

this primarily affected a modest number of individuals with a final assignment of mixed raceethnicity who were run on the EUR

array Table S3 shows array assignments for individuals with scanning errors or missing self‐report raceethnicity data whose

12 SI Y Banda et al

final raceethnicity was determined from KPNC administrative databases In this group are the 2140 Whites and 123 Asians

with scanning errors who were run on the AFR array We also note a moderate number of individuals classified as White (267)

who were run on the EAS array

Finally we emphasize that no raceethnicity assignments or re‐assignments were based on genetic information The genetic

information was only used in the detection of the scanning problems for some individuals by comparing their computerized

responses to those on the original forms All self‐report raceethnicity data reported in Results are based on the final cleaned

and adjudicated categories

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (eg those in the lactase and MHC regions on chromosomes 2 and

6 respectively) pairs of SNPs that had an r2 greater than 05 and within 5 MB of each other were considered and one member

of the pair removed Also removed were SNPs located in regions with inversions such as chromosomes 8p23 and 17q21 These

structural variations have previously been shown to influence PCA results in European ancestry samples As reported in Results

various PCA runs were performed separately for individuals genotyped on different arrays The numbers of SNPs remaining

after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4

For the initial individual admixture analyses a set of 43988 SNPs which were common between the HGDP dataset and our set

of 144799 high quality SNPs was used A set of 38301 SNPs (used for the global PCA run) remaining after LD pruning and

removal of SNPs located in regions with structural variations was also used for admixture analyses We found very minimal

difference in admixture estimates obtained from the two types of analyses

Distribution of continental genetic ancestry as a function of self‐reported raceethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self‐report

raceethnicity groups Among those reporting EuropeanWest Asian raceethnicity 56 had evidence of genetic ancestry from

two continents however for the large majority the second continent was South Asia Hence this likely does not reflect recent

admixture but rather the genetic similarity of West Asians and South Asians For a moderate proportion the second genetic

ancestry is Native American By contrast among the self‐reported AfricansAfrican Americans 882 have evidence of genetic

Y Banda et al 13 SI

ancestry from two continents This represents the EuropeanWest Asian genetic admixture present in most African Americans

that has occurred over 5 centuries For those with a single genetic ancestry that ancestry is African As expected nearly all

self‐reported Latinos have genetic ancestry from more than one continent a substantial proportion (299) have genetic

ancestry from 3 continents ndash EuropeanWest Asian Native American and African while the majority (652) have genetic

ancestry from two continents EuropeanWest Asian and Native American The large majority of self‐reported East Asians have

genetic ancestry that is solely East Asian or East Asian and Pacific Islander The latter combination primarily reflects the close

genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture although that likely

applies to some We do note that for a modest number of individuals the second continental ancestry is EuropeanWest Asian

Similarly for self‐reported South Asians the large percentage (381) corresponding to two continents primarily reflects

genetic similarity of West Asians and South Asians in the admixture analysis Among those self‐reporting only Native American

raceethnicity 823 have a single genetic ancestry which is EuropeanWest Asian although 177 have genetic ancestry from

two or more continents which are EuropeanWest Asian and Native American

Those who self‐reported more than one raceethnicity comprise multiple combinations Among those reporting two 51 have

a single genetic ancestry which is nearly always EuropeanWest Asian Similarly for those with genetic ancestry from more

than one continent for nearly all EuropeanWest Asian is one of them with the second continental group being Native

American (60) East Asian (22) or African (13) The pattern is similar for those with genetic ancestry from three continents

The pattern is also closely reproduced in those reporting 3 raceethnicity categories the large majority with a single continental

genetic ancestry reflects EuropeanWest Asian genetic ancestry For those with two or more continental genetic ancestries

EuropeanWest Asian is nearly always one of them but in this case African ancestry is more prominent than East Asian

ancestry as the second continent

14 SI Y Banda et al

Table S1 Magnitude of correlation of PC loadings for three supersets

PC Set1‐Set2 correlation Set1‐Set3 correlation Set2‐Set3 correlation

1 0981 0979 0980

2 0876 0853 0856

3 0850 0816 0809

4 0766 0776 0752

5 0658 0654 0639

6 0605 0604 0592

7 0001 0001 0210

8 0043 0179 0045

9 0312 0068 0089

10 0007 0067 0201

Y Banda et al 15 SI

Table S2 Self‐reported RaceEthnicity versus Genotyping Array RaceEthnicity abbreviations EW = EuropeanWest Asian AA = AfricanAfrican AmericaAfro‐Caribbean EA = East Asian NA = Native American LT = Latino PI = Pacific Islander SA = South Asian

RaceEthnicity Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

16 SI Y Banda et al

Table S3 RaceEthnicity derived from KP administrative databases versus genotyping array used for those with missing or mis‐scanned self‐report data

RaceEthnicity Genotyping Array

EUR EAS AFR LAT

White 2115 267 2140 55

African American 0 0 99 3

Hispanic 5 0 0 255

Asian 5 183 123 1

OtherUncertain 6 11 43 24

Total 2131 461 2405 338

Y Banda et al 17 SI

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run Number of SNPs after pruning

EuropeanWest Asian and Ashkenazi 38050

European only 37967

South Asian 38823

East Asian 38077

Pacific Islander 38833

African American 39849

Latino 38365

All GERA 38301

18 SI Y Banda et al

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

Nationality Ancestral admixture proportion ()

African European Native American

Mexican 23 plusmn 50 670 plusmn 145 307 plusmn 135

Central‐South American 55 plusmn 108 654 plusmn 179 291 plusmn 163

Puerto‐Rican 123 plusmn 141 755 plusmn 148 124 plusmn 76

Cuban 127 plusmn 205 797 plusmn 200 76 plusmn 81

LAT mean 44 plusmn 79 665 plusmn 158 286 plusmn 148

Table S6 Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Rows: Race/Ethnicity. Columns: Genetic Ancestry, in the following order:

EW, AF, EA, NA, SA, EWAF, EWEA, EWNA, EWSA, AFEA, AFSA, EANA, EAPI, EASA, PISA, EWAFEA, EWAFNA, EWAFSA, EWEANA, EWEAPI, EWEASA, EWNASA, EWPISA, AFEAPI, AFEASA, AFPISA, EAPISA, All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489

Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Rows: Genetic Ancestry. Columns: Race/Ethnicity.

Genetic Ancestry   White   Afr Am   Asian   Latino   Other/uncertain
EW                  4302        0       4       37                34
AF                     0        6       0        0                 0
EA                     0        0     218        3                 3
SA                     0        0       8        0                 0
EWAF                  25       92       0        3                 6
EANA                   0        0       1        0                 0
EAPI                   0        0      41        0                 0
EWEA                  33        0      18        2                15
EWNA                  68        1       0      151                 8
EWSA                 131        0       2        0                 2
EASA                   0        0       3        0                 1
SAPI                   0        0       1        0                 0
EWAFEA                 0        0       1        1                 2
EWAFNA                 3        1       0       44                 2
EWAFSA                 2        2       0        1                 1
EWEANA                 3        0       1        5                 3
EWEAPI                 2        0       4        0                 4
EWEASA                 3        0       3        0                 0
EWSANA                 2        0       0        8                 1
EWSAPI                 1        0       0        0                 0
EASAPI                 0        0       6        0                 2
Total               4575      102     311      255                84

Table S8 Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Genetic ancestry: number of individuals (% of row total) by number of continents

Race/Ethnicity   One Continent   Two Continents   Three Continents   All      % of Total
One    EW        71992 (94.2)     4288  (5.6)       121  (0.2)       76401
One    AA          230  (8.6)     2362 (88.2)        87  (3.2)        2679
One    LT          235  (4.9)     3135 (65.2)      1437 (29.9)        4807
One    EA         4837 (75.7)     1419 (22.2)       133  (2.1)        6389
One    PI            7  (7.6)       40 (43.5)        45 (48.9)          92
One    SA          272 (59.1)      175 (38.1)        13  (2.8)         460
One    NA          555 (82.3)       87 (12.9)        32  (4.8)         674
One    All       78128 (85.4)    11506 (12.6)      1868  (2.0)       91502    93.9
Two    All        2791 (51.0)     2140 (39.1)       545 (10.0)        5476     5.6
Three  All          48  (9.4)      345 (67.5)       118 (23.1)         511     0.5
All    All       80967 (83.1)    13991 (14.4)      2531  (2.6)       97489

Continental distribution (%)

Race/Ethnicity   1 Continent   2 Continents                  3 Continents
One    EW        100 EW        75 EWSA, 15 EWNA
One    AA        100 AF        99 AFEW
One    LT         99 EW        99 EWNA                       90 EWNAAF
One    EA        100 EA        89 EAPI, 8 EAEW
One    PI                      48 PIEW, 33 PIEA              71 PIEWEA
One    SA         94 SA        75 SAEW, 14 SAEA
One    NA        100 EW        77 EWNA
Two    All        97 EW        60 EWNA, 22 EWEA, 13 EWAF     29 EWNAAF, 23 EWSANA, 17 EWEANA
Three  All        90 EW        45 EWNA, 36 EWAF, 19 EWEA     31 EWEANA, 20 EWAFNA, 16 EWSANA

Table S9 First-degree relatives organized by self-reported race/ethnicity

MZ pairs
                   White   African American   Latino   Asian   Other/Uncertain
White                 28                  0        0       0                 0
African American       0                  0        0       0                 0
Latino                 0                  0        2       0                 0
Asian                  0                  0        0       4                 0
Other/Uncertain        0                  0        0       0                 0

Parent (column) - Offspring (row)
                   White   African American   Latino   Asian   Other/Uncertain
White               3044                  1       23       6                21
African American       8                 54        0       1                 1
Latino               122                  6      175       3                 0
Asian                 35                  2        0     205                 1
Other/Uncertain       32                  0        1       0                 0

Full sibs
                   White   African American   Latino   Asian   Other/Uncertain
White               1531                  2       33       5                30
African American                         45        7       0                 0
Latino                                           155       2                 2
Asian                                                    205                 1
Other/Uncertain                                                              0
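As a worked check on Table S9, the sketch below computes the share of pairs whose two members fall in the same collapsed race/ethnicity group. It assumes that "concordant" means the diagonal cells of each matrix and that the full-sib matrix is upper triangular (unordered pairs); the names `PARENT_CHILD`, `FULL_SIBS`, and `concordance` are illustrative, not from the paper. The totals (3741 and 2018 pairs) and the rounded rates (93% and 96%) agree with the figures quoted in the abstract.

```python
# Counts copied from Table S9. Group order in both dimensions:
# White, African American, Latino, Asian, Other/Uncertain.
PARENT_CHILD = [  # rows: offspring, columns: parent
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

FULL_SIBS = [  # unordered pairs, so only the upper triangle is filled
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
]


def concordance(table):
    """Fraction of pairs on the diagonal (both members in the same group)."""
    total = sum(sum(row) for row in table)
    same = sum(table[i][i] for i in range(len(table)))
    return same / total


print(f"parent-child: {concordance(PARENT_CHILD):.1%}")  # 93.0%
print(f"full sibs:    {concordance(FULL_SIBS):.1%}")     # 95.9%
```

The diagonal-based definition is the simplest reading of the table; a definition based on the finer multi-category codes of Tables S10 and S11 could give slightly different values.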

Table S10 Race/ethnicity and genetic ancestry for sib pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity   Sib 2 Race/Ethnicity   Genetic Ancestry   Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

Table S11 Race/ethnicity and genetic ancestry for parent-child pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity   Child Race/Ethnicity   Parent Genetic Ancestry   Child Genetic Ancestry   Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


Figure S4 PCA of subjects on the EAS array. (A) Individuals reporting only East Asian ancestry. (B) Individuals reporting East Asian and European ancestry.

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity, including those reporting another ethnicity.

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A-F) Individuals distributed according to continental differentiation. Admixed individuals also observed.

Figure S9 Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
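The x-axis quantity in Figure S9 can be illustrated with a minimal sketch, which is not KING's implementation. It assumes genotypes coded as 0, 1, or 2 copies of one allele with `None` for missing calls; the function name `ibs0_proportion` is hypothetical. A SNP counts as zero IBS when the two individuals are opposite homozygotes (e.g., TT and CC), so parent-offspring and MZ pairs are expected to sit near zero on this axis, apart from genotyping error.

```python
def ibs0_proportion(genotypes_a, genotypes_b):
    """Proportion of jointly called SNPs at which two individuals share no allele.

    Genotypes are 0/1/2 allele counts; None marks a missing call (an assumed
    coding for this sketch).
    """
    compared = 0
    zero_ibs = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue  # skip SNPs not called in both individuals
        compared += 1
        if abs(a - b) == 2:  # opposite homozygotes, e.g. TT vs CC
            zero_ibs += 1
    if compared == 0:
        raise ValueError("no SNPs called in both individuals")
    return zero_ibs / compared


# Five jointly called SNPs, two of them opposite homozygotes.
print(ibs0_proportion([0, 1, 2, 2, 0, None], [2, 1, 2, 0, 0, 1]))  # 0.4
```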

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) estimated to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that, for some individuals, the self-reported race/ethnicity information on the form was not consistent with the computerized information. Further investigation revealed that an artifact had occurred when the forms were originally scanned: a black mark on the glass scanning plate overlay the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when, according to the original form, they did not. Because these individuals were assumed to have some African ancestry, our algorithm for array assignment placed them on the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented as necessary by other data on race/ethnicity in the KP administrative databases, and principal component scores were recalculated. This error affected only individuals whose computerized scan originally recorded African American race/ethnicity. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. On review of their race/ethnicity self-reports, the large majority had reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those whose self-report race/ethnicity data were neither missing nor adjudicated due to scanning errors. Because of the time required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals had been assigned to arrays, based on the raw information, in a way that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was used only in detecting the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were identified, and one member of each pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCA runs are shown in Table S4.
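The pairwise pruning rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: genotypes are coded 0/1/2 with no missing values, all SNPs lie on one chromosome and are sorted by position, and the later SNP of a correlated pair is the one dropped (the text does not say which member was removed). The names `r_squared` and `ld_prune` are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP has no defined correlation; treat as 0
    return sxy * sxy / (sxx * syy)


def ld_prune(snp_genotypes, positions, r2_max=0.5, window_bp=5_000_000):
    """Greedy pruning: return indices of SNPs kept.

    snp_genotypes[i] holds the genotypes of SNP i across samples;
    positions[i] is its base-pair position (sorted, one chromosome).
    """
    keep = [True] * len(positions)
    for i in range(len(positions)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(positions)):
            if positions[j] - positions[i] > window_bp:
                break  # later SNPs are even farther away
            if keep[j] and r_squared(snp_genotypes[i], snp_genotypes[j]) > r2_max:
                keep[j] = False  # assumption: drop the later SNP of the pair
    return [i for i, kept in enumerate(keep) if kept]


# Toy data: SNP 1 duplicates SNP 0 nearby; SNP 3 duplicates SNP 0 but is > 5 Mb away.
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 0, 0, 2, 2],
    [0, 1, 2, 0, 1, 2],
]
print(ld_prune(genotypes, [1000, 2000, 3000, 6_000_000]))  # [0, 2, 3]
```

Removal of SNPs in inversion regions such as 8p23 and 17q21 would be a separate position-based filter and is not shown here.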

For the initial individual admixture analyses, a set of 43988 SNPs that were common between the HGDP dataset and our set of 144799 high-quality SNPs was used. A set of 38301 SNPs (used for the global PCA run) remaining after LD pruning and removal of SNPs located in regions with structural variations was also used for admixture analyses. We found minimal difference between the admixture estimates obtained from the two analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority the second continent was South Asia. Hence this likely does not reflect recent admixture, but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent: a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian, or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects the genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for nearly all of those with genetic ancestry from more than one continent, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: the large majority with a single continental genetic ancestry have European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.
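The one-, two-, and three-continent counts discussed above rest on the assignment rule given with Table S6: an ancestry is assigned when it makes up at least 5 percent of an individual's estimated admixture. A minimal sketch of that rule follows. The function names, the dictionary input, and the example proportions are illustrative assumptions, and the label order (EW, AF, EA, NA, PI, SA) is inferred from the combined codes used in the tables.

```python
# Order in which ancestry codes are concatenated, inferred from labels such as
# EWAFNA, EWNASA, and EAPISA in the supplementary tables (an assumption).
ANCESTRY_ORDER = ("EW", "AF", "EA", "NA", "PI", "SA")


def assigned_ancestries(proportions, threshold=0.05):
    """Ancestries whose estimated admixture proportion is at least `threshold`."""
    return [code for code in ANCESTRY_ORDER
            if proportions.get(code, 0.0) >= threshold]


def ancestry_label(proportions, threshold=0.05):
    """Combined code in the style of Table S6, e.g. 'EWNA'."""
    return "".join(assigned_ancestries(proportions, threshold))


# Hypothetical individual: three ancestries reach the threshold.
example = {"EW": 0.62, "NA": 0.33, "AF": 0.05}
print(ancestry_label(example), len(assigned_ancestries(example)))  # EWAFNA 3

# Hypothetical individual: 3% South Asian falls below the threshold.
print(ancestry_label({"EW": 0.97, "SA": 0.03}))  # EW
```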

Table S1 Magnitude of correlation of PC loadings for three supersets

PC   Set1-Set2 correlation   Set1-Set3 correlation   Set2-Set3 correlation
 1   0.981                   0.979                   0.980
 2   0.876                   0.853                   0.856
 3   0.850                   0.816                   0.809
 4   0.766                   0.776                   0.752
 5   0.658                   0.654                   0.639
 6   0.605                   0.604                   0.592
 7   0.001                   0.001                   0.210
 8   0.043                   0.179                   0.045
 9   0.312                   0.068                   0.089
10   0.007                   0.067                   0.201

Table S2 Self-reported race/ethnicity versus genotyping array. Race/ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Race/Ethnicity   Genotyping Array

                 EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272


Page 17: Characterizing Race/Ethnicity and Genetic Ancestry … › ... › genetics › 200 › 4 › 1285.full.pdfSelf-reported African Americans and Latinos show ed extensive European and

6 SI Y Banda et al

Figure S5 PCA of individuals run on the EAS array reporting Pacific Islander ethnicity including those reporting another ethnicity

Y Banda et al 7 SI

Figure S6 PCA of individuals run on the AFR array A Individuals reporting only African or African American ethnicity individuals identified by onomastics as Ethiopian Eritrean and Kenyan depicted separately B Individuals reporting AfricanAfrican American ethnicity plus at least one additional ethnicity

8 SI Y Banda et al

Figure S7 PCA of individuals run on the LAT array A Individuals reporting Latino ethnicity only and Native American ethnicity only B Individuals reporting Latino ethnicity by nationality C Individuals reporting Latino ethnicity and at least one additional ethnicity and also individuals reporting Native American and European ethnicities only

Y Banda et al 9 SI

Figure S8 Global PCA of GERA subjects A‐F Individuals distributed according to continental differentiation Admixed individuals also observed

10 SI Y Banda et al

Figure S9 Identification of MZ twin and first‐degree relative pairs by KING The x‐axis represents the proportion of SNPs with zero IBS (Identity by State) eg TT and CC

Y Banda et al 11 SI

File S1

Supplementary Methods and Results

Adjudication of RaceEthnicity Information

As described in Methods PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry

of the GERA cohort In so doing we discovered a large number of individuals (2140) run on the AFR array who were estimated

to have 100 EuropeanWest Asian genetic ancestry and a smaller number (123) to have 100 East Asian genetic ancestry A

small number (276) of individuals run on the EAS array were estimated to have 100 European ancestry This led to further

investigation of the self‐report assignments for these individuals

Direct examination of the original survey forms for these subjects revealed that the self‐reported raceethnicity information on

the form was not consistent with the computerized information for some individuals Further investigation revealed that an

artifact had occurred when the forms were originally scanned due to a black mark on the glass scanning plate overlaying the

box for African American which led to a number of subjects being recorded as having African American raceethnicity (in

addition to another raceethnicity) when in fact they did not (according to the original form) By our algorithm for array

assignment because these individuals were assumed to have some African ancestry they were assigned to the AFR array

These individuals were then adjudicated to raceethnicity categories based on the actual survey information supplemented by

other data on raceethnicity in the KP administrative databases as necessary and principal component scores were

recalculated This error only affected individuals with originally recorded African American raceethnicity from the

computerized scanning The individuals with 100 EuropeanWest Asian genetic ancestry that were run on the EAS array had

no scanning errors and hence were not adjudicated Reviewing the raceethnicity self‐report of these individuals the large

majority self‐reported both East Asian and European raceethnicity

Table S2 displays the relationship between self‐reported raceethnicity and genotyping array for those with self‐report

raceethnicity data that was not missing or adjudicated due to scanning errors Because of the time element required for

processing samples the array assignments were made based on raw data from the raceethnicity questions on the survey prior

to data cleaning After data cleaning we noticed that some individuals were assigned to arrays based on the raw information

that was not consistent with their final raceethnicity categorization and the assignment algorithm As can be seen in Table S2

this primarily affected a modest number of individuals with a final assignment of mixed raceethnicity who were run on the EUR

array Table S3 shows array assignments for individuals with scanning errors or missing self‐report raceethnicity data whose

12 SI Y Banda et al

final raceethnicity was determined from KPNC administrative databases In this group are the 2140 Whites and 123 Asians

with scanning errors who were run on the AFR array We also note a moderate number of individuals classified as White (267)

who were run on the EAS array

Finally we emphasize that no raceethnicity assignments or re‐assignments were based on genetic information The genetic

information was only used in the detection of the scanning problems for some individuals by comparing their computerized

responses to those on the original forms All self‐report raceethnicity data reported in Results are based on the final cleaned

and adjudicated categories

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
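The pairwise pruning rule described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `ld_prune`, the 0/1/2 genotype coding, the assumption of complete and polymorphic genotypes, and the choice to drop the later SNP of each correlated pair are all assumptions made for illustration (the text does not say which member of a pair was removed or which software was used).

```python
import numpy as np

def ld_prune(genotypes, positions, r2_max=0.5, window_bp=5_000_000):
    """Greedy pairwise LD pruning for the SNPs of one chromosome.

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts (no missing values).
    positions: (n_snps,) base-pair positions, sorted in ascending order.
    Returns the indices of the SNPs that are kept.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions)
    n_snps = genotypes.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for i in range(n_snps):
        if not keep[i]:
            continue
        for j in range(i + 1, n_snps):
            if positions[j] - positions[i] > window_bp:
                break  # positions are sorted, so all later SNPs are farther away
            if not keep[j]:
                continue
            r = np.corrcoef(genotypes[:, i], genotypes[:, j])[0, 1]
            if r * r > r2_max:
                keep[j] = False  # assumption: drop the later SNP of the pair
    return np.flatnonzero(keep)

# Toy example: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated with SNP 0,
# and SNP 3 duplicates SNP 0 but lies more than 5 Mb away.
a = [0, 1, 2, 0, 1, 2]
c = [0, 0, 0, 2, 2, 2]
toy_genotypes = np.array([a, a, c, a]).T
toy_positions = np.array([100, 200, 300, 10_000_000])
kept = ld_prune(toy_genotypes, toy_positions)
```

In the toy example only SNP 1 is removed: SNP 3 is perfectly correlated with SNP 0 but falls outside the 5 Mb window, so the pair is never compared.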

For the initial individual admixture analyses, a set of 43,988 SNPs that were common between the HGDP dataset and our set of 144,799 high‐quality SNPs was used. A set of 38,301 SNPs (used for the global PCA run), remaining after LD pruning and removal of SNPs located in regions with structural variations, was also used for admixture analyses. We found very minimal difference in the admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self‐reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self‐report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority the second continent was South Asia. Hence, this likely does not reflect recent admixture but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self‐reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self‐reported Latinos have genetic ancestry from more than one continent: a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self‐reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self‐reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self‐reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self‐reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, European/West Asian is one of them for nearly all, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.

Table S1 Magnitude of correlation of PC loadings for three supersets

PC    Set1‐Set2 correlation    Set1‐Set3 correlation    Set2‐Set3 correlation
1     0.981                    0.979                    0.980
2     0.876                    0.853                    0.856
3     0.850                    0.816                    0.809
4     0.766                    0.776                    0.752
5     0.658                    0.654                    0.639
6     0.605                    0.604                    0.592
7     0.001                    0.001                    0.210
8     0.043                    0.179                    0.045
9     0.312                    0.068                    0.089
10    0.007                    0.067                    0.201
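As a rough illustration of the quantity tabulated in Table S1, the sketch below computes the magnitude of the Pearson correlation between matching PC loading vectors from two analyses. The magnitude is used because the sign of a principal component is arbitrary. The function name and the assumption that both loading matrices are defined over the same SNPs in the same order are hypothetical; the supplement does not describe how the supersets were constructed or compared.

```python
import numpy as np

def loading_correlation_magnitude(loadings_a, loadings_b):
    """Absolute Pearson correlation between matching PC loading vectors.

    loadings_a, loadings_b: (n_snps, n_pcs) arrays over the same SNPs, same order.
    Returns an array of length n_pcs.
    """
    a = np.asarray(loadings_a, dtype=float)
    b = np.asarray(loadings_b, dtype=float)
    a = a - a.mean(axis=0)  # center each PC's loadings
    b = b - b.mean(axis=0)
    numerator = (a * b).sum(axis=0)
    denominator = np.sqrt((a * a).sum(axis=0) * (b * b).sum(axis=0))
    return np.abs(numerator / denominator)

# Toy example: PC1 loadings agree up to a sign flip, PC2 loadings are uncorrelated.
set1 = np.array([[1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0]])
set2 = np.array([[-1.0, 0.0], [-2.0, 0.0], [-3.0, 1.0], [-4.0, 1.0]])
magnitudes = loading_correlation_magnitude(set1, set2)
```

A sign‐flipped but otherwise identical loading vector gives a magnitude of 1, which is why a table of this kind reports magnitudes rather than signed correlations.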

Table S2 Self‐reported Race/Ethnicity versus Genotyping Array. Race/Ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro‐Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Race/Ethnicity Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

Table S3 Race/Ethnicity derived from KP administrative databases versus genotyping array used for those with missing or mis‐scanned self‐report data

                    Genotyping Array
Race/Ethnicity      EUR     EAS     AFR     LAT
White               2115    267     2140    55
African American    0       0       99      3
Hispanic            5       0       0       255
Asian               5       183     123     1
Other/Uncertain     6       11      43      24
Total               2131    461     2405    338

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                              Number of SNPs after pruning
European/West Asian and Ashkenazi    38050
European only                        37967
South Asian                          38823
East Asian                           38077
Pacific Islander                     38833
African American                     39849
Latino                               38365
All GERA                             38301

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African        European       Native American
Mexican                   2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central‐South American    5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto‐Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                  4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8

Table S6 Distribution of genetic ancestry by self‐reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub‐Saharan African.

Race‐Ethnicity Genetic Ancestry

EW AF EA NA SA EW/AF EW/EA EW/NA EW/SA AF/EA AF/SA EA/NA EA/PI EA/SA PI/SA EW/AF/EA EW/AF/NA EW/AF/SA EW/EA/NA EW/EA/PI EW/EA/SA EW/NA/SA EW/PI/SA AF/EA/PI AF/EA/SA AF/PI/SA EA/PI/SA All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489
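The 5% assignment rule stated in the Table S6 caption can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary input format, and the display ordering of the ancestry labels are assumptions, not the authors' code.

```python
from collections import Counter

# Display order for ancestry labels (an assumption made for this sketch).
ANCESTRY_ORDER = ["EW", "AF", "EA", "NA", "PI", "SA"]

def assign_ancestries(proportions, threshold=0.05):
    """Return the ancestries contributing at least `threshold` of an individual's genome.

    proportions: dict mapping ancestry label to estimated admixture proportion (0 to 1).
    """
    return tuple(g for g in ANCESTRY_ORDER if proportions.get(g, 0.0) >= threshold)

def tabulate(records, threshold=0.05):
    """Count individuals by (self-reported label, assigned ancestry combination).

    records: iterable of (self_report_label, proportions dict) pairs.
    """
    counts = Counter()
    for self_report, proportions in records:
        counts[(self_report, assign_ancestries(proportions, threshold))] += 1
    return counts

# Hypothetical individuals, for illustration only.
records = [
    ("LT", {"EW": 0.80, "NA": 0.17, "AF": 0.03}),  # AF is below 5%, so not assigned
    ("LT", {"EW": 0.60, "NA": 0.30, "AF": 0.10}),
    ("EW", {"EW": 0.99, "SA": 0.01}),
]
counts = tabulate(records)
```

Each cell of a table like Table S6 then corresponds to one key of `counts`, for example `("LT", ("EW", "NA"))`.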

Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records for those with missing or mis‐scanned self‐report race/ethnicity. Abbreviations as in Table S6.

Race‐Ethnicity    EW     AF     EA     SA     EW/AF  EA/NA  EA/PI  EW/EA  EW/NA  EW/SA  EA/SA  SA/PI  EW/AF/EA  EW/AF/NA  EW/AF/SA  EW/EA/NA  EW/EA/PI  EW/EA/SA  EW/SA/NA  EW/SA/PI  EA/SA/PI  Total
White             4302   0      0      0      25     0      0      33     68     131    0      0      0         3         2         3         2         3         2         1         0         4575
Afr Am            0      6      0      0      92     0      0      0      1      0      0      0      0         1         2         0         0         0         0         0         0         102
Asian             4      0      218    8      0      1      41     18     0      2      3      1      1         0         0         1         4         3         0         0         6         311
Latino            37     0      3      0      3      0      0      2      151    0      0      0      1         44        1         5         0         0         8         0         0         255
Other‐uncertain   34     0      3      0      6      0      0      15     8      2      1      0      2         2         1         3         4         0         1         0         2         84

Table S8 Distribution of continental genetic ancestry as a function of self‐reported race/ethnicity

Genetic ancestry by number of continents

No. reported  Race‐Ethnicity  One Continent   Two Continents  Three Continents  All      % of Total
                              Number (%)      Number (%)      Number (%)        Number
One           EW              71992 (94.2)    4288 (5.6)      121 (0.2)         76401
One           AA              230 (8.6)       2362 (88.2)     87 (3.2)          2679
One           LT              235 (4.9)       3135 (65.2)     1437 (29.9)       4807
One           EA              4837 (75.7)     1419 (22.2)     133 (2.1)         6389
One           PI              7 (7.6)         40 (43.5)       45 (48.9)         92
One           SA              272 (59.1)      175 (38.1)      13 (2.8)          460
One           NA              555 (82.3)      87 (12.9)       32 (4.8)          674
One           All             78128 (85.4)    11506 (12.6)    1868 (2.0)        91502    93.9
Two           All             2791 (51.0)     2140 (39.1)     545 (10.0)        5476     5.6
Three         All             48 (9.4)        345 (67.5)      118 (23.1)        511      0.5
All           All             80967 (83.1)    13991 (14.4)    2531 (2.6)        97489

Continental distribution

No. reported  Race‐Ethnicity  1 Continent  2 Continents                       3 Continents
One           EW              100% EW      75% EW/SA, 15% EW/NA
One           AA              100% AF      99% AF/EW
One           LT              99% EW       99% EW/NA                          90% EW/NA/AF
One           EA              100% EA      89% EA/PI, 8% EA/EW
One           PI                           48% PI/EW, 33% PI/EA               71% PI/EW/EA
One           SA              94% SA       75% SA/EW, 14% SA/EA
One           NA              100% EW      77% EW/NA
Two           All             97% EW       60% EW/NA, 22% EW/EA, 13% EW/AF    29% EW/NA/AF, 23% EW/SA/NA, 17% EW/EA/NA
Three         All             90% EW       45% EW/NA, 36% EW/AF, 19% EW/EA    31% EW/EA/NA, 20% EW/AF/NA, 16% EW/SA/NA

Table S9 First‐degree relatives organized by self‐reported race/ethnicity

MZ pairs

                    White   African American   Latino   Asian   Other/Uncertain
White               28      0                  0        0       0
African American    0       0                  0        0       0
Latino              0       0                  2        0       0
Asian               0       0                  0        4       0
Other/Uncertain     0       0                  0        0       0

Parent (column) – Offspring (row)

                    White   African American   Latino   Asian   Other/Uncertain
White               3044    1                  23       6       21
African American    8       54                 0        1       1
Latino              122     6                  175      3       0
Asian               35      2                  0        205     1
Other/Uncertain     32      0                  1        0       0

Full sibs

                    White   African American   Latino   Asian   Other/Uncertain
White               1531    2                  33       5       30
African American            45                 7        0       0
Latino                                         155      2       2
Asian                                                   205     1
Other/Uncertain                                                 0
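The counts in Table S9 can be used to check the concordance rates arithmetically. The sketch below assumes that "concordant" means both relatives fall in the same collapsed category (the diagonal of each matrix) and that the full‐sib table is upper triangular because sib pairs are unordered. The variable and function names are illustrative, not taken from the paper.

```python
import numpy as np

# Row and column order: White, African American, Latino, Asian, Other/Uncertain.
# Parent (column) by offspring (row) counts from Table S9.
parent_child = np.array([
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
])

# Full-sib counts from Table S9; the empty lower triangle is filled with zeros.
full_sibs = np.array([
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
])

def concordance(table):
    """Fraction of pairs on the diagonal, i.e. with matching categories."""
    return np.trace(table) / table.sum()

parent_child_rate = concordance(parent_child)  # 3478 / 3741
full_sib_rate = concordance(full_sibs)         # 1936 / 2018
```

Under these assumptions the totals are 3741 parent‐child pairs and 2018 full‐sib pairs, with concordance of roughly 93% and 96%, in line with the figures quoted in the abstract.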

Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity  Sib 2 Race/Ethnicity  Genetic Ancestry  Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

Table S11 Race/Ethnicity and Genetic Ancestry for Parent‐Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity  Child Race/Ethnicity  Parent Genetic Ancestry  Child Genetic Ancestry  Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1

Figure S6 PCA of individuals run on the AFR array. (A) Individuals reporting only African or African American ethnicity; individuals identified by onomastics as Ethiopian, Eritrean, and Kenyan are depicted separately. (B) Individuals reporting African/African American ethnicity plus at least one additional ethnicity.

Figure S7 PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity, by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8 Global PCA of GERA subjects. (A‐F) Individuals distributed according to continental differentiation. Admixed individuals also observed.

Figure S9 Identification of MZ twin and first‐degree relative pairs by KING. The x‐axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
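The x‐axis quantity in Figure S9, the proportion of SNPs with zero IBS, can be sketched for a single pair of individuals as below. This is not KING's implementation: the 0/1/2 allele‐count coding, the use of negative values for missing calls, and the function name are assumptions made for illustration.

```python
import numpy as np

def ibs0_proportion(g1, g2):
    """Fraction of jointly called SNPs at which two individuals share no allele.

    g1, g2: arrays of 0/1/2 allele counts; negative values mark missing calls
    (an assumption of this sketch). IBS = 0 means opposite homozygotes,
    e.g. TT in one individual and CC in the other.
    """
    g1 = np.asarray(g1)
    g2 = np.asarray(g2)
    called = (g1 >= 0) & (g2 >= 0)
    opposite_homozygotes = np.abs(g1 - g2) == 2
    return (opposite_homozygotes & called).sum() / called.sum()

# Toy pair: five SNPs, one missing call, two opposite-homozygote sites.
person_a = [0, 2, 1, 2, -1]
person_b = [2, 2, 1, 0, 1]
proportion = ibs0_proportion(person_a, person_b)
```

Barring genotyping error, MZ twins and parent‐offspring pairs share at least one allele at every SNP, so their IBS0 proportion should be close to zero, while full sibs show a small positive value.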

File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) estimated to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self‐report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that, for some individuals, the self‐reported race/ethnicity information on the form was not consistent with the computerized information. Further investigation revealed that an artifact had occurred when the forms were originally scanned: a black mark on the glass scanning plate overlaid the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when in fact they did not (according to the original form). Under our algorithm for array assignment, because these individuals were assumed to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented by other data on race/ethnicity in the KP administrative databases as necessary, and principal component scores were recalculated. This error only affected individuals with originally recorded African American race/ethnicity from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. Reviewing the race/ethnicity self‐report of these individuals, the large majority self‐reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self‐reported raceethnicity and genotyping array for those with self‐report

raceethnicity data that was not missing or adjudicated due to scanning errors Because of the time element required for

processing samples the array assignments were made based on raw data from the raceethnicity questions on the survey prior

to data cleaning After data cleaning we noticed that some individuals were assigned to arrays based on the raw information

that was not consistent with their final raceethnicity categorization and the assignment algorithm As can be seen in Table S2

this primarily affected a modest number of individuals with a final assignment of mixed raceethnicity who were run on the EUR

array Table S3 shows array assignments for individuals with scanning errors or missing self‐report raceethnicity data whose

12 SI Y Banda et al

final raceethnicity was determined from KPNC administrative databases In this group are the 2140 Whites and 123 Asians

with scanning errors who were run on the AFR array We also note a moderate number of individuals classified as White (267)

who were run on the EAS array

Finally we emphasize that no raceethnicity assignments or re‐assignments were based on genetic information The genetic

information was only used in the detection of the scanning problems for some individuals by comparing their computerized

responses to those on the original forms All self‐report raceethnicity data reported in Results are based on the final cleaned

and adjudicated categories

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (eg those in the lactase and MHC regions on chromosomes 2 and

6 respectively) pairs of SNPs that had an r2 greater than 05 and within 5 MB of each other were considered and one member

of the pair removed Also removed were SNPs located in regions with inversions such as chromosomes 8p23 and 17q21 These

structural variations have previously been shown to influence PCA results in European ancestry samples As reported in Results

various PCA runs were performed separately for individuals genotyped on different arrays The numbers of SNPs remaining

after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4

For the initial individual admixture analyses a set of 43988 SNPs which were common between the HGDP dataset and our set

of 144799 high quality SNPs was used A set of 38301 SNPs (used for the global PCA run) remaining after LD pruning and

removal of SNPs located in regions with structural variations was also used for admixture analyses We found very minimal

difference in admixture estimates obtained from the two types of analyses

Distribution of continental genetic ancestry as a function of self‐reported raceethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self‐report

raceethnicity groups Among those reporting EuropeanWest Asian raceethnicity 56 had evidence of genetic ancestry from

two continents however for the large majority the second continent was South Asia Hence this likely does not reflect recent

admixture but rather the genetic similarity of West Asians and South Asians For a moderate proportion the second genetic

ancestry is Native American By contrast among the self‐reported AfricansAfrican Americans 882 have evidence of genetic

Y Banda et al 13 SI

ancestry from two continents This represents the EuropeanWest Asian genetic admixture present in most African Americans

that has occurred over 5 centuries For those with a single genetic ancestry that ancestry is African As expected nearly all

self‐reported Latinos have genetic ancestry from more than one continent a substantial proportion (299) have genetic

ancestry from 3 continents ndash EuropeanWest Asian Native American and African while the majority (652) have genetic

ancestry from two continents EuropeanWest Asian and Native American The large majority of self‐reported East Asians have

genetic ancestry that is solely East Asian or East Asian and Pacific Islander The latter combination primarily reflects the close

genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture although that likely

applies to some We do note that for a modest number of individuals the second continental ancestry is EuropeanWest Asian

Similarly for self‐reported South Asians the large percentage (381) corresponding to two continents primarily reflects

genetic similarity of West Asians and South Asians in the admixture analysis Among those self‐reporting only Native American

raceethnicity 823 have a single genetic ancestry which is EuropeanWest Asian although 177 have genetic ancestry from

two or more continents which are EuropeanWest Asian and Native American

Those who self‐reported more than one raceethnicity comprise multiple combinations Among those reporting two 51 have

a single genetic ancestry which is nearly always EuropeanWest Asian Similarly for those with genetic ancestry from more

than one continent for nearly all EuropeanWest Asian is one of them with the second continental group being Native

American (60) East Asian (22) or African (13) The pattern is similar for those with genetic ancestry from three continents

The pattern is also closely reproduced in those reporting 3 raceethnicity categories the large majority with a single continental

genetic ancestry reflects EuropeanWest Asian genetic ancestry For those with two or more continental genetic ancestries

EuropeanWest Asian is nearly always one of them but in this case African ancestry is more prominent than East Asian

ancestry as the second continent

14 SI Y Banda et al

Table S1 Magnitude of correlation of PC loadings for three supersets

PC Set1‐Set2 correlation Set1‐Set3 correlation Set2‐Set3 correlation

1 0981 0979 0980

2 0876 0853 0856

3 0850 0816 0809

4 0766 0776 0752

5 0658 0654 0639

6 0605 0604 0592

7 0001 0001 0210

8 0043 0179 0045

9 0312 0068 0089

10 0007 0067 0201

Y Banda et al 15 SI

Table S2 Self‐reported RaceEthnicity versus Genotyping Array RaceEthnicity abbreviations EW = EuropeanWest Asian AA = AfricanAfrican AmericaAfro‐Caribbean EA = East Asian NA = Native American LT = Latino PI = Pacific Islander SA = South Asian

RaceEthnicity Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

16 SI Y Banda et al

Table S3 RaceEthnicity derived from KP administrative databases versus genotyping array used for those with missing or mis‐scanned self‐report data

RaceEthnicity Genotyping Array

EUR EAS AFR LAT

White 2115 267 2140 55

African American 0 0 99 3

Hispanic 5 0 0 255

Asian 5 183 123 1

OtherUncertain 6 11 43 24

Total 2131 461 2405 338

Y Banda et al 17 SI

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run Number of SNPs after pruning

EuropeanWest Asian and Ashkenazi 38050

European only 37967

South Asian 38823

East Asian 38077

Pacific Islander 38833

African American 39849

Latino 38365

All GERA 38301

18 SI Y Banda et al

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

Nationality Ancestral admixture proportion ()

African European Native American

Mexican 23 plusmn 50 670 plusmn 145 307 plusmn 135

Central‐South American 55 plusmn 108 654 plusmn 179 291 plusmn 163

Puerto‐Rican 123 plusmn 141 755 plusmn 148 124 plusmn 76

Cuban 127 plusmn 205 797 plusmn 200 76 plusmn 81

LAT mean 44 plusmn 79 665 plusmn 158 286 plusmn 148

Y Banda et al 19 SI

Table S6 Distribution of genetic ancestry by self‐reported raceethnicity A particular genetic ancestry was assigned to an individual if at least 5 of that individualrsquos ancestry was estimated from that group RaceEthnicity abbreviations as in Table S2 Genetic ancestry abbreviations are the same except for AF which represents sub‐Saharan African Race‐

Ethnicity Genetic Ancestry

EW AF EA NA SA EW

AF

EW

EA

EW

NA

EW

SA

AF

EA

AF

SA

EA

NA

EA

PI

EA

SA

PI

SA

EW

AF

EA

EW

AF

NA

EW

AF

SA

EW

EA

NA

EW

EA

PI

EW

EA

SA

EW

NA

SA

EW

PI

SA

AF

EA

PI

AF

EA

SA

AF

PI

SA

EA

PI

SA

All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114

20 SI Y Banda et al

AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8

Y Banda et al 21 SI

EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489

22 SI Y Banda et al

Table S7 Distribution of genetic ancestry by raceethnicity as reported in the KP electronic health records for those with missing or mis‐scanned self‐report raceethnicity Abbreviations as in Table S6

Race‐Ethnicity EW AF EA SA EW

AF EA NA

EA PI

EW EA

EW NA

EW SA

EA SA

SA PI

EW AF EA

EW AF NA

EW AF SA

EW EA NA

EW EA PI

EW EA SA

EW SA NA

EW SA PI

EA SA PI

Total

White 4302 0 0 0 25 0 0 33 68 131 0 0 0 3 2 3 2 3 2 1 0 4575

Afr Am 0 6 0 0 92 0 0 0 1 0 0 0 0 1 2 0 0 0 0 0 0 102

Asian 4 0 218 8 0 1 41 18 0 2 3 1 1 0 0 1 4 3 0 0 6 311

Latino 37 0 3 0 3 0 0 2 151 0 0 0 1 44 1 5 0 0 8 0 0 255

Other‐uncertain

34 0 3 0 6 0 0 15 8 2 1 0 2 2 1 3 4 0 1 0 2 84

Y Banda et al 23 SI

Table S8 Distribution of continental genetic ancestry as a function of self‐reported raceethnicity

Race‐Ethnicity Genetic Ancestry

One Continent Two Continents Three Continents All Continental Distribution

Number Number Number Number 1 Continent 2 Continents 3 Continents

One EW 71992 942 4288 56 121 02 76401 100 EW 75 EWSA

15 EWNA

AA 230 86 2362 882 87 32 2679 100 AF 99 AFEW

LT 235 49 3135 652 1437 299 4807 99 EW 99 EWNA 90 EWNAAF

EA 4837 757 1419 222 133 21 6389 100 EA 89 EAPI

8 EAEW

PI 7 76 40 435 45 489 92 48 PIEW

33 PIEA

71 PIEWEA

SA 272 591 175 381 13 28 460 94 SA 75 SAEW

14 SAEA

24 SI Y Banda et al

NA 555 823 87 129 32 48 674 100 EW 77 EWNA

All 78128 854 11506 126 1868 20 91502

of

Total

939

Two All 2791 510 2140 391 545 100 5476 97 EW

60 EWNA

22 EWEA

13 EWAF

29 EWNAAF

23 EWSANA

17 EWEANA

of

Total

56

Three All 48 94 345 675 118 231 511 90 EW 45 EWNA

36 EWAF

19 EWEA

31 EWEANA

20 EWAFNA

16 EWSANA

of

Total

05

All All 80967 831 13991 144 2531 26 97489

Y Banda et al 25 SI

Table S9 First degree relatives organized by self‐reported raceethnicity

MZ pair

White African American Latino Asian OtherUncertain

White 28 0 0 0 0

African American 0 0 0 0 0

Latino 0 0 2 0 0

Asian 0 0 0 4 0

OtherUncertain 0 0 0 0 0

Parent (column) ndash Offspring (row)

White African American Latino Asian OtherUncertain

White 3044 1 23 6 21

African American 8 54 0 1 1

Latino 122 6 175 3 0

Asian 35 2 0 205 1

OtherUncertain 32 0 1 0 0

Full sibs

White African American Latino Asian OtherUncertain

26 SI Y Banda et al

White 1531 2 33 5 30

African American 45 7 0 0

Latino 155 2 2

Asian 205 1

OtherUncertain 0

Y Banda et al 27 SI

Table S10 RaceEthnicity and Genetic Ancestry for Sib Pairs Discordant for RaceEthnicity Abbreviations as in Table S6

Sib 1 RaceEthnicity Sib 2 RaceEthnicity Genetic Ancestry Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

28 SI Y Banda et al

Table S11 RaceEthnicity and Genetic Ancestry for Parent‐Child pairs Discordant for RaceEthnicity Abbreviations as in Table S6

Parent RaceEthnicity Child RaceEthnicity

Parent Genetic Ancestry

Child Genetic Ancestry

Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


Figure S7. PCA of individuals run on the LAT array. (A) Individuals reporting Latino ethnicity only and Native American ethnicity only. (B) Individuals reporting Latino ethnicity by nationality. (C) Individuals reporting Latino ethnicity and at least one additional ethnicity, and also individuals reporting Native American and European ethnicities only.

Figure S8. Global PCA of GERA subjects. (A-F) Individuals distributed according to continental differentiation. Admixed individuals also observed.

Figure S9. Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC.
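The x-axis quantity in Figure S9 can be illustrated with a short sketch. This is not KING's implementation or interface: the function name `ibs0_proportion` and the genotype coding (counts of one allele, 0/1/2, with negative values marking missing calls) are assumptions made for the example. A SNP has zero IBS for a pair when the two individuals are opposite homozygotes, such as TT and CC.

```python
import numpy as np

def ibs0_proportion(g1, g2):
    """Proportion of jointly called SNPs at which two individuals share no allele.

    g1, g2: genotype vectors coded as allele counts (0, 1, 2);
    negative values are treated as missing (hypothetical coding).
    """
    g1 = np.asarray(g1)
    g2 = np.asarray(g2)
    called = (g1 >= 0) & (g2 >= 0)
    if not called.any():
        raise ValueError("no SNPs called in both individuals")
    # Opposite homozygotes (0 vs 2) are the only zero-IBS configuration.
    ibs0 = called & (np.abs(g1 - g2) == 2)
    return ibs0.sum() / called.sum()

# Toy example: two opposite-homozygote SNPs among five jointly called SNPs.
pair_a = [0, 2, 1, 2, 0, -1]
pair_b = [2, 2, 1, 0, 1, 0]
print(ibs0_proportion(pair_a, pair_b))
```

Because a parent and child always share at least one allele at every autosomal SNP, and MZ twins share both, these pair types are expected to show a zero-IBS proportion near zero apart from genotyping error.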


File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) estimated to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that, for some individuals, the self-reported race/ethnicity information on the form was not consistent with the computerized information. Further investigation revealed that an artifact had occurred when the forms were originally scanned: a black mark on the glass scanning plate overlaid the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when, according to the original form, they did not. Because these individuals were assumed to have some African ancestry, our algorithm for array assignment placed them on the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented as necessary by other data on race/ethnicity in the KP administrative databases, and principal component scores were recalculated. This error affected only individuals whose African American race/ethnicity was originally recorded from the computerized scanning. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. On review of the race/ethnicity self-report of these individuals, the large majority self-reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those whose self-report race/ethnicity data were not missing or adjudicated due to scanning errors. Because of the time required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals had been assigned to arrays, based on the raw information, in a way that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was used only in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
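The pairwise pruning rule described above can be sketched in code. This is a minimal illustration, not the authors' pipeline: the function name `ld_prune`, the greedy left-to-right order, and the choice to drop the later SNP of each pair are assumptions, since the text does not say which member of a pair was removed. Genotypes are assumed to be coded as allele counts (0, 1, 2) for a single chromosome, with positions sorted in ascending order; the removal of SNPs in inversion regions is not shown.

```python
import numpy as np

def ld_prune(genotypes, positions, r2_max=0.5, window_bp=5_000_000):
    """Greedy pairwise LD pruning on one chromosome.

    genotypes: array of shape (n_individuals, n_snps), allele counts 0/1/2
    positions: base-pair positions of the SNPs, sorted ascending
    Returns the indices of the SNPs that are kept.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions)
    n_snps = genotypes.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for i in range(n_snps):
        if not keep[i]:
            continue
        for j in range(i + 1, n_snps):
            if positions[j] - positions[i] > window_bp:
                break  # positions are sorted, so later SNPs are farther away
            if not keep[j]:
                continue
            # Squared Pearson correlation of allele counts as the r^2 measure.
            # A monomorphic SNP gives NaN here, which never exceeds r2_max.
            r = np.corrcoef(genotypes[:, i], genotypes[:, j])[0, 1]
            if r * r > r2_max:
                keep[j] = False  # assumption: drop the later SNP of the pair
    return np.flatnonzero(keep)

# Toy data: SNP 1 duplicates SNP 0 (r^2 = 1), SNP 2 is uncorrelated with SNP 0,
# and SNP 3 duplicates SNP 0 but lies more than 5 Mb from every other SNP.
g = np.array([
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [2, 2, 1, 2],
    [1, 1, 2, 1],
    [0, 0, 1, 0],
    [2, 2, 1, 2],
])
pos = [1_000_000, 2_000_000, 3_000_000, 9_000_000]
print(ld_prune(g, pos))
```

In the toy data only SNP 1 is removed: SNP 3 is perfectly correlated with SNP 0 but falls outside the 5 Mb window, so the pair is never compared.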

For the initial individual admixture analyses, a set of 43,988 SNPs that were common between the HGDP dataset and our set of 144,799 high-quality SNPs was used. A set of 38,301 SNPs (used for the global PCA run) remaining after LD pruning and removal of SNPs located in regions with structural variations was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority, the second continent was South Asia. Hence this likely does not reflect recent admixture, but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent: a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, European/West Asian is one of them for nearly all, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: the large majority with a single continental genetic ancestry reflects European/West Asian genetic ancestry. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.
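The percentages quoted in this section follow directly from the counts in Table S8. As a worked check, the sketch below recomputes them for several rows; the helper name `continent_percentages` and the rounding to one decimal place are assumptions of this illustration. A few other rows (for example SA and NA) come out 0.1 different when recomputed this way, which may reflect how the source percentages were rounded.

```python
# Counts of individuals with genetic ancestry from one, two, and three
# continents, copied from Table S8.
TABLE_S8_COUNTS = {
    "EW": (71992, 4288, 121),
    "AA": (230, 2362, 87),
    "LT": (235, 3135, 1437),
    "Two categories, all": (2791, 2140, 545),
}

def continent_percentages(counts):
    """Convert (one, two, three)-continent counts to percentages of the row total."""
    total = sum(counts)
    return tuple(round(100 * c / total, 1) for c in counts)

for group, counts in TABLE_S8_COUNTS.items():
    print(group, sum(counts), continent_percentages(counts))
```

For example, the EW row gives 5.6% with ancestry from two continents, and the AA row gives 88.2%, as stated in the text.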

Table S1. Magnitude of correlation of PC loadings for three supersets

PC   Set1-Set2 correlation   Set1-Set3 correlation   Set2-Set3 correlation
 1   0.981                   0.979                   0.980
 2   0.876                   0.853                   0.856
 3   0.850                   0.816                   0.809
 4   0.766                   0.776                   0.752
 5   0.658                   0.654                   0.639
 6   0.605                   0.604                   0.592
 7   0.001                   0.001                   0.210
 8   0.043                   0.179                   0.045
 9   0.312                   0.068                   0.089
10   0.007                   0.067                   0.201

Table S2. Self-reported Race/Ethnicity versus Genotyping Array. Race/Ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

                Genotyping Array
Race/Ethnicity  EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

Table S3. Race/Ethnicity derived from KP administrative databases versus genotyping array used for those with missing or mis-scanned self-report data

                    Genotyping Array
Race/Ethnicity       EUR    EAS    AFR    LAT
White               2115    267   2140     55
African American       0      0     99      3
Hispanic               5      0      0    255
Asian                  5    183    123      1
Other/Uncertain        6     11     43     24
Total               2131    461   2405    338

Table S4. Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                              Number of SNPs after pruning
European/West Asian and Ashkenazi    38050
European only                        37967
South Asian                          38823
East Asian                           38077
Pacific Islander                     38833
African American                     39849
Latino                               38365
All GERA                             38301

Table S5. Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African        European       Native American
Mexican                    2.3 ± 5.0     67.0 ± 14.5    30.7 ± 13.5
Central-South American     5.5 ± 10.8    65.4 ± 17.9    29.1 ± 16.3
Puerto-Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0     7.6 ± 8.1
LAT mean                   4.4 ± 7.9     66.5 ± 15.8    28.6 ± 14.8

Table S6. Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Race/Ethnicity   Genetic Ancestry: EW AF EA NA SA EWAF EWEA EWNA EWSA AFEA AFSA EANA EAPI EASA PISA EWAFEA EWAFNA EWAFSA EWEANA EWEAPI EWEASA EWNASA EWPISA AFEAPI AFEASA AFPISA EAPISA All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489
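The 5% rule stated in the Table S6 caption maps each individual's estimated admixture proportions to a combined label such as EWNA. A minimal sketch of that mapping follows; the function name `ancestry_label`, the dictionary input, and the fixed concatenation order are assumptions of this example (the order is inferred from labels that appear in the tables, such as EWAFNA and EAPISA), not details given in the source.

```python
# Order in which ancestry codes are concatenated, inferred from labels such
# as EWAFNA and EAPISA in Table S6 (an assumption of this sketch).
ANCESTRY_ORDER = ("EW", "AF", "EA", "NA", "PI", "SA")

def ancestry_label(proportions, threshold=0.05):
    """Concatenate the codes of all groups contributing at least `threshold`.

    proportions: dict mapping ancestry code to estimated admixture fraction.
    """
    return "".join(
        code for code in ANCESTRY_ORDER
        if proportions.get(code, 0.0) >= threshold
    )

# Hypothetical individual: African and South Asian components fall below 5%.
print(ancestry_label({"EW": 0.62, "NA": 0.33, "AF": 0.04, "SA": 0.01}))
```

With this rule, small components below the threshold do not add a continent to an individual's label, which is why the number of assigned ancestries is sensitive to the 5% cutoff.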


Table S7. Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race/Ethnicity         EW     AF     EA     SA   EWAF   EANA   EAPI   EWEA   EWNA   EWSA   EASA   SAPI EWAFEA EWAFNA EWAFSA EWEANA EWEAPI EWEASA EWSANA EWSAPI EASAPI  Total
White                4302      0      0      0     25      0      0     33     68    131      0      0      0      3      2      3      2      3      2      1      0   4575
African American        0      6      0      0     92      0      0      0      1      0      0      0      0      1      2      0      0      0      0      0      0    102
Asian                   4      0    218      8      0      1     41     18      0      2      3      1      1      0      0      1      4      3      0      0      6    311
Latino                 37      0      3      0      3      0      0      2    151      0      0      0      1     44      1      5      0      0      8      0      0    255
Other-uncertain        34      0      3      0      6      0      0     15      8      2      1      0      2      2      1      3      4      0      1      0      2     84

Table S8. Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

                        One Continent    Two Continents  Three Continents      All               Continental Distribution
Categories  Group     Number        %   Number        %   Number        %   Number  % of Total   1 Continent   2 Continents                  3 Continents
One         EW         71992     94.2     4288      5.6      121      0.2    76401               100% EW       75% EWSA, 15% EWNA
One         AA           230      8.6     2362     88.2       87      3.2     2679               100% AF       99% AFEW
One         LT           235      4.9     3135     65.2     1437     29.9     4807               99% EW        99% EWNA                      90% EWNAAF
One         EA          4837     75.7     1419     22.2      133      2.1     6389               100% EA       89% EAPI, 8% EAEW
One         PI             7      7.6       40     43.5       45     48.9       92                             48% PIEW, 33% PIEA            71% PIEWEA
One         SA           272     59.1      175     38.1       13      2.8      460               94% SA        75% SAEW, 14% SAEA
One         NA           555     82.3       87     12.9       32      4.8      674               100% EW       77% EWNA
One         All        78128     85.4    11506     12.6     1868      2.0    91502        93.9
Two         All         2791     51.0     2140     39.1      545     10.0     5476         5.6   97% EW        60% EWNA, 22% EWEA, 13% EWAF  29% EWNAAF, 23% EWSANA, 17% EWEANA
Three       All           48      9.4      345     67.5      118     23.1      511         0.5   90% EW        45% EWNA, 36% EWAF, 19% EWEA  31% EWEANA, 20% EWAFNA, 16% EWSANA
All         All        80967     83.1    13991     14.4     2531      2.6    97489


Page 20: Characterizing Race/Ethnicity and Genetic Ancestry … › ... › genetics › 200 › 4 › 1285.full.pdfSelf-reported African Americans and Latinos show ed extensive European and

Y Banda et al 9 SI

Figure S8 Global PCA of GERA subjects A‐F Individuals distributed according to continental differentiation Admixed individuals also observed

10 SI Y Banda et al

Figure S9 Identification of MZ twin and first‐degree relative pairs by KING The x‐axis represents the proportion of SNPs with zero IBS (Identity by State) eg TT and CC

Y Banda et al 11 SI

File S1

Supplementary Methods and Results

Adjudication of RaceEthnicity Information

As described in Methods PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry

of the GERA cohort In so doing we discovered a large number of individuals (2140) run on the AFR array who were estimated

to have 100 EuropeanWest Asian genetic ancestry and a smaller number (123) to have 100 East Asian genetic ancestry A

small number (276) of individuals run on the EAS array were estimated to have 100 European ancestry This led to further

investigation of the self‐report assignments for these individuals

Direct examination of the original survey forms for these subjects revealed that the self‐reported raceethnicity information on

the form was not consistent with the computerized information for some individuals Further investigation revealed that an

artifact had occurred when the forms were originally scanned due to a black mark on the glass scanning plate overlaying the

box for African American which led to a number of subjects being recorded as having African American raceethnicity (in

addition to another raceethnicity) when in fact they did not (according to the original form) By our algorithm for array

assignment because these individuals were assumed to have some African ancestry they were assigned to the AFR array

These individuals were then adjudicated to raceethnicity categories based on the actual survey information supplemented by

other data on raceethnicity in the KP administrative databases as necessary and principal component scores were

recalculated This error only affected individuals with originally recorded African American raceethnicity from the

computerized scanning The individuals with 100 EuropeanWest Asian genetic ancestry that were run on the EAS array had

no scanning errors and hence were not adjudicated Reviewing the raceethnicity self‐report of these individuals the large

majority self‐reported both East Asian and European raceethnicity

Table S2 displays the relationship between self‐reported raceethnicity and genotyping array for those with self‐report

raceethnicity data that was not missing or adjudicated due to scanning errors Because of the time element required for

processing samples the array assignments were made based on raw data from the raceethnicity questions on the survey prior

to data cleaning After data cleaning we noticed that some individuals were assigned to arrays based on the raw information

that was not consistent with their final raceethnicity categorization and the assignment algorithm As can be seen in Table S2

this primarily affected a modest number of individuals with a final assignment of mixed raceethnicity who were run on the EUR

array Table S3 shows array assignments for individuals with scanning errors or missing self‐report raceethnicity data whose

12 SI Y Banda et al

final raceethnicity was determined from KPNC administrative databases In this group are the 2140 Whites and 123 Asians

with scanning errors who were run on the AFR array We also note a moderate number of individuals classified as White (267)

who were run on the EAS array

Finally we emphasize that no raceethnicity assignments or re‐assignments were based on genetic information The genetic

information was only used in the detection of the scanning problems for some individuals by comparing their computerized

responses to those on the original forms All self‐report raceethnicity data reported in Results are based on the final cleaned

and adjudicated categories

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (eg those in the lactase and MHC regions on chromosomes 2 and

6 respectively) pairs of SNPs that had an r2 greater than 05 and within 5 MB of each other were considered and one member

of the pair removed Also removed were SNPs located in regions with inversions such as chromosomes 8p23 and 17q21 These

structural variations have previously been shown to influence PCA results in European ancestry samples As reported in Results

various PCA runs were performed separately for individuals genotyped on different arrays The numbers of SNPs remaining

after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4

For the initial individual admixture analyses a set of 43988 SNPs which were common between the HGDP dataset and our set

of 144799 high quality SNPs was used A set of 38301 SNPs (used for the global PCA run) remaining after LD pruning and

removal of SNPs located in regions with structural variations was also used for admixture analyses We found very minimal

difference in admixture estimates obtained from the two types of analyses

Distribution of continental genetic ancestry as a function of self‐reported raceethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self‐report

raceethnicity groups Among those reporting EuropeanWest Asian raceethnicity 56 had evidence of genetic ancestry from

two continents however for the large majority the second continent was South Asia Hence this likely does not reflect recent

admixture but rather the genetic similarity of West Asians and South Asians For a moderate proportion the second genetic

ancestry is Native American By contrast among the self‐reported AfricansAfrican Americans 882 have evidence of genetic

Y Banda et al 13 SI

ancestry from two continents This represents the EuropeanWest Asian genetic admixture present in most African Americans

that has occurred over 5 centuries For those with a single genetic ancestry that ancestry is African As expected nearly all

self‐reported Latinos have genetic ancestry from more than one continent a substantial proportion (299) have genetic

ancestry from 3 continents ndash EuropeanWest Asian Native American and African while the majority (652) have genetic

ancestry from two continents EuropeanWest Asian and Native American The large majority of self‐reported East Asians have

genetic ancestry that is solely East Asian or East Asian and Pacific Islander The latter combination primarily reflects the close

genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture although that likely

applies to some We do note that for a modest number of individuals the second continental ancestry is EuropeanWest Asian

Similarly for self‐reported South Asians the large percentage (381) corresponding to two continents primarily reflects

genetic similarity of West Asians and South Asians in the admixture analysis Among those self‐reporting only Native American

raceethnicity 823 have a single genetic ancestry which is EuropeanWest Asian although 177 have genetic ancestry from

two or more continents which are EuropeanWest Asian and Native American

Those who self‐reported more than one raceethnicity comprise multiple combinations Among those reporting two 51 have

a single genetic ancestry which is nearly always EuropeanWest Asian Similarly for those with genetic ancestry from more

than one continent for nearly all EuropeanWest Asian is one of them with the second continental group being Native

American (60) East Asian (22) or African (13) The pattern is similar for those with genetic ancestry from three continents

The pattern is also closely reproduced in those reporting 3 raceethnicity categories the large majority with a single continental

genetic ancestry reflects EuropeanWest Asian genetic ancestry For those with two or more continental genetic ancestries

EuropeanWest Asian is nearly always one of them but in this case African ancestry is more prominent than East Asian

ancestry as the second continent

14 SI Y Banda et al

Table S1 Magnitude of correlation of PC loadings for three supersets

PC Set1‐Set2 correlation Set1‐Set3 correlation Set2‐Set3 correlation

1 0981 0979 0980

2 0876 0853 0856

3 0850 0816 0809

4 0766 0776 0752

5 0658 0654 0639

6 0605 0604 0592

7 0001 0001 0210

8 0043 0179 0045

9 0312 0068 0089

10 0007 0067 0201

Y Banda et al 15 SI

Table S2 Self‐reported RaceEthnicity versus Genotyping Array RaceEthnicity abbreviations EW = EuropeanWest Asian AA = AfricanAfrican AmericaAfro‐Caribbean EA = East Asian NA = Native American LT = Latino PI = Pacific Islander SA = South Asian

RaceEthnicity Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

16 SI Y Banda et al

Table S3 RaceEthnicity derived from KP administrative databases versus genotyping array used for those with missing or mis‐scanned self‐report data

RaceEthnicity Genotyping Array

EUR EAS AFR LAT

White 2115 267 2140 55

African American 0 0 99 3

Hispanic 5 0 0 255

Asian 5 183 123 1

OtherUncertain 6 11 43 24

Total 2131 461 2405 338

Y Banda et al 17 SI

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run Number of SNPs after pruning

EuropeanWest Asian and Ashkenazi 38050

European only 37967

South Asian 38823

East Asian 38077

Pacific Islander 38833

African American 39849

Latino 38365

All GERA 38301

18 SI Y Banda et al

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

Nationality Ancestral admixture proportion ()

African European Native American

Mexican 23 plusmn 50 670 plusmn 145 307 plusmn 135

Central‐South American 55 plusmn 108 654 plusmn 179 291 plusmn 163

Puerto‐Rican 123 plusmn 141 755 plusmn 148 124 plusmn 76

Cuban 127 plusmn 205 797 plusmn 200 76 plusmn 81

LAT mean 44 plusmn 79 665 plusmn 158 286 plusmn 148

Y Banda et al 19 SI

Table S6. Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Rows: Race-Ethnicity. Columns: Genetic Ancestry, in this order:

EW, AF, EA, NA, SA, EWAF, EWEA, EWNA, EWSA, AFEA, AFSA, EANA, EAPI, EASA, PISA, EWAFEA, EWAFNA, EWAFSA, EWEANA, EWEAPI, EWEASA, EWNASA, EWPISA, AFEAPI, AFEASA, AFPISA, EAPISA, All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489
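The assignment rule in the Table S6 caption (a genetic ancestry is counted for an individual when at least 5% of that individual's estimated ancestry comes from that group) can be illustrated with a minimal Python sketch. The function name `ancestry_label`, the constant `ANCESTRY_ORDER`, the order in which codes are concatenated, and the example proportions are all assumptions made for illustration; they are not taken from the authors' pipeline.

```python
# Hypothetical label order used only for this illustration.
ANCESTRY_ORDER = ("EW", "AF", "EA", "NA", "PI", "SA")


def ancestry_label(proportions, threshold=0.05):
    """Concatenate the codes of every ancestry whose estimated share
    is at least `threshold` (5% by default, as in the caption)."""
    return "".join(
        code for code in ANCESTRY_ORDER
        if proportions.get(code, 0.0) >= threshold
    )


# Made-up admixture estimates for one individual (fractions sum to 1).
example = {"EW": 0.62, "NA": 0.33, "AF": 0.04, "EA": 0.01}
label = ancestry_label(example)
print(label)
```

In this made-up case the African and East Asian shares fall below the 5% cutoff, so only the European/West Asian and Native American components contribute to the label.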


Table S7. Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Genetic ancestry (rows) by EHR race/ethnicity (columns)

Genetic Ancestry    White   Afr Am   Asian   Latino   Other-uncertain
EW                   4302        0       4       37                34
AF                      0        6       0        0                 0
EA                      0        0     218        3                 3
SA                      0        0       8        0                 0
EWAF                   25       92       0        3                 6
EANA                    0        0       1        0                 0
EAPI                    0        0      41        0                 0
EWEA                   33        0      18        2                15
EWNA                   68        1       0      151                 8
EWSA                  131        0       2        0                 2
EASA                    0        0       3        0                 1
SAPI                    0        0       1        0                 0
EWAFEA                  0        0       1        1                 2
EWAFNA                  3        1       0       44                 2
EWAFSA                  2        2       0        1                 1
EWEANA                  3        0       1        5                 3
EWEAPI                  2        0       4        0                 4
EWEASA                  3        0       3        0                 0
EWSANA                  2        0       0        8                 1
EWSAPI                  1        0       0        0                 0
EASAPI                  0        0       6        0                 2
Total                4575      102     311      255                84


Table S8. Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Race-Ethnicity     One Continent      Two Continents     Three Continents    All
                   Number      %      Number      %      Number      %       Number
One    EW           71992   94.2        4288    5.6         121    0.2        76401
       AA             230    8.6        2362   88.2          87    3.2         2679
       LT             235    4.9        3135   65.2        1437   29.9         4807
       EA            4837   75.7        1419   22.2         133    2.1         6389
       PI               7    7.6          40   43.5          45   48.9           92
       SA             272   59.1         175   38.1          13    2.8          460
       NA             555   82.3          87   12.9          32    4.8          674
       All          78128   85.4       11506   12.6        1868    2.0        91502   (93.9% of Total)
Two    All           2791   51.0        2140   39.1         545   10.0         5476   (5.6% of Total)
Three  All             48    9.4         345   67.5         118   23.1          511   (0.5% of Total)
All    All          80967   83.1       13991   14.4        2531    2.6        97489

Continental Distribution

Race-Ethnicity     1 Continent    2 Continents                      3 Continents
One    EW          100% EW        75% EWSA, 15% EWNA
       AA          100% AF        99% AFEW
       LT          99% EW         99% EWNA                          90% EWNAAF
       EA          100% EA        89% EAPI, 8% EAEW
       PI                         48% PIEW, 33% PIEA                71% PIEWEA
       SA          94% SA         75% SAEW, 14% SAEA
       NA          100% EW        77% EWNA
Two    All         97% EW         60% EWNA, 22% EWEA, 13% EWAF      29% EWNAAF, 23% EWSANA, 17% EWEANA
Three  All         90% EW         45% EWNA, 36% EWAF, 19% EWEA      31% EWEANA, 20% EWAFNA, 16% EWSANA


Table S9. First-degree relatives organized by self-reported race/ethnicity

MZ pair

                     White   African American   Latino   Asian   Other/Uncertain
White                   28                  0        0       0                 0
African American         0                  0        0       0                 0
Latino                   0                  0        2       0                 0
Asian                    0                  0        0       4                 0
Other/Uncertain          0                  0        0       0                 0

Parent (column) – Offspring (row)

                     White   African American   Latino   Asian   Other/Uncertain
White                 3044                  1       23       6                21
African American         8                 54        0       1                 1
Latino                 122                  6      175       3                 0
Asian                   35                  2        0     205                 1
Other/Uncertain         32                  0        1       0                 0

Full sibs

                     White   African American   Latino   Asian   Other/Uncertain
White                 1531                  2       33       5                30
African American                           45        7       0                 0
Latino                                             155       2                 2
Asian                                                      205                 1
Other/Uncertain                                                                0
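As a rough check on how the Table S9 counts relate to the concordance rates quoted in the abstract (93% for parent-child pairs, 96% for full-sib pairs), the following Python sketch computes the share of pairs on the diagonal of each matrix. The helper name `concordance` is an assumption for illustration, and treating the five collapsed categories as the unit of concordance is also an assumption; the authors may have computed concordance differently.

```python
import numpy as np

# Counts transcribed from Table S9. Category order for rows and columns:
# White, African American, Latino, Asian, Other/Uncertain.
parent_offspring = np.array([
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
])

# Full-sib pairs are unordered, so only the upper triangle is filled.
full_sibs = np.array([
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
])


def concordance(counts):
    """Fraction of pairs in which both members fall in the same category."""
    return np.trace(counts) / counts.sum()


po_rate = concordance(parent_offspring)
sib_rate = concordance(full_sibs)
print(f"parent-offspring: {parent_offspring.sum()} pairs, {po_rate:.1%} concordant")
print(f"full sibs:        {full_sibs.sum()} pairs, {sib_rate:.1%} concordant")
```

The pair totals (3741 and 2018) match the numbers of parent-child and full-sib pairs given in the abstract, and the diagonal shares round to the reported 93% and 96%.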


Table S10. Race/ethnicity and genetic ancestry for sib pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Columns: Sib 1 Race/Ethnicity; Sib 2 Race/Ethnicity; Genetic Ancestry; Number

Both Sibs Self-Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self-Report, One Sib EHR

EW Latino EW 1

EWLT Other/Uncertain EW 1

LT White EW 1

LTEA Other/Uncertain EWEAPI 1

EAPI Other/Uncertain EA 1

Both Sibs EHR

White Latino EWNA 1


Table S11. Race/ethnicity and genetic ancestry for parent-child pairs discordant for race/ethnicity. Abbreviations as in Table S6.

Columns: Parent Race/Ethnicity; Child Race/Ethnicity; Parent Genetic Ancestry; Child Genetic Ancestry; Number

Parent and Child Self-Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self-Report, Child EHR

EW Other/Uncertain EW EW 1

EW Other/Uncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT Other/Uncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR, Child Self-Report

Latino EW EW EW 1

Other/Uncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

Other/Uncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


Figure S9. Identification of MZ twin and first-degree relative pairs by KING. The x-axis represents the proportion of SNPs with zero IBS (identity by state), e.g., TT and CC


File S1

Supplementary Methods and Results

Adjudication of Race/Ethnicity Information

As described in Methods, PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry of the GERA cohort. In so doing, we discovered a large number of individuals (2140) run on the AFR array who were estimated to have 100% European/West Asian genetic ancestry, and a smaller number (123) estimated to have 100% East Asian genetic ancestry. A small number (276) of individuals run on the EAS array were estimated to have 100% European ancestry. This led to further investigation of the self-report assignments for these individuals.

Direct examination of the original survey forms for these subjects revealed that, for some individuals, the self-reported race/ethnicity information on the form was not consistent with the computerized information. Further investigation revealed that an artifact had occurred when the forms were originally scanned: a black mark on the glass scanning plate overlay the box for African American, which led to a number of subjects being recorded as having African American race/ethnicity (in addition to another race/ethnicity) when, according to the original form, they did not. Because our array assignment algorithm assumed these individuals to have some African ancestry, they were assigned to the AFR array. These individuals were then adjudicated to race/ethnicity categories based on the actual survey information, supplemented as necessary by other data on race/ethnicity in the KP administrative databases, and principal component scores were recalculated. This error affected only individuals whose computerized scan originally recorded African American race/ethnicity. The individuals with 100% European/West Asian genetic ancestry who were run on the EAS array had no scanning errors and hence were not adjudicated. On review of their race/ethnicity self-reports, the large majority of these individuals had reported both East Asian and European race/ethnicity.

Table S2 displays the relationship between self-reported race/ethnicity and genotyping array for those whose self-report race/ethnicity data were neither missing nor adjudicated due to scanning errors. Because of the time required for processing samples, the array assignments were made based on raw data from the race/ethnicity questions on the survey, prior to data cleaning. After data cleaning, we noticed that some individuals had been assigned to arrays based on raw information that was not consistent with their final race/ethnicity categorization and the assignment algorithm. As can be seen in Table S2, this primarily affected a modest number of individuals with a final assignment of mixed race/ethnicity who were run on the EUR array. Table S3 shows array assignments for individuals with scanning errors or missing self-report race/ethnicity data, whose final race/ethnicity was determined from KPNC administrative databases. In this group are the 2140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was only used in the detection of the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
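The pairwise pruning rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `ld_prune`, the greedy left-to-right scan, the choice to drop the later SNP of each correlated pair, and the use of the squared Pearson correlation of allele counts as r² are all assumptions, since the text does not say which member of a pair was removed or which software was used. The toy genotypes are invented.

```python
import numpy as np


def ld_prune(genotypes, positions, r2_max=0.5, window_bp=5_000_000):
    """Greedy pairwise LD pruning for SNPs on one chromosome.

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts, no missing data.
    positions: base-pair positions of the SNPs, sorted in ascending order.
    Returns the indices of the SNPs that are kept.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions)
    n_snps = genotypes.shape[1]
    removed = np.zeros(n_snps, dtype=bool)
    for i in range(n_snps):
        if removed[i]:
            continue
        for j in range(i + 1, n_snps):
            if positions[j] - positions[i] > window_bp:
                break  # positions are sorted, so later SNPs are farther away
            if removed[j]:
                continue
            r = np.corrcoef(genotypes[:, i], genotypes[:, j])[0, 1]
            if r * r > r2_max:
                removed[j] = True  # assumption: drop the later SNP of the pair
    return np.flatnonzero(~removed)


# Toy data: SNP 1 duplicates SNP 0 nearby, SNP 2 is uncorrelated with SNP 0,
# and SNP 3 duplicates SNP 0 but lies more than 5 Mb from every other SNP.
G = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [2, 2, 0, 2],
    [0, 0, 2, 0],
    [1, 1, 2, 1],
    [2, 2, 2, 2],
])
pos = np.array([100, 200, 300, 6_000_100])
kept = ld_prune(G, pos)
print(kept)
```

In the toy data only SNP 1 is removed: it is perfectly correlated with SNP 0 and lies within the window, whereas SNP 3 is never compared with the others because it falls outside the 5 Mb window.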

For the initial individual admixture analyses, a set of 43988 SNPs, which were common between the HGDP dataset and our set of 144799 high quality SNPs, was used. A set of 38301 SNPs (used for the global PCA run), remaining after LD pruning and removal of SNPs located in regions with structural variations, was also used for admixture analyses. We found very minimal difference in admixture estimates obtained from the two types of analyses.

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-report race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority, the second continent was South Asia. Hence, this likely does not reflect recent admixture, but rather the genetic similarity of West Asians and South Asians. For a moderate proportion, the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent; a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian, or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups, and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for nearly all of those with genetic ancestry from more than one continent, European/West Asian is one of them, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: for the large majority of those with a single continental genetic ancestry, that ancestry is European/West Asian. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.


Table S1. Magnitude of correlation of PC loadings for three supersets

PC    Set1-Set2 correlation    Set1-Set3 correlation    Set2-Set3 correlation
1     0.981                    0.979                    0.980
2     0.876                    0.853                    0.856
3     0.850                    0.816                    0.809
4     0.766                    0.776                    0.752
5     0.658                    0.654                    0.639
6     0.605                    0.604                    0.592
7     0.001                    0.001                    0.210
8     0.043                    0.179                    0.045
9     0.312                    0.068                    0.089
10    0.007                    0.067                    0.201


Table S2. Self-reported race/ethnicity versus genotyping array. Race/ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian.

Race/Ethnicity Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272


Page 22: Characterizing Race/Ethnicity and Genetic Ancestry … › ... › genetics › 200 › 4 › 1285.full.pdfSelf-reported African Americans and Latinos show ed extensive European and

Y Banda et al 11 SI

File S1

Supplementary Methods and Results

Adjudication of RaceEthnicity Information

As described in Methods PC and admixture analyses were performed on the genotype data to characterize the genetic ancestry

of the GERA cohort In so doing we discovered a large number of individuals (2140) run on the AFR array who were estimated

to have 100 EuropeanWest Asian genetic ancestry and a smaller number (123) to have 100 East Asian genetic ancestry A

small number (276) of individuals run on the EAS array were estimated to have 100 European ancestry This led to further

investigation of the self‐report assignments for these individuals

Direct examination of the original survey forms for these subjects revealed that the self‐reported raceethnicity information on

the form was not consistent with the computerized information for some individuals Further investigation revealed that an

artifact had occurred when the forms were originally scanned due to a black mark on the glass scanning plate overlaying the

box for African American which led to a number of subjects being recorded as having African American raceethnicity (in

addition to another raceethnicity) when in fact they did not (according to the original form) By our algorithm for array

assignment because these individuals were assumed to have some African ancestry they were assigned to the AFR array

These individuals were then adjudicated to raceethnicity categories based on the actual survey information supplemented by

other data on raceethnicity in the KP administrative databases as necessary and principal component scores were

recalculated This error only affected individuals with originally recorded African American raceethnicity from the

computerized scanning The individuals with 100 EuropeanWest Asian genetic ancestry that were run on the EAS array had

no scanning errors and hence were not adjudicated Reviewing the raceethnicity self‐report of these individuals the large

majority self‐reported both East Asian and European raceethnicity

Table S2 displays the relationship between self‐reported raceethnicity and genotyping array for those with self‐report

raceethnicity data that was not missing or adjudicated due to scanning errors Because of the time element required for

processing samples the array assignments were made based on raw data from the raceethnicity questions on the survey prior

to data cleaning After data cleaning we noticed that some individuals were assigned to arrays based on the raw information

that was not consistent with their final raceethnicity categorization and the assignment algorithm As can be seen in Table S2

this primarily affected a modest number of individuals with a final assignment of mixed raceethnicity who were run on the EUR

array Table S3 shows array assignments for individuals with scanning errors or missing self‐report raceethnicity data whose

12 SI Y Banda et al

final raceethnicity was determined from KPNC administrative databases In this group are the 2140 Whites and 123 Asians

with scanning errors who were run on the AFR array We also note a moderate number of individuals classified as White (267)

who were run on the EAS array

Finally we emphasize that no raceethnicity assignments or re‐assignments were based on genetic information The genetic

information was only used in the detection of the scanning problems for some individuals by comparing their computerized

responses to those on the original forms All self‐report raceethnicity data reported in Results are based on the final cleaned

and adjudicated categories

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (eg those in the lactase and MHC regions on chromosomes 2 and

6 respectively) pairs of SNPs that had an r2 greater than 05 and within 5 MB of each other were considered and one member

of the pair removed Also removed were SNPs located in regions with inversions such as chromosomes 8p23 and 17q21 These

structural variations have previously been shown to influence PCA results in European ancestry samples As reported in Results

various PCA runs were performed separately for individuals genotyped on different arrays The numbers of SNPs remaining

after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4

For the initial individual admixture analyses a set of 43988 SNPs which were common between the HGDP dataset and our set

of 144799 high quality SNPs was used A set of 38301 SNPs (used for the global PCA run) remaining after LD pruning and

removal of SNPs located in regions with structural variations was also used for admixture analyses We found very minimal

difference in admixture estimates obtained from the two types of analyses

Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Table S8 provides a more detailed examination of the distribution of continental genetic ancestry for the various self-reported race/ethnicity groups. Among those reporting European/West Asian race/ethnicity, 5.6% had evidence of genetic ancestry from two continents; however, for the large majority (75%) the second continent was South Asia. Hence, this likely does not reflect recent admixture, but rather the genetic similarity of West Asians and South Asians. For a moderate proportion (15%), the second genetic ancestry is Native American. By contrast, among the self-reported Africans/African Americans, 88.2% have evidence of genetic ancestry from two continents. This represents the European/West Asian genetic admixture present in most African Americans that has occurred over 5 centuries. For those with a single genetic ancestry, that ancestry is African. As expected, nearly all self-reported Latinos have genetic ancestry from more than one continent: a substantial proportion (29.9%) have genetic ancestry from 3 continents (European/West Asian, Native American, and African), while the majority (65.2%) have genetic ancestry from two continents, European/West Asian and Native American. The large majority of self-reported East Asians have genetic ancestry that is solely East Asian or East Asian and Pacific Islander. The latter combination primarily reflects the close genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture, although that likely applies to some. We do note that for a modest number of individuals, the second continental ancestry is European/West Asian. Similarly, for self-reported South Asians, the large percentage (38.1%) corresponding to two continents primarily reflects genetic similarity of West Asians and South Asians in the admixture analysis. Among those self-reporting only Native American race/ethnicity, 82.3% have a single genetic ancestry, which is European/West Asian, although 17.7% have genetic ancestry from two or more continents, which are European/West Asian and Native American.

Those who self-reported more than one race/ethnicity comprise multiple combinations. Among those reporting two, 51% have a single genetic ancestry, which is nearly always European/West Asian. Similarly, for those with genetic ancestry from more than one continent, European/West Asian is one of them for nearly all, with the second continental group being Native American (60%), East Asian (22%), or African (13%). The pattern is similar for those with genetic ancestry from three continents. The pattern is also closely reproduced in those reporting 3 race/ethnicity categories: for the large majority of those with a single continental genetic ancestry, that ancestry is European/West Asian. For those with two or more continental genetic ancestries, European/West Asian is nearly always one of them, but in this case African ancestry is more prominent than East Asian ancestry as the second continent.
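The percentages quoted in this section follow directly from the counts in Table S8. The short Python sketch below reproduces them for three of the self-reported groups. It is an illustration only: the function name `continent_percentages` and the dictionary layout are hypothetical and are not part of the original analysis.

```python
def continent_percentages(one, two, three):
    """Percent of individuals with genetic ancestry from one, two, and
    three continents, each rounded to one decimal place."""
    total = one + two + three
    return tuple(round(100.0 * n / total, 1) for n in (one, two, three))


# Counts copied from Table S8: (one continent, two continents, three continents).
table_s8_counts = {
    "EW": (71992, 4288, 121),
    "AA": (230, 2362, 87),
    "LT": (235, 3135, 1437),
}

percentages = {
    group: continent_percentages(*counts)
    for group, counts in table_s8_counts.items()
}

for group, pct in percentages.items():
    print(group, pct)
```

The two-continent values (5.6, 88.2, and 65.2) and the three-continent value for Latinos (29.9) match the figures cited in the text.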


Table S1 Magnitude of correlation of PC loadings for three supersets

PC    Set1-Set2 correlation    Set1-Set3 correlation    Set2-Set3 correlation
1     0.981                    0.979                    0.980
2     0.876                    0.853                    0.856
3     0.850                    0.816                    0.809
4     0.766                    0.776                    0.752
5     0.658                    0.654                    0.639
6     0.605                    0.604                    0.592
7     0.001                    0.001                    0.210
8     0.043                    0.179                    0.045
9     0.312                    0.068                    0.089
10    0.007                    0.067                    0.201
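Table S1 reports the magnitude of each correlation because the sign of a principal component is arbitrary: two runs can recover the same axis with opposite orientation. A minimal sketch of such a comparison is given below. It assumes that each superset yields a loading matrix over the same SNPs in the same order; the function name and the synthetic data are hypothetical, and how the supersets were built is not described in this section.

```python
import numpy as np


def loading_correlation_magnitudes(loadings_a, loadings_b):
    """Absolute Pearson correlation between matching PC loading vectors.

    Both inputs have shape (n_snps, n_pcs) and must list the same SNPs
    in the same order (an assumption of this sketch).
    """
    a = np.asarray(loadings_a, dtype=float)
    b = np.asarray(loadings_b, dtype=float)
    magnitudes = []
    for k in range(a.shape[1]):
        r = np.corrcoef(a[:, k], b[:, k])[0, 1]
        magnitudes.append(abs(r))  # the sign of a PC carries no meaning
    return np.array(magnitudes)


# Synthetic illustration: PC1 is sign-flipped, PC2 is identical, PC3 is unrelated.
rng = np.random.default_rng(1)
set1 = rng.normal(size=(500, 3))
set2 = set1.copy()
set2[:, 0] *= -1.0
set2[:, 2] = rng.normal(size=500)

magnitudes = loading_correlation_magnitudes(set1, set2)
print(magnitudes)
```

In the synthetic example the first two magnitudes are 1 (up to floating-point error) despite the sign flip, while the third is close to 0, which mirrors the drop seen after PC 6 in Table S1.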


Table S2 Self-reported Race/Ethnicity versus Genotyping Array. Race/Ethnicity abbreviations: EW = European/West Asian; AA = African/African American/Afro-Caribbean; EA = East Asian; NA = Native American; LT = Latino; PI = Pacific Islander; SA = South Asian. Concatenated codes (e.g., EWAA) denote individuals who reported more than one category.

Genotyping Array

Race/Ethnicity EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272


Table S3 Race/Ethnicity derived from KP administrative databases versus genotyping array used, for those with missing or mis-scanned self-report data

Genotyping Array

Race/Ethnicity      EUR     EAS     AFR     LAT
White               2115    267     2140    55
African American    0       0       99      3
Hispanic            5       0       0       255
Asian               5       183     123     1
Other/Uncertain     6       11      43      24
Total               2131    461     2405    338


Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run                               Number of SNPs after pruning
European/West Asian and Ashkenazi     38050
European only                         37967
South Asian                           38823
East Asian                            38077
Pacific Islander                      38833
African American                      39849
Latino                                38365
All GERA                              38301


Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

Ancestral admixture proportion (%)

Nationality               African        European       Native American
Mexican                   2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central-South American    5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto-Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                  4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8


Table S6 Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Genetic Ancestry

Race-Ethnicity EW AF EA NA SA EWAF EWEA EWNA EWSA AFEA AFSA EANA EAPI EASA PISA EWAFEA EWAFNA EWAFSA EWEANA EWEAPI EWEASA EWNASA EWPISA AFEAPI AFEASA AFPISA EAPISA All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114


AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8


EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489
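The assignment rule stated in the Table S6 caption (a genetic ancestry is assigned to an individual when at least 5% of that individual's ancestry is estimated from the group) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the function name, the dictionary input, and the ordering of codes (chosen here only so that the combined labels match the Table S6 column labels) are all assumptions.

```python
# Order used to build combined labels such as "EWAFNA". It is an assumption
# chosen to be compatible with the column labels of Table S6.
ANCESTRY_ORDER = ("EW", "AF", "EA", "NA", "PI", "SA")


def assign_genetic_ancestry(proportions, threshold=0.05):
    """Concatenate the codes of every ancestry whose estimated proportion
    is at least `threshold` (5% by default)."""
    return "".join(
        code for code in ANCESTRY_ORDER
        if proportions.get(code, 0.0) >= threshold
    )


# Hypothetical individuals with made-up admixture estimates.
label_three = assign_genetic_ancestry({"EW": 0.62, "NA": 0.30, "AF": 0.08})
label_one = assign_genetic_ancestry({"EW": 0.97, "SA": 0.03})

print(label_three)
print(label_one)
```

Under this rule the second individual is labeled with a single ancestry, because the 3% component falls below the threshold.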


Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6

Race-Ethnicity EW AF EA SA EWAF EANA EAPI EWEA EWNA EWSA EASA SAPI EWAFEA EWAFNA EWAFSA EWEANA EWEAPI EWEASA EWSANA EWSAPI EASAPI Total

White 4302 0 0 0 25 0 0 33 68 131 0 0 0 3 2 3 2 3 2 1 0 4575

Afr Am 0 6 0 0 92 0 0 0 1 0 0 0 0 1 2 0 0 0 0 0 0 102

Asian 4 0 218 8 0 1 41 18 0 2 3 1 1 0 0 1 4 3 0 0 6 311

Latino 37 0 3 0 3 0 0 2 151 0 0 0 1 44 1 5 0 0 8 0 0 255

Other-uncertain 34 0 3 0 6 0 0 15 8 2 1 0 2 2 1 3 4 0 1 0 2 84


Table S8 Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Entries under One, Two, and Three Continents are Number (%). "Reported" is the number of race/ethnicity categories self-reported. In the Continental Distribution column, "1:", "2:", and "3:" give the most common genetic ancestries among those with one, two, and three continents, respectively.

Reported  Group  One Continent   Two Continents   Three Continents  All     % of Total  Continental Distribution
One       EW     71992 (94.2%)   4288 (5.6%)      121 (0.2%)        76401               1: 100% EW; 2: 75% EWSA, 15% EWNA
One       AA     230 (8.6%)      2362 (88.2%)     87 (3.2%)         2679                1: 100% AF; 2: 99% AFEW
One       LT     235 (4.9%)      3135 (65.2%)     1437 (29.9%)      4807                1: 99% EW; 2: 99% EWNA; 3: 90% EWNAAF
One       EA     4837 (75.7%)    1419 (22.2%)     133 (2.1%)        6389                1: 100% EA; 2: 89% EAPI, 8% EAEW
One       PI     7 (7.6%)        40 (43.5%)       45 (48.9%)        92                  2: 48% PIEW, 33% PIEA; 3: 71% PIEWEA
One       SA     272 (59.1%)     175 (38.1%)      13 (2.8%)         460                 1: 94% SA; 2: 75% SAEW, 14% SAEA
One       NA     555 (82.3%)     87 (12.9%)       32 (4.8%)         674                 1: 100% EW; 2: 77% EWNA
One       All    78128 (85.4%)   11506 (12.6%)    1868 (2.0%)       91502   93.9
Two       All    2791 (51.0%)    2140 (39.1%)     545 (10.0%)       5476    5.6         1: 97% EW; 2: 60% EWNA, 22% EWEA, 13% EWAF; 3: 29% EWNAAF, 23% EWSANA, 17% EWEANA
Three     All    48 (9.4%)       345 (67.5%)      118 (23.1%)       511     0.5         1: 90% EW; 2: 45% EWNA, 36% EWAF, 19% EWEA; 3: 31% EWEANA, 20% EWAFNA, 16% EWSANA
All       All    80967 (83.1%)   13991 (14.4%)    2531 (2.6%)       97489


Table S9 First-degree relatives organized by self-reported race/ethnicity

MZ pairs

                    White    African American    Latino    Asian    Other/Uncertain
White               28       0                   0         0        0
African American    0        0                   0         0        0
Latino              0        0                   2         0        0
Asian               0        0                   0         4        0
Other/Uncertain     0        0                   0         0        0

Parent (column) - Offspring (row)

                    White    African American    Latino    Asian    Other/Uncertain
White               3044     1                   23        6        21
African American    8        54                  0         1        1
Latino              122      6                   175       3        0
Asian               35       2                   0         205      1
Other/Uncertain     32       0                   1         0        0

Full sibs

                    White    African American    Latino    Asian    Other/Uncertain
White               1531     2                   33        5        30
African American             45                  7         0        0
Latino                                           155       2        2
Asian                                                      205      1
Other/Uncertain                                                     0
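The concordance rates cited in the abstract (93% for parent-child pairs and 96% for full-sib pairs) can be recovered from Table S9 by dividing the diagonal counts by the total number of pairs. The Python sketch below shows the arithmetic. The variable and function names are hypothetical, and treating the diagonal of the five collapsed categories as "concordant" is an assumption of this illustration.

```python
# Parent (column) by offspring (row) counts from Table S9.
parent_offspring = [
    [3044, 1, 23, 6, 21],   # White
    [8, 54, 0, 1, 1],       # African American
    [122, 6, 175, 3, 0],    # Latino
    [35, 2, 0, 205, 1],     # Asian
    [32, 0, 1, 0, 0],       # Other/Uncertain
]

# Full-sib counts from Table S9. Pairs are unordered, so only the upper
# triangle is filled; the empty cells are entered as 0.
full_sibs = [
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
]


def concordance(matrix):
    """Return (concordant pairs, total pairs, concordance rate)."""
    total = sum(sum(row) for row in matrix)
    concordant = sum(matrix[i][i] for i in range(len(matrix)))
    return concordant, total, concordant / total


po_concordant, po_total, po_rate = concordance(parent_offspring)
sib_concordant, sib_total, sib_rate = concordance(full_sibs)

print(po_concordant, po_total, round(100 * po_rate))
print(sib_concordant, sib_total, round(100 * sib_rate))
```

The totals (3741 parent-child pairs and 2018 full-sib pairs) agree with the numbers of genetically identified pairs given in the abstract.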


Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6

Sib 1 Race/Ethnicity   Sib 2 Race/Ethnicity   Genetic Ancestry   Number

Both Sibs Self-Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self-Report, One Sib EHR

EW Latino EW 1

EWLT Other/Uncertain EW 1

LT White EW 1

LTEA Other/Uncertain EWEAPI 1

EAPI Other/Uncertain EA 1

Both Sibs EHR

White Latino EWNA 1


Table S11 Race/Ethnicity and Genetic Ancestry for Parent-Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6

Parent Race/Ethnicity   Child Race/Ethnicity   Parent Genetic Ancestry   Child Genetic Ancestry   Number

Parent and Child Self-Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1


EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self-Report, Child EHR

EW Other/Uncertain EW EW 1

EW Other/Uncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT Other/Uncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR, Child Self-Report

Latino EW EW EW 1

Other/Uncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

Other/Uncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1


final race/ethnicity was determined from KPNC administrative databases. In this group are the 2,140 Whites and 123 Asians with scanning errors who were run on the AFR array. We also note a moderate number of individuals classified as White (267) who were run on the EAS array.

Finally, we emphasize that no race/ethnicity assignments or re-assignments were based on genetic information. The genetic information was used only to detect the scanning problems for some individuals, by comparing their computerized responses to those on the original forms. All self-report race/ethnicity data reported in Results are based on the final cleaned and adjudicated categories.

Numbers of pruned SNPs for various PCAs and Individual Admixture Estimation

To reduce the linkage disequilibrium (LD) between markers (e.g., those in the lactase and MHC regions on chromosomes 2 and 6, respectively), pairs of SNPs that had an r² greater than 0.5 and were within 5 Mb of each other were considered, and one member of the pair was removed. Also removed were SNPs located in regions with inversions, such as chromosomes 8p23 and 17q21. These structural variations have previously been shown to influence PCA results in European ancestry samples. As reported in Results, various PCA runs were performed separately for individuals genotyped on different arrays. The numbers of SNPs remaining after LD and structural variation loci pruning for each of the eight different PCA runs are shown in Table S4.
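The pairwise pruning rule described above can be sketched in a few lines of Python. This is a minimal greedy illustration under stated assumptions, not the authors' pipeline: the text does not say which software was used or which member of a pair was dropped, so here the later SNP is removed, r² is taken as the squared Pearson correlation of allele counts, and the function name `ld_prune` and its parameters are hypothetical.

```python
import numpy as np


def ld_prune(genotypes, chromosomes, positions, r2_max=0.5, window_bp=5_000_000):
    """Greedy pairwise LD pruning.

    genotypes: (n_individuals, n_snps) allele counts (0, 1, 2), no missing
        values, every SNP polymorphic (assumptions of this sketch).
    chromosomes, positions: length n_snps, sorted by chromosome and position.
    Returns the indices of the SNPs that are kept.
    """
    g = np.asarray(genotypes, dtype=float)
    chrom = np.asarray(chromosomes)
    pos = np.asarray(positions)
    n_snps = g.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for i in range(n_snps):
        if not keep[i]:
            continue
        for j in range(i + 1, n_snps):
            # Sorted input: once outside the window, no later SNP can be inside it.
            if chrom[j] != chrom[i] or pos[j] - pos[i] > window_bp:
                break
            if not keep[j]:
                continue
            r = np.corrcoef(g[:, i], g[:, j])[0, 1]
            if r * r > r2_max:
                keep[j] = False  # drop the second member of the pair
    return np.flatnonzero(keep)


# Synthetic illustration: SNP 1 duplicates SNP 0 and lies 0.5 Mb away;
# SNP 2 is on a different chromosome.
rng = np.random.default_rng(0)
snp_a = rng.integers(0, 3, size=200)
snp_c = rng.integers(0, 3, size=200)
genotypes = np.column_stack([snp_a, snp_a, snp_c])

kept = ld_prune(
    genotypes,
    chromosomes=[2, 2, 6],
    positions=[1_000_000, 1_500_000, 30_000_000],
)
print(kept.tolist())
```

A production analysis would normally rely on an established tool and would also need the separate exclusion of inversion regions such as 8p23 and 17q21, which this sketch does not cover.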


Page 24: Characterizing Race/Ethnicity and Genetic Ancestry … › ... › genetics › 200 › 4 › 1285.full.pdfSelf-reported African Americans and Latinos show ed extensive European and

Y Banda et al 13 SI

ancestry from two continents This represents the EuropeanWest Asian genetic admixture present in most African Americans

that has occurred over 5 centuries For those with a single genetic ancestry that ancestry is African As expected nearly all

self‐reported Latinos have genetic ancestry from more than one continent a substantial proportion (299) have genetic

ancestry from 3 continents ndash EuropeanWest Asian Native American and African while the majority (652) have genetic

ancestry from two continents EuropeanWest Asian and Native American The large majority of self‐reported East Asians have

genetic ancestry that is solely East Asian or East Asian and Pacific Islander The latter combination primarily reflects the close

genetic relatedness of East Asians to some Pacific Islander groups and not necessarily recent admixture although that likely

applies to some We do note that for a modest number of individuals the second continental ancestry is EuropeanWest Asian

Similarly for self‐reported South Asians the large percentage (381) corresponding to two continents primarily reflects

genetic similarity of West Asians and South Asians in the admixture analysis Among those self‐reporting only Native American

raceethnicity 823 have a single genetic ancestry which is EuropeanWest Asian although 177 have genetic ancestry from

two or more continents which are EuropeanWest Asian and Native American

Those who self‐reported more than one raceethnicity comprise multiple combinations Among those reporting two 51 have

a single genetic ancestry which is nearly always EuropeanWest Asian Similarly for those with genetic ancestry from more

than one continent for nearly all EuropeanWest Asian is one of them with the second continental group being Native

American (60) East Asian (22) or African (13) The pattern is similar for those with genetic ancestry from three continents

The pattern is also closely reproduced in those reporting 3 raceethnicity categories the large majority with a single continental

genetic ancestry reflects EuropeanWest Asian genetic ancestry For those with two or more continental genetic ancestries

EuropeanWest Asian is nearly always one of them but in this case African ancestry is more prominent than East Asian

ancestry as the second continent

14 SI Y Banda et al

Table S1 Magnitude of correlation of PC loadings for three supersets

PC Set1‐Set2 correlation Set1‐Set3 correlation Set2‐Set3 correlation

1 0981 0979 0980

2 0876 0853 0856

3 0850 0816 0809

4 0766 0776 0752

5 0658 0654 0639

6 0605 0604 0592

7 0001 0001 0210

8 0043 0179 0045

9 0312 0068 0089

10 0007 0067 0201

Y Banda et al 15 SI

Table S2 Self‐reported RaceEthnicity versus Genotyping Array RaceEthnicity abbreviations EW = EuropeanWest Asian AA = AfricanAfrican AmericaAfro‐Caribbean EA = East Asian NA = Native American LT = Latino PI = Pacific Islander SA = South Asian

RaceEthnicity Genotyping Array

EUR EAS AFR LAT

EW 76403 0 0 1

AA 0 0 2682 0

EA 0 6394 0 0

NA 0 0 0 674

LT 0 0 0 4855

PI 0 92 0 0

SA 461 0 0 0

EWAA 0 0 123 0

EWEA 38 538 0 0

EWNA 91 0 0 2463

EWLT 12 0 0 1561

EWPI 8 45 0 0

EWSA 44 0 0 0

AAEA 0 0 32 0

AANA 0 0 100 0

AALT 0 0 3 113

AAPI 0 0 4 0

AASA 0 0 12 0

EANA 0 7 0 0

EALT 0 2 0 108

EAPI 0 40 0 0

EASA 2 15 0 0

NALT 0 0 0 130

NAPI 0 2 0 0

LTPI 0 0 0 14

LTSA 0 0 0 8

PISA 0 10 0 0

EWAAEA 0 0 6 0

EWAANA 0 0 115 0

EWAALT 0 0 0 23

EWAAPI 0 0 1 0

EWAASA 0 0 2 0

EWEANA 2 0 0 30

EWEALT 0 1 0 52

EWEAPI 0 36 0 0

EWNALT 0 0 0 199

EWNAPI 0 0 0 2

EWLTPI 2 0 0 6

EWLTSA 0 0 0 4

EWPISA 0 1 0 0

AAEANA 0 0 8 0

AAEALT 0 0 0 8

AAEAPI 0 0 1 0

AAEASA 0 0 3 0

AANALT 0 0 1 1

AALTSA 0 0 1 6

EANALT 0 0 0 9

EALTPI 0 0 0 2

EAPISA 0 1 0 0

NALTPI 0 0 0 3

Total 77063 7184 3094 10272

16 SI Y Banda et al

Table S3 RaceEthnicity derived from KP administrative databases versus genotyping array used for those with missing or mis‐scanned self‐report data

RaceEthnicity Genotyping Array

EUR EAS AFR LAT

White 2115 267 2140 55

African American 0 0 99 3

Hispanic 5 0 0 255

Asian 5 183 123 1

OtherUncertain 6 11 43 24

Total 2131 461 2405 338

Y Banda et al 17 SI

Table S4 Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs

PCA run Number of SNPs after pruning

EuropeanWest Asian and Ashkenazi 38050

European only 37967

South Asian 38823

East Asian 38077

Pacific Islander 38833

African American 39849

Latino 38365

All GERA 38301

18 SI Y Banda et al

Table S5 Individual ancestral admixture proportions for subjects run on the LAT array

                          Ancestral admixture proportion (%)
Nationality               African        European       Native American
Mexican                   2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central-South American    5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto-Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                  4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8

Table S6 Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/Ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Race-Ethnicity   Genetic Ancestry

EW AF EA NA SA EWAF EWEA EWNA EWSA AFEA AFSA EANA EAPI EASA PISA EWAFEA EWAFNA EWAFSA EWEANA EWEAPI EWEASA EWNASA EWPISA AFEAPI AFEASA AFPISA EAPISA All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114

AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8

EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489
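
The 5% assignment rule stated in the Table S6 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `assign_ancestry`, the input format (a mapping from ancestry code to estimated proportion), and the fixed ordering of codes are all hypothetical choices made for the example.

```python
# Hypothetical helper illustrating the 5% rule from the Table S6 caption.
# The ordering of codes is an assumption chosen to resemble the column labels.
ANCESTRY_ORDER = ["EW", "AF", "EA", "NA", "PI", "SA"]
THRESHOLD = 0.05  # "at least 5%" of the individual's estimated ancestry


def assign_ancestry(proportions):
    """Return the concatenated codes of every ancestry at or above the threshold."""
    kept = [code for code in ANCESTRY_ORDER
            if proportions.get(code, 0.0) >= THRESHOLD]
    return "".join(kept)


# Made-up proportions: 90% EW, 7% NA, 3% AF (AF falls below the threshold).
example = {"EW": 0.90, "NA": 0.07, "AF": 0.03}
print(assign_ancestry(example))  # EWNA
```

Under this rule an individual with small contributions below 5% from a group is not labeled with that group, which is why most self-reported EW subjects fall in the single EW column.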

Table S7 Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race-Ethnicity      EW  AF   EA  SA  EWAF  EANA  EAPI  EWEA  EWNA  EWSA  EASA  SAPI  EWAFEA  EWAFNA  EWAFSA  EWEANA  EWEAPI  EWEASA  EWSANA  EWSAPI  EASAPI  Total
White             4302   0    0   0    25     0     0    33    68   131     0     0       0       3       2       3       2       3       2       1       0   4575
Afr Am               0   6    0   0    92     0     0     0     1     0     0     0       0       1       2       0       0       0       0       0       0    102
Asian                4   0  218   8     0     1    41    18     0     2     3     1       1       0       0       1       4       3       0       0       6    311
Latino              37   0    3   0     3     0     0     2   151     0     0     0       1      44       1       5       0       0       8       0       0    255
Other-uncertain     34   0    3   0     6     0     0    15     8     2     1     0       2       2       1       3       4       0       1       0       2     84

Table S8 Distribution of continental genetic ancestry as a function of self-reported race/ethnicity

Genetic Ancestry: numbers of subjects

Race-Ethnicity    One Continent    Two Continents    Three Continents    All
                  Number (%)       Number (%)        Number (%)          Number    % of Total
One    EW         71992 (94.2)      4288 (5.6)         121 (0.2)         76401
       AA           230 (8.6)       2362 (88.2)         87 (3.2)          2679
       LT           235 (4.9)       3135 (65.2)       1437 (29.9)         4807
       EA          4837 (75.7)      1419 (22.2)        133 (2.1)          6389
       PI             7 (7.6)         40 (43.5)         45 (48.9)           92
       SA           272 (59.1)       175 (38.1)         13 (2.8)           460
       NA           555 (82.3)        87 (12.9)         32 (4.8)           674
       All        78128 (85.4)     11506 (12.6)       1868 (2.0)         91502     93.9
Two    All         2791 (51.0)      2140 (39.1)        545 (10.0)         5476      5.6
Three  All           48 (9.4)        345 (67.5)        118 (23.1)          511      0.5
All    All        80967 (83.1)     13991 (14.4)       2531 (2.6)         97489

Continental Distribution

Race-Ethnicity    1 Continent    2 Continents                        3 Continents
One    EW         100% EW        75% EWSA, 15% EWNA
       AA         100% AF        99% AFEW
       LT         99% EW         99% EWNA                            90% EWNAAF
       EA         100% EA        89% EAPI, 8% EAEW
       PI                        48% PIEW, 33% PIEA                  71% PIEWEA
       SA         94% SA         75% SAEW, 14% SAEA
       NA         100% EW        77% EWNA
Two    All        97% EW         60% EWNA, 22% EWEA, 13% EWAF        29% EWNAAF, 23% EWSANA, 17% EWEANA
Three  All        90% EW         45% EWNA, 36% EWAF, 19% EWEA        31% EWEANA, 20% EWAFNA, 16% EWSANA
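
The percentages in Table S8 are row shares of the counts. As a small worked check, the sketch below recomputes the shares for the final "All / All" row; the dictionary keys and variable names are illustrative assumptions, and only the three counts come from the table.

```python
# Counts from the "All / All" row of Table S8.
counts = {
    "one continent": 80967,
    "two continents": 13991,
    "three continents": 2531,
}
total = sum(counts.values())

# Percentage of all subjects in each class, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Share of subjects with genetic ancestry from more than one continent.
multi = round(
    100 * (counts["two continents"] + counts["three continents"]) / total, 1
)

print(total)   # 97489
print(shares)  # {'one continent': 83.1, 'two continents': 14.4, 'three continents': 2.6}
print(multi)   # 16.9
```

The combined two- and three-continent share of about 17% is consistent with the figure quoted in the abstract.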

Table S9 First-degree relatives organized by self-reported race/ethnicity

MZ pair

                    White   African American   Latino   Asian   Other/Uncertain
White                  28                  0        0       0                 0
African American        0                  0        0       0                 0
Latino                  0                  0        2       0                 0
Asian                   0                  0        0       4                 0
Other/Uncertain         0                  0        0       0                 0

Parent (column) – Offspring (row)

                    White   African American   Latino   Asian   Other/Uncertain
White                3044                  1       23       6                21
African American        8                 54        0       1                 1
Latino                122                  6      175       3                 0
Asian                  35                  2        0     205                 1
Other/Uncertain        32                  0        1       0                 0

Full sibs

                    White   African American   Latino   Asian   Other/Uncertain
White                1531                  2       33       5                30
African American                          45        7       0                 0
Latino                                            155       2                 2
Asian                                                     205                 1
Other/Uncertain                                                               0
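
The concordance rates quoted in the abstract (93% for parent-child pairs, 96% for full-sib pairs) can be recovered from the Table S9 matrices as the diagonal share of each matrix. The sketch below is only an illustrative check; the function name `concordance` and the list-of-lists layout are assumptions, and the full-sib matrix is entered as an upper triangle with zeros below the diagonal because each unordered pair is counted once.

```python
# Rows and columns: White, African American, Latino, Asian, Other/Uncertain.
parent_offspring = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

# Upper triangle only; cells below the diagonal are blank in the table.
full_sibs = [
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
]


def concordance(matrix):
    """Return (concordant pairs, total pairs, concordant fraction)."""
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    return diagonal, total, diagonal / total


for name, matrix in [("parent-child", parent_offspring), ("full sibs", full_sibs)]:
    diagonal, total, fraction = concordance(matrix)
    print(f"{name}: {diagonal}/{total} = {100 * fraction:.0f}%")
# parent-child: 3478/3741 = 93%
# full sibs: 1936/2018 = 96%
```

The totals of 3741 and 2018 pairs also match the pair counts given in the abstract.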

Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1            Sib 2            Genetic
Race/Ethnicity   Race/Ethnicity   Ancestry  Number

Both Sibs Self-Report
EW               NA               EW        26
EW               LT               EW        6
EW               LT               EWNA      1
EW               AA               EWAF      1
EW               EWLT             EW        15
EW               EWLT             EWNA      6
EW               EWLT             EWAFNA    1
EW               EWAALT           EWAF      1
EW               EWEA             EW        3
EW               EWEA             EWEA      1
EW               EWSA             EW        1
EWNA             NA               EW        1
EWNA             NA               EWNA      1
EWNA             NA               EWAF      1
EWNA             EWNALT           EW        1
EWLT             NA               EW        1
EWLT             AALTSA           EWAF      1
EWLT             EWAANALT         EWAF      1
EWEA             LT               EWNA      1
EWAANALTSA       LT               EWNA      1
EWLTPI           PI               EWEAPI    1
LT               AALT             EWNA      3
LT               AALT             EWAFNA    1

One Sib Self-Report, One Sib EHR
EW               Latino           EW        1
EWLT             Other/Uncertain  EW        1
LT               White            EW        1
LTEA             Other/Uncertain  EWEAPI    1
EAPI             Other/Uncertain  EA        1

Both Sibs EHR
White            Latino           EWNA      1

Table S11 Race/Ethnicity and Genetic Ancestry for Parent-Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent           Child            Parent    Child
Race/Ethnicity   Race/Ethnicity   Genetic   Genetic
                                  Ancestry  Ancestry  Number

Parent and Child Self-Report
EW               EWAA             EW        EWAF      4
EW               EWAANA           EW        EWAF      1
EW               EWEA             EW        EW        3
EW               EWEA             EW        EWEA      19
EW               EWEA             EW        EWEASA    3
EW               EWEA             EWSA      EW        1
EW               EWEA             EWSA      EWEA      1
EW               EWLT             EW        EW        18
EW               EWLT             EW        EWNA      41
EW               EWLT             EW        EWSA      2
EW               EWLT             EW        EWNASA    2
EW               EWLT             EWNA      EW        2
EW               EWLT             EWNA      EWNA      5
EW               EWLT             EWNA      EWNA      1
EW               EWLT             EWSA      EWNA      1
EW               EWLT             EWSA      EWNASA    1
EW               EWLT             EW        EWEA      1
EW               EWLT             EW        EWEANA    1
EW               EWLT             EWNA      EWAFEA    1
EW               EWEALT           EW        EWEA      1
EW               EWEALT           EW        EWEANA    1
EW               EWEANA           EW        EWEA      1
EW               EWEANALT         EW        EWEA      1
EW               EWEAPI           EW        EWEAPI    1
EW               EWNALT           EW        EWAF      1
EW               EWNALT           EW        EWNA      2
EW               EWNALT           EW        EWNASA    1
EW               EWPI             EW        EWEA      1
EW               EWPI             EW        EWEAPI    1
EW               AA               EW        EWAF      1
EW               AALT             EW        EWNA      1
EW               EA               EW        EWEA      1
EW               LT               EW        EW        6
EW               LT               EW        EWNA      9
EW               LT               EW        EWNASA    2
EW               LT               EWNA      EWNA      3
EW               LT               EW        EWAFNA    1
EW               NA               EW        EW        24
EW               NA               EWSA      EW        1
EW               NA               EWSA      EWSA      1
EWAA             EWAA             EW        EWAF      1
EWEA             EW               EW        EW        3
EWLT             EW               EW        EW        6
EWLT             EW               EW        EWNA      1
EWLT             EW               EWEA      EWEA      1
EWLT             EW               EWNA      EW        3
EWLT             EW               EWNA      EWNA      1
EWEA             EWLT             EW        EW        1
EWEA             EW               EWEA      EWEA      1
EWEA             EWAAPI           EWEAPI    EWAFEA    1
EWNA             NA               EWNA      EWNA      1
EWNA             EWLT             EW        EWNA      2
EWNA             NALT             EW        EWNA      1
EWNA             EWEALT           EW        EWEA      1
EWNA             EWEANA           EWNASA    EWEANA    2
EWAANA           EWNA             EWAFSA    EWAFEA    1
EWEANAPI         EWEALT           EWEA      EWEA      1
EWEAPI           EW               EWEAPI    EWEA      1
EWNALT           EW               EW        EW        1
EWNALT           EW               EW        EWSA      1
LT               EW               EW        EW        3
LT               EW               EWNA      EWNA      1
LT               EW               EWNA      EWNASA    1
LT               EW               EWAFNA    EWNA      1
LT               EWNA             EWAFSA    EWAFNASA  1
NA               EW               EW        EW        18
NALT             EW               EWNA      EW        1
NA               EWNA             EW        EW        1
AALT             LT               EWAFNA    EWAFNA    2
AALT             LT               EWNA      EWNASA    1
AASA             EWLT             EASA      EWSA      1
AAEALT           EWLT             EWNA      EWNA      1
AAEASA           SA               PISA      PISA      1
AALTSA           LT               EWNA      EWNA      1
EA               EW               EWEA      EWEA      1
EA               EALT             EWNA      EWEA      1
EA               EWEA             EW        EWEA      1
EA               EWEALT           EW        EWEA      1

Parent Self-Report, Child EHR
EW               Other/Uncertain  EW        EW        1
EW               Other/Uncertain  EW        EWNASA    1
EW               Latino           EW        EWNA      3
EWLT             Other/Uncertain  EW        EW        1
AA               Asian            EWAF      EWAFEA    1
EA               Latino           EA        EA        1
NA               Asian            EWNA      EWEANA    1

Parent EHR, Child Self-Report
Latino           EW               EW        EW        1
Other/Uncertain  EW               EW        EW        2
White            EWLT             EW        EW        2
White            EWLT             EW        EWNA      3
White            EWLT             EWNA      EW        1
White            EWNALT           EW        EW        1
White            EWNALT           EW        EWNA      1
White            NA               EW        EW        3
White            EWEA             EW        EWEA      1
Other/Uncertain  EWAALT           EWAF      EWAF      1
White            EWEALT           EW        EWEA      1

Table S1 Magnitude of correlation of PC loadings for three supersets

PC    Set1-Set2 correlation    Set1-Set3 correlation    Set2-Set3 correlation
1     0.981                    0.979                    0.980
2     0.876                    0.853                    0.856
3     0.850                    0.816                    0.809
4     0.766                    0.776                    0.752
5     0.658                    0.654                    0.639
6     0.605                    0.604                    0.592
7     0.001                    0.001                    0.210
8     0.043                    0.179                    0.045
9     0.312                    0.068                    0.089
10    0.007                    0.067                    0.201

Table S10 RaceEthnicity and Genetic Ancestry for Sib Pairs Discordant for RaceEthnicity Abbreviations as in Table S6

Sib 1 RaceEthnicity Sib 2 RaceEthnicity Genetic Ancestry Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

28 SI Y Banda et al

Table S11 RaceEthnicity and Genetic Ancestry for Parent‐Child pairs Discordant for RaceEthnicity Abbreviations as in Table S6

Parent RaceEthnicity Child RaceEthnicity

Parent Genetic Ancestry

Child Genetic Ancestry

Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1

Y Banda et al 29 SI

EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1

Y Banda et al 17 SI

Table S4. Numbers of SNPs remaining after LD and structural variation locus pruning for each of the eight different PCAs.

PCA run                              Number of SNPs after pruning
European/West Asian and Ashkenazi    38050
European only                        37967
South Asian                          38823
East Asian                           38077
Pacific Islander                     38833
African American                     39849
Latino                               38365
All GERA                             38301

Table S5. Individual ancestral admixture proportions for subjects run on the LAT array.

Ancestral admixture proportion (%)

Nationality               African        European       Native American
Mexican                   2.3 ± 5.0      67.0 ± 14.5    30.7 ± 13.5
Central-South American    5.5 ± 10.8     65.4 ± 17.9    29.1 ± 16.3
Puerto-Rican              12.3 ± 14.1    75.5 ± 14.8    12.4 ± 7.6
Cuban                     12.7 ± 20.5    79.7 ± 20.0    7.6 ± 8.1
LAT mean                  4.4 ± 7.9      66.5 ± 15.8    28.6 ± 14.8

Table S6. Distribution of genetic ancestry by self-reported race/ethnicity. A particular genetic ancestry was assigned to an individual if at least 5% of that individual's ancestry was estimated from that group. Race/ethnicity abbreviations as in Table S2. Genetic ancestry abbreviations are the same, except for AF, which represents sub-Saharan African.

Rows: self-reported race/ethnicity. Columns: genetic ancestry, in this order:

EW AF EA NA SA EWAF EWEA EWNA EWSA AFEA AFSA EANA EAPI EASA PISA EWAFEA EWAFNA EWAFSA EWEANA EWEAPI EWEASA EWNASA EWPISA AFEAPI AFEASA AFPISA EAPISA All

EW 71992 191 243 646 3208 1 27 9 11 9 24 38 1 1 76401

AA 230 2349 5 1 7 9 28 46 1 1 1 1 2679

EA 4835 2 108 23 1267 21 4 8 93 4 1 23 6389

NA 555 5 3 67 11 1 1 7 2 10 12 674

LT 232 1 1 1 34 3094 4 1 1 1 2 1294 1 33 107 4807

PI 4 3 19 1 13 3 4 32 1 12 92

SA 1 15 256 1 1 132 1 24 16 2 3 1 7 460

EWAA 107 2 3 10 1 123

EWEA 32 16 431 1 7 3 7 29 46 572

EWNA 2248 16 2 206 34 1 4 11 3 23 2548

EWLT 404 25 5 929 17 1 84 1 9 89 1564

EWPI 8 7 1 1 30 1 48

EWSA 14 25 2 1 2 44

AAEA 1 1 1 1 7 1 16 1 29

AANA 91 1 4 3 99

AALT 26 43 1 40 1 2 1 114

AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANA 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8

EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489
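The assignment rule stated in the Table S6 caption can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: it assumes the threshold is 5% of an individual's estimated ancestry, that admixture proportions arrive as a mapping from group code to a fraction between 0 and 1, and that labels are concatenated in the order the column labels suggest. The function name `assign_ancestry` and the example proportions are hypothetical.

```python
from typing import Mapping

# Group codes in the order the Table S6 column labels appear to concatenate
# them (placing NA before PI is an assumption; no column contains both).
GROUP_ORDER = ("EW", "AF", "EA", "NA", "PI", "SA")

# Assumed threshold: a group counts if it contributes at least 5% of ancestry.
THRESHOLD = 0.05


def assign_ancestry(proportions: Mapping[str, float],
                    threshold: float = THRESHOLD) -> str:
    """Return a concatenated ancestry label such as "EWNA" for one individual."""
    kept = [g for g in GROUP_ORDER if proportions.get(g, 0.0) >= threshold]
    return "".join(kept)


# Hypothetical individual: mostly European and Native American ancestry,
# with African and East Asian components below the threshold.
example = {"EW": 0.62, "NA": 0.33, "AF": 0.04, "EA": 0.01}
print(assign_ancestry(example))
```

Under this rule, small components below the threshold are ignored, so an individual with 4% estimated African ancestry is still counted in the two-continent EWNA column rather than a three-continent one.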

Table S7. Distribution of genetic ancestry by race/ethnicity as reported in the KP electronic health records, for those with missing or mis-scanned self-report race/ethnicity. Abbreviations as in Table S6.

Race/Ethnicity       EW     AF     EA     SA   EWAF   EANA   EAPI   EWEA   EWNA   EWSA   EASA   SAPI EWAFEA EWAFNA EWAFSA EWEANA EWEAPI EWEASA EWSANA EWSAPI EASAPI  Total
White              4302      0      0      0     25      0      0     33     68    131      0      0      0      3      2      3      2      3      2      1      0   4575
Afr Am                0      6      0      0     92      0      0      0      1      0      0      0      0      1      2      0      0      0      0      0      0    102
Asian                 4      0    218      8      0      1     41     18      0      2      3      1      1      0      0      1      4      3      0      0      6    311
Latino               37      0      3      0      3      0      0      2    151      0      0      0      1     44      1      5      0      0      8      0      0    255
Other-uncertain      34      0      3      0      6      0      0     15      8      2      1      0      2      2      1      3      4      0      1      0      2     84

Table S8. Distribution of continental genetic ancestry as a function of self-reported race/ethnicity.

Race/Ethnicity    One continent     Two continents    Three continents       All   Continental distribution
                    Number       %    Number       %    Number       %    Number   1 continent   2 continents                    3 continents
One    EW            71992    94.2      4288     5.6       121     0.2     76401   100% EW       75% EWSA, 15% EWNA
One    AA              230     8.6      2362    88.2        87     3.2      2679   100% AF       99% AFEW
One    LT              235     4.9      3135    65.2      1437    29.9      4807   99% EW        99% EWNA                        90% EWNAAF
One    EA             4837    75.7      1419    22.2       133     2.1      6389   100% EA       89% EAPI, 8% EAEW
One    PI                7     7.6        40    43.5        45    48.9        92                 48% PIEW, 33% PIEA              71% PIEWEA
One    SA              272    59.1       175    38.1        13     2.8       460   94% SA        75% SAEW, 14% SAEA
One    NA              555    82.3        87    12.9        32     4.8       674   100% EW       77% EWNA
One    All           78128    85.4     11506    12.6      1868     2.0     91502
       % of Total                                                           93.9
Two    All            2791    51.0      2140    39.1       545    10.0      5476   97% EW        60% EWNA, 22% EWEA, 13% EWAF    29% EWNAAF, 23% EWSANA, 17% EWEANA
       % of Total                                                            5.6
Three  All              48     9.4       345    67.5       118    23.1       511   90% EW        45% EWNA, 36% EWAF, 19% EWEA    31% EWEANA, 20% EWAFNA, 16% EWSANA
       % of Total                                                            0.5
All    All           80967    83.1     13991    14.4      2531     2.6     97489
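The percentage entries in Table S8 appear to be each continent count divided by the row total, expressed to one decimal place. The short check below recomputes them for two rows; it is only a sketch, the counts are transcribed from the table, and the helper name `row_percentages` is hypothetical.

```python
# Counts transcribed from Table S8 for two single-group rows, in the order
# (one continent, two continents, three continents).
counts = {
    "EW": (71992, 4288, 121),
    "LT": (235, 3135, 1437),
}


def row_percentages(row):
    """Return the row total and each count as a percentage of that total."""
    total = sum(row)
    return total, [round(100 * n / total, 1) for n in row]


for group, row in counts.items():
    total, pcts = row_percentages(row)
    print(group, total, pcts)
```

For the EW row this gives a total of 76401 with 94.2%, 5.6% and 0.2%, and for the LT row a total of 4807 with 4.9%, 65.2% and 29.9%.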

Table S9. First-degree relatives organized by self-reported race/ethnicity.

MZ pair

                    White  African American  Latino  Asian  Other/Uncertain
White                  28                 0       0      0                0
African American        0                 0       0      0                0
Latino                  0                 0       2      0                0
Asian                   0                 0       0      4                0
Other/Uncertain         0                 0       0      0                0

Parent (column) – Offspring (row)

                    White  African American  Latino  Asian  Other/Uncertain
White                3044                 1      23      6               21
African American        8                54       0      1                1
Latino                122                 6     175      3                0
Asian                  35                 2       0    205                1
Other/Uncertain        32                 0       1      0                0

Full sibs

                    White  African American  Latino  Asian  Other/Uncertain
White                1531                 2      33      5               30
African American                         45       7      0                0
Latino                                          155      2                2
Asian                                                  205                1
Other/Uncertain                                                           0
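The concordance rates quoted in the abstract (93% for parent–child pairs, 96% for full-sib pairs) can be recovered from the Table S9 counts by dividing the diagonal by the grand total. The snippet below is a sketch of that arithmetic only; the counts are transcribed from the table, the variable names are hypothetical, and treating the Other/Uncertain diagonal cell as concordant is an assumption (the cell is 0 in both matrices, so it does not change the result).

```python
# Rows and columns in the order: White, African American, Latino, Asian,
# Other/Uncertain. Counts transcribed from Table S9.
parent_offspring = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

# Full-sib pairs are unordered, so the table lists only the upper triangle;
# each inner list starts at the diagonal cell.
full_sibs_upper = [
    [1531, 2, 33, 5, 30],
    [45, 7, 0, 0],
    [155, 2, 2],
    [205, 1],
    [0],
]

po_total = sum(sum(row) for row in parent_offspring)
po_concordant = sum(parent_offspring[i][i] for i in range(5))

sib_total = sum(sum(row) for row in full_sibs_upper)
sib_concordant = sum(row[0] for row in full_sibs_upper)

print(po_total, round(100 * po_concordant / po_total, 1))
print(sib_total, round(100 * sib_concordant / sib_total, 1))
```

The totals come out to 3741 parent–child pairs and 2018 full-sib pairs, matching the pair counts given in the abstract.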

Table S10. Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity   Sib 2 Race/Ethnicity   Genetic Ancestry   Number

Both Sibs Self‐Report

EW NA EW 26

EW LT EW 6

EW LT EWNA 1

EW AA EWAF 1

EW EWLT EW 15

EW EWLT EWNA 6

EW EWLT EWAFNA 1

EW EWAALT EWAF 1

EW EWEA EW 3

EW EWEA EWEA 1

EW EWSA EW 1

EWNA NA EW 1

EWNA NA EWNA 1

EWNA NA EWAF 1

EWNA EWNALT EW 1

EWLT NA EW 1

EWLT AALTSA EWAF 1

EWLT EWAANALT EWAF 1

EWEA LT EWNA 1

EWAANALTSA LT EWNA 1

EWLTPI PI EWEAPI 1

LT AALT EWNA 3

LT AALT EWAFNA 1

One Sib Self‐Report One Sib EHR

EW Latino EW 1

EWLT OtherUncertain EW 1

LT White EW 1

LTEA OtherUncertain EWEAPI 1

EAPI OtherUncertain EA 1

Both Sibs EHR

White Latino EWNA 1

Table S11. Race/Ethnicity and Genetic Ancestry for Parent-Child Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent Race/Ethnicity   Child Race/Ethnicity   Parent Genetic Ancestry   Child Genetic Ancestry   Number

Parent and Child Self‐Report

EW EWAA EW EWAF 4

EW EWAANA EW EWAF 1

EW EWEA EW EW 3

EW EWEA EW EWEA 19

EW EWEA EW EWEASA 3

EW EWEA EWSA EW 1

EW EWEA EWSA EWEA 1

EW EWLT EW EW 18

EW EWLT EW EWNA 41

EW EWLT EW EWSA 2

EW EWLT EW EWNASA 2

EW EWLT EWNA EW 2

EW EWLT EWNA EWNA 5

EW EWLT EWNA EWNA 1

EW EWLT EWSA EWNA 1

EW EWLT EWSA EWNASA 1

EW EWLT EW EWEA 1

EW EWLT EW EWEANA 1

EW EWLT EWNA EWAFEA 1

EW EWEALT EW EWEA 1

EW EWEALT EW EWEANA 1

EW EWEANA EW EWEA 1

EW EWEANALT EW EWEA 1

EW EWEAPI EW EWEAPI 1

EW EWNALT EW EWAF 1

EW EWNALT EW EWNA 2

EW EWNALT EW EWNASA 1

EW EWPI EW EWEA 1

EW EWPI EW EWEAPI 1

EW AA EW EWAF 1

EW AALT EW EWNA 1

EW EA EW EWEA 1

EW LT EW EW 6

EW LT EW EWNA 9

EW LT EW EWNASA 2

EW LT EWNA EWNA 3

EW LT EW EWAFNA 1

EW NA EW EW 24

EW NA EWSA EW 1

EW NA EWSA EWSA 1

EWAA EWAA EW EWAF 1

EWEA EW EW EW 3

EWLT EW EW EW 6

EWLT EW EW EWNA 1

EWLT EW EWEA EWEA 1

EWLT EW EWNA EW 3

EWLT EW EWNA EWNA 1

EWEA EWLT EW EW 1

EWEA EW EWEA EWEA 1

EWEA EWAAPI EWEAPI EWAFEA 1

EWNA NA EWNA EWNA 1

EWNA EWLT EW EWNA 2

EWNA NALT EW EWNA 1

EWNA EWEALT EW EWEA 1

EWNA EWEANA EWNASA EWEANA 2

EWAANA EWNA EWAFSA EWAFEA 1

EWEANAPI EWEALT EWEA EWEA 1

EWEAPI EW EWEAPI EWEA 1

EWNALT EW EW EW 1

EWNALT EW EW EWSA 1

LT EW EW EW 3

LT EW EWNA EWNA 1

LT EW EWNA EWNASA 1

LT EW EWAFNA EWNA 1

LT EWNA EWAFSA EWAFNASA 1

NA EW EW EW 18

NALT EW EWNA EW 1

NA EWNA EW EW 1

AALT LT EWAFNA EWAFNA 2

AALT LT EWNA EWNASA 1

AASA EWLT EASA EWSA 1

AAEALT EWLT EWNA EWNA 1

AAEASA SA PISA PISA 1

AALTSA LT EWNA EWNA 1

EA EW EWEA EWEA 1

EA EALT EWNA EWEA 1

EA EWEA EW EWEA 1

EA EWEALT EW EWEA 1

Parent Self‐Report Child EHR

EW OtherUncertain EW EW 1

EW OtherUncertain EW EWNASA 1

EW Latino EW EWNA 3

EWLT OtherUncertain EW EW 1

AA Asian EWAF EWAFEA 1

EA Latino EA EA 1

NA Asian EWNA EWEANA 1

Parent EHR Child Self‐Report

Latino EW EW EW 1

OtherUncertain EW EW EW 2

White EWLT EW EW 2

White EWLT EW EWNA 3

White EWLT EWNA EW 1

White EWNALT EW EW 1

White EWNALT EW EWNA 1

White NA EW EW 3

White EWEA EW EWEA 1

OtherUncertain EWAALT EWAF EWAF 1

White EWEALT EW EWEA 1

Page 31: Characterizing Race/Ethnicity and Genetic Ancestry … › ... › genetics › 200 › 4 › 1285.full.pdfSelf-reported African Americans and Latinos show ed extensive European and

20 SI Y Banda et al

AAPI 1 1 1 1 4

AASA 1 6 1 1 1 1 1 1 13

EANT 2 1 3 6

EALT 15 10 6 5 3 1 54 1 95

EAPI 18 1 11 1 8 1 40

EASA 12 3 1 1 17

NALT 3 1 1 98 1 17 3 5 129

NAPI 1 1

LTPI 3 5 2 1 1 12

LTSA 1 1 3 1 2 2 10

PISA 5 1 2 8

EWAAEA 1 2 2 5

EWAANA 1 107 1 4 2 115

EWAALT 1 9 5 5 1 1 1 23

EWAASA 1 1

EWAAPI 1 1

EWEANA 2 20 1 5 2 2 32

EWEALT 2 19 4 2 19 2 48

EWEAPI 17 2 1 15 35

EWNALT 34 2 135 2 10 1 14 198

EWNAPI 1 1 2

EWLTPI 3 2 1 1 1 8

Y Banda et al 21 SI

EWLTSA 2 1 1 4

EWPISA 1 1

AAEANA 1 1 4 1 1 8

AAEALT 1 1 2 4

AAEASA 2 1 3

AANALT 1 1 2

AALTSA 1 3 2 1 7

EANALT 8 8

EALTPI 1 1 2

EAPISA 1 1

NALTPI 2 1 3

Total 75532 235 4920 1 279 2971 894 5260 3436 9 10 26 1308 52 25 52 1540 81 194 219 91 300 1 2 1 2 48 97489

22 SI Y Banda et al

Table S7 Distribution of genetic ancestry by raceethnicity as reported in the KP electronic health records for those with missing or mis‐scanned self‐report raceethnicity Abbreviations as in Table S6

Race‐Ethnicity EW AF EA SA EW

AF EA NA

EA PI

EW EA

EW NA

EW SA

EA SA

SA PI

EW AF EA

EW AF NA

EW AF SA

EW EA NA

EW EA PI

EW EA SA

EW SA NA

EW SA PI

EA SA PI

Total

White 4302 0 0 0 25 0 0 33 68 131 0 0 0 3 2 3 2 3 2 1 0 4575

Afr Am 0 6 0 0 92 0 0 0 1 0 0 0 0 1 2 0 0 0 0 0 0 102

Asian 4 0 218 8 0 1 41 18 0 2 3 1 1 0 0 1 4 3 0 0 6 311

Latino 37 0 3 0 3 0 0 2 151 0 0 0 1 44 1 5 0 0 8 0 0 255

Other‐uncertain

34 0 3 0 6 0 0 15 8 2 1 0 2 2 1 3 4 0 1 0 2 84

Y Banda et al 23 SI

Table S8 Distribution of continental genetic ancestry as a function of self‐reported raceethnicity

Race‐Ethnicity Genetic Ancestry

One Continent Two Continents Three Continents All Continental Distribution

Number Number Number Number 1 Continent 2 Continents 3 Continents

One EW 71992 942 4288 56 121 02 76401 100 EW 75 EWSA

15 EWNA

AA 230 86 2362 882 87 32 2679 100 AF 99 AFEW

LT 235 49 3135 652 1437 299 4807 99 EW 99 EWNA 90 EWNAAF

EA 4837 757 1419 222 133 21 6389 100 EA 89 EAPI

8 EAEW

PI 7 76 40 435 45 489 92 48 PIEW

33 PIEA

71 PIEWEA

SA 272 591 175 381 13 28 460 94 SA 75 SAEW

14 SAEA

24 SI Y Banda et al

NA 555 823 87 129 32 48 674 100 EW 77 EWNA

All 78128 854 11506 126 1868 20 91502

of

Total

939

Two All 2791 510 2140 391 545 100 5476 97 EW

60 EWNA

22 EWEA

13 EWAF

29 EWNAAF

23 EWSANA

17 EWEANA

of

Total

56

Three All 48 94 345 675 118 231 511 90 EW 45 EWNA

36 EWAF

19 EWEA

31 EWEANA

20 EWAFNA

16 EWSANA

of

Total

05

All All 80967 831 13991 144 2531 26 97489

Y Banda et al 25 SI

Table S9 First degree relatives organized by self-reported race/ethnicity

MZ pair
                  White  African American  Latino  Asian  Other/Uncertain
White                28                 0       0      0                0
African American      0                 0       0      0                0
Latino                0                 0       2      0                0
Asian                 0                 0       0      4                0
Other/Uncertain       0                 0       0      0                0

Parent (column) – Offspring (row)
                  White  African American  Latino  Asian  Other/Uncertain
White              3044                 1      23      6               21
African American      8                54       0      1                1
Latino              122                 6     175      3                0
Asian                35                 2       0    205                1
Other/Uncertain      32                 0       1      0                0

Full sibs
                  White  African American  Latino  Asian  Other/Uncertain
White              1531                 2      33      5               30
African American                       45       7      0                0
Latino                                        155      2                2
Asian                                                205                1
Other/Uncertain                                                         0
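The concordance rates quoted in the abstract (93% of 3741 parent-child pairs and 96% of 2018 full-sib pairs) can be recovered from the diagonals of the Table S9 matrices. The following is a minimal sketch under stated assumptions: the counts are transcribed by hand from the table, the full-sib matrix is stored as an upper triangle because sib pairs are unordered, and the function name `concordance` is introduced here for illustration.

```python
# Group order used for both rows and columns of each matrix.
GROUPS = ["White", "African American", "Latino", "Asian", "Other/Uncertain"]

# Parent-offspring pairs from Table S9 (rows: offspring, columns: parent).
parent_child = [
    [3044, 1, 23, 6, 21],
    [8, 54, 0, 1, 1],
    [122, 6, 175, 3, 0],
    [35, 2, 0, 205, 1],
    [32, 0, 1, 0, 0],
]

# Full-sib pairs are unordered, so only the upper triangle is filled;
# the zeros below the diagonal are placeholders, not observed counts.
full_sibs = [
    [1531, 2, 33, 5, 30],
    [0, 45, 7, 0, 0],
    [0, 0, 155, 2, 2],
    [0, 0, 0, 205, 1],
    [0, 0, 0, 0, 0],
]


def concordance(matrix):
    """Return (concordant pairs, all pairs, concordant fraction)."""
    total = sum(sum(row) for row in matrix)
    concordant = sum(matrix[i][i] for i in range(len(matrix)))
    return concordant, total, concordant / total


pc_concordant, pc_total, pc_rate = concordance(parent_child)
fs_concordant, fs_total, fs_rate = concordance(full_sibs)

print(f"parent-child: {pc_concordant}/{pc_total} = {pc_rate:.1%}")
print(f"full sibs:    {fs_concordant}/{fs_total} = {fs_rate:.1%}")
print(f"discordant full-sib pairs: {fs_total - fs_concordant}")
```

The number of discordant full-sib pairs obtained this way equals the sum of the Number column in Table S10.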


Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1           Sib 2           Genetic         Number
Race/Ethnicity  Race/Ethnicity  Ancestry

Both Sibs Self-Report
EW              NA              EW              26
EW              LT              EW              6
EW              LT              EW/NA           1
EW              AA              EW/AF           1
EW              EW/LT           EW              15
EW              EW/LT           EW/NA           6
EW              EW/LT           EW/AF/NA        1
EW              EW/AA/LT        EW/AF           1
EW              EW/EA           EW              3
EW              EW/EA           EW/EA           1
EW              EW/SA           EW              1
EW/NA           NA              EW              1
EW/NA           NA              EW/NA           1
EW/NA           NA              EW/AF           1
EW/NA           EW/NA/LT        EW              1
EW/LT           NA              EW              1
EW/LT           AA/LT/SA        EW/AF           1
EW/LT           EW/AA/NA/LT     EW/AF           1
EW/EA           LT              EW/NA           1
EW/AA/NA/LT/SA  LT              EW/NA           1
EW/LT/PI        PI              EW/EA/PI        1
LT              AA/LT           EW/NA           3
LT              AA/LT           EW/AF/NA        1

One Sib Self-Report, One Sib EHR
EW              Latino          EW              1
EW/LT           Other/Uncertain EW              1
LT              White           EW              1
LT/EA           Other/Uncertain EW/EA/PI        1
EA/PI           Other/Uncertain EA              1

Both Sibs EHR
White           Latino          EW/NA           1


Table S11 Race/Ethnicity and Genetic Ancestry for Parent-Child pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Parent          Child           Parent Genetic  Child Genetic   Number
Race/Ethnicity  Race/Ethnicity  Ancestry        Ancestry

Parent and Child Self-Report
EW              EW/AA           EW              EW/AF           4
EW              EW/AA/NA        EW              EW/AF           1
EW              EW/EA           EW              EW              3
EW              EW/EA           EW              EW/EA           19
EW              EW/EA           EW              EW/EA/SA        3
EW              EW/EA           EW/SA           EW              1
EW              EW/EA           EW/SA           EW/EA           1
EW              EW/LT           EW              EW              18
EW              EW/LT           EW              EW/NA           41
EW              EW/LT           EW              EW/SA           2
EW              EW/LT           EW              EW/NA/SA        2
EW              EW/LT           EW/NA           EW              2
EW              EW/LT           EW/NA           EW/NA           5
EW              EW/LT           EW/NA           EW/NA           1
EW              EW/LT           EW/SA           EW/NA           1
EW              EW/LT           EW/SA           EW/NA/SA        1
EW              EW/LT           EW              EW/EA           1
EW              EW/LT           EW              EW/EA/NA        1
EW              EW/LT           EW/NA           EW/AF/EA        1
EW              EW/EA/LT        EW              EW/EA           1
EW              EW/EA/LT        EW              EW/EA/NA        1
EW              EW/EA/NA        EW              EW/EA           1
EW              EW/EA/NA/LT     EW              EW/EA           1
EW              EW/EA/PI        EW              EW/EA/PI        1
EW              EW/NA/LT        EW              EW/AF           1
EW              EW/NA/LT        EW              EW/NA           2
EW              EW/NA/LT        EW              EW/NA/SA        1
EW              EW/PI           EW              EW/EA           1
EW              EW/PI           EW              EW/EA/PI        1
EW              AA              EW              EW/AF           1
EW              AA/LT           EW              EW/NA           1
EW              EA              EW              EW/EA           1
EW              LT              EW              EW              6
EW              LT              EW              EW/NA           9
EW              LT              EW              EW/NA/SA        2
EW              LT              EW/NA           EW/NA           3
EW              LT              EW              EW/AF/NA        1
EW              NA              EW              EW              24
EW              NA              EW/SA           EW              1
EW              NA              EW/SA           EW/SA           1
EW/AA           EW/AA           EW              EW/AF           1
EW/EA           EW              EW              EW              3
EW/LT           EW              EW              EW              6
EW/LT           EW              EW              EW/NA           1
EW/LT           EW              EW/EA           EW/EA           1
EW/LT           EW              EW/NA           EW              3
EW/LT           EW              EW/NA           EW/NA           1
EW/EA           EW/LT           EW              EW              1
EW/EA           EW              EW/EA           EW/EA           1
EW/EA           EW/AA/PI        EW/EA/PI        EW/AF/EA        1
EW/NA           NA              EW/NA           EW/NA           1
EW/NA           EW/LT           EW              EW/NA           2
EW/NA           NA/LT           EW              EW/NA           1
EW/NA           EW/EA/LT        EW              EW/EA           1
EW/NA           EW/EA/NA        EW/NA/SA        EW/EA/NA        2
EW/AA/NA        EW/NA           EW/AF/SA        EW/AF/EA        1
EW/EA/NA/PI     EW/EA/LT        EW/EA           EW/EA           1
EW/EA/PI        EW              EW/EA/PI        EW/EA           1
EW/NA/LT        EW              EW              EW              1
EW/NA/LT        EW              EW              EW/SA           1
LT              EW              EW              EW              3
LT              EW              EW/NA           EW/NA           1
LT              EW              EW/NA           EW/NA/SA        1
LT              EW              EW/AF/NA        EW/NA           1
LT              EW/NA           EW/AF/SA        EW/AF/NA/SA     1
NA              EW              EW              EW              18
NA/LT           EW              EW/NA           EW              1
NA              EW/NA           EW              EW              1
AA/LT           LT              EW/AF/NA        EW/AF/NA        2
AA/LT           LT              EW/NA           EW/NA/SA        1
AA/SA           EW/LT           EA/SA           EW/SA           1
AA/EA/LT        EW/LT           EW/NA           EW/NA           1
AA/EA/SA        SA              PI/SA           PI/SA           1
AA/LT/SA        LT              EW/NA           EW/NA           1
EA              EW              EW/EA           EW/EA           1
EA              EA/LT           EW/NA           EW/EA           1
EA              EW/EA           EW              EW/EA           1
EA              EW/EA/LT        EW              EW/EA           1

Parent Self-Report, Child EHR
EW              Other/Uncertain EW              EW              1
EW              Other/Uncertain EW              EW/NA/SA        1
EW              Latino          EW              EW/NA           3
EW/LT           Other/Uncertain EW              EW              1
AA              Asian           EW/AF           EW/AF/EA        1
EA              Latino          EA              EA              1
NA              Asian           EW/NA           EW/EA/NA        1

Parent EHR, Child Self-Report
Latino          EW              EW              EW              1
Other/Uncertain EW              EW              EW              2
White           EW/LT           EW              EW              2
White           EW/LT           EW              EW/NA           3
White           EW/LT           EW/NA           EW              1
White           EW/NA/LT        EW              EW              1
White           EW/NA/LT        EW              EW/NA           1
White           NA              EW              EW              3
White           EW/EA           EW              EW/EA           1
Other/Uncertain EW/AA/LT        EW/AF           EW/AF           1
White           EW/EA/LT        EW              EW/EA           1








Table S10 Race/Ethnicity and Genetic Ancestry for Sib Pairs Discordant for Race/Ethnicity. Abbreviations as in Table S6.

Sib 1 Race/Ethnicity  Sib 2 Race/Ethnicity  Genetic Ancestry  Number

Both Sibs Self‐Report

EW                    NA                    EW                26
EW                    LT                    EW                6
EW                    LT                    EWNA              1
EW                    AA                    EWAF              1
EW                    EWLT                  EW                15
EW                    EWLT                  EWNA              6
EW                    EWLT                  EWAFNA            1
EW                    EWAALT                EWAF              1
EW                    EWEA                  EW                3
EW                    EWEA                  EWEA              1
EW                    EWSA                  EW                1
EWNA                  NA                    EW                1
EWNA                  NA                    EWNA              1
EWNA                  NA                    EWAF              1
EWNA                  EWNALT                EW                1
EWLT                  NA                    EW                1
EWLT                  AALTSA                EWAF              1
EWLT                  EWAANALT              EWAF              1
EWEA                  LT                    EWNA              1
EWAANALTSA            LT                    EWNA              1
EWLTPI                PI                    EWEAPI            1
LT                    AALT                  EWNA              3
LT                    AALT                  EWAFNA            1

One Sib Self‐Report, One Sib EHR

EW                    Latino                EW                1
EWLT                  Other/Uncertain       EW                1
LT                    White                 EW                1
LTEA                  Other/Uncertain       EWEAPI            1
EAPI                  Other/Uncertain       EA                1

Both Sibs EHR

White                 Latino                EWNA              1
