Analysis of Y-DNA Simple Tandem Repeats (STRs) For The Coker DNA Project

by Steven James Coker 5 December 2017

A. Introduction.

Genetic genealogy provides information that helps guide and validate research conducted by genealogists. This paper examines clade grouping for The Coker DNA Project based on Y-DNA Simple Tandem Repeat (STR) data. STRs are also referred to as Short Tandem Repeats.

The Coker DNA Project at Family Tree DNA (FTDNA) is administered by Beth Gay and the author. The project includes autosomal, mtDNA, and Y-DNA test data. The Y-DNA data in the project includes both STR and Single Nucleotide Polymorphism (SNP) data. STR data used for this report is identified by the sample number and Terminal SNP for the sample. The International Society of Genetic Genealogy (ISOGG) defines “Terminal SNP” as the defining SNP of the latest subclade known by current research. Men who share the same Terminal SNP therefore fall into a common clade. A clade is a group that descends from a common ancestor. The common ancestor could be ancient or recent. With the current state of knowledge, for most Terminal SNPs one can’t tell how recently the shared ancestor lived. STRs are often more useful for determining the closeness in time to the common ancestor. Genealogical lineages for STR samples shown in this report are posted in the Coker DNA Project at https://www.worldfamilies.net/surnames/coker/pats

The generally accepted explanation for the etymology of the Coker surname is that it is a locative name, taken from the area around the villages of East Coker and West Coker in Somerset, UK. For locative names, one expects to find a variety of genetically distinct Y-DNA lineages with the surname, because various unrelated families adopted the surname after having at some time lived in, or near, the location. By unrelated, I mean that they do not share a common direct male ancestor for many hundreds, more likely thousands, of years. They may in fact be related through other lines, although descended from different male lineages.

It is beyond the scope of this paper to explore in detail the claimed genealogies of the men in the DNA Project. There is, however, a common pattern which is interesting and consistent with the findings given here. Many genealogies for Coker men in this study report that their Coker families came from England, often into Virginia in the 1600s or early 1700s. Other Coker families are known to have migrated at various times to other ports and countries. Thus, the genealogical background for this study presents a picture of a variety of unrelated families, unrelated in the sense of not sharing a common direct male ancestor. What does the Y-DNA show?

B. Statement of Hypothesis.

The hypothesis to be examined is that the STR data compiled in the Coker DNA Project can be taken as evidence that the Coker surname includes various distinct clades.

C. Methods.

1. DNA collection. The Coker DNA Project was started in 2005 and has grown slowly as more interested people have joined. The project currently has 96 members who have submitted Y-DNA samples. Of those, 55 have been identified as descendants of men with the Coker surname. The remainder include descendants from other lineages, as well as men who descend from a Coker female and thus did not inherit the Coker Y-DNA. The DNA project includes data on STRs and SNPs for Y-DNA. The project also includes mtDNA and autosomal data. This analysis is limited to Y-DNA data.

2. Data Testing. The DNA samples for this project have been collected and tested by the FTDNA company in Houston, Texas. All samples and data are managed, stored, and reported by FTDNA. The data is available for public viewing at the following web site:

https://www.familytreedna.com/groups/Coker/dna-results

However, some data may be restricted from view by the DNA owner. FTDNA does not take ownership of the data from individual samples. The person who submits the sample controls viewing access for data from their sample. All data is accessible to the project administrators. The author is an administrator of the project.

Example of Individual Report of STR Data.

3. Data. For this paper, I used Y-DNA STR data. I further restricted the data set to samples that had tested 111 markers. Lastly, I used only samples in the R Haplogroup with Coker surname ancestors in clades with at least two members. These filters left 36 subjects with data samples for study.

5. Phylogenetic Methodology. The Y-DNA STR data was compiled into an Excel spreadsheet from which a FASTA data file was created. The FASTA data was imported into MEGA. The data was aligned in MEGA using MUSCLE with Neighbor Joining as the clustering method. Phylogenetic analysis of the aligned data was done in MEGA using both the Neighbor-Joining and the Maximum Likelihood methods. Similar results were obtained from both methods.

6. To estimate the Genetic Distance and Time to Most Recent Common Ancestor (TMRCA), I used the program by Dean McGee at http://www.mymcgee.com/tools/yutility111.html

D. Discussion and Procedures.

1. Converting STR Counts to ATCG. The STR results are reported by FTDNA as repetition counts. For this analysis, I wanted to convert the counts into the ATCG nucleotide sequences for which the counts are a user-friendly shorthand. This turned out to be much harder than expected. FTDNA does not publish or provide information for converting the counts into ATCG format. After considerable research, the best conversions I found were given in the NIST table shown below. The repeat motifs in the table were sufficient for many of the STRs, but some of the STRs were not included in the table. There were 102 STR marker values available in the data set, including multi-value markers. I found usable ATCG conversions for 72 markers. Thus, the analysis was limited to 72 STR markers.
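The conversion from repeat counts to nucleotide sequences and then to FASTA can be sketched as follows. This is a minimal illustration, not the spreadsheet procedure actually used: the three motifs are examples that should be checked against the NIST fact sheets, and the sample labels and count values are hypothetical.

```python
# Illustrative repeat motifs only; verify each one against the NIST STRBase
# fact sheets before use. Dict order fixes the marker order in the sequence.
MOTIFS = {"DYS393": "AGAT", "DYS19": "TAGA", "DYS391": "TCTA"}


def str_counts_to_sequence(counts, motifs=MOTIFS):
    """Expand each repeat count into motif * count and concatenate the results."""
    return "".join(motifs[marker] * counts[marker] for marker in motifs)


def to_fasta(samples, motifs=MOTIFS):
    """Build FASTA text with one record per sample label."""
    records = [
        f">{label}\n{str_counts_to_sequence(counts, motifs)}"
        for label, counts in samples.items()
    ]
    return "\n".join(records) + "\n"


# Hypothetical samples; a real label would carry sample ID and Terminal SNP.
samples = {
    "sample_A": {"DYS393": 13, "DYS19": 14, "DYS391": 11},
    "sample_B": {"DYS393": 13, "DYS19": 14, "DYS391": 10},
}
fasta_text = to_fasta(samples)
```

Because a different repeat count gives a sequence of different length, the expanded sequences are generally unequal in length, which is why an alignment step is needed before tree building.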

http://www.cstl.nist.gov/biotech/strbase/ystr_fact.htm

2. Pruning Data. A total of 96 Y-DNA STR data samples exist in the project. Of these, 65 had tested fully to 111 markers. I found that combining 111-marker samples with 37- or 67-marker samples in MEGA gave some illogical results in the phylogenetic analysis. Therefore, I eliminated the 37- and 67-marker samples and worked only with the 65 samples that were tested to 111 markers.

Seven of the 111-marker samples were not in the Y-DNA R Haplogroup. Running the MEGA phylogenetic analysis with both R and non-R haplogroup samples included resulted in trees that had logical flaws. Also, the non-R samples were not found to have any multiple-member clades. Thus, I eliminated the data samples that were of non-R haplogroups.

The objective of the study is to examine data relative to the Coker surname. Thus, I eliminated samples from men who did not have a Coker surname direct male ancestor.
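The filters described in Methods and in this section can be sketched as a short script. The field names and demo records are hypothetical, and approximating clade membership by shared Terminal SNP is an assumption made only for this illustration; the clade assignments in the study may rest on more than that.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str             # hypothetical identifier
    markers_tested: int        # 37, 67, or 111
    haplogroup: str            # top-level haplogroup letter, e.g. "R"
    terminal_snp: str          # used here as a stand-in for clade membership
    coker_paternal_line: bool  # True if the direct male ancestor was a Coker


def prune(samples):
    """Keep 111-marker, R-haplogroup, Coker-line samples in clades of two or more."""
    kept = [
        s for s in samples
        if s.markers_tested == 111 and s.haplogroup == "R" and s.coker_paternal_line
    ]
    clade_sizes = Counter(s.terminal_snp for s in kept)
    return [s for s in kept if clade_sizes[s.terminal_snp] >= 2]


# Hypothetical demo records, not project data.
demo = [
    Sample("S1", 111, "R", "R-A2076", True),
    Sample("S2", 111, "R", "R-A2076", True),
    Sample("S3", 67, "R", "R-A2076", True),    # too few markers
    Sample("S4", 111, "I", "I-M253", True),    # non-R haplogroup
    Sample("S5", 111, "R", "R-M269", True),    # only member of its clade
    Sample("S6", 111, "R", "R-A2076", False),  # no Coker direct male ancestor
]
kept_ids = [s.sample_id for s in prune(demo)]
```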

I also eliminated six samples of Coker surname men that were not members of any multiple-member clades. These samples were all genealogically unique in the project. They would have added no value to this clade analysis.

3. Aligning the Data in MEGA. I ran various simulations in MEGA. I found that the most logical phylogenetic result was obtained when the data was aligned using MUSCLE with Neighbor Joining as the clustering method.

Alignment Procedure

E. Results.

1. Phylogenetic Analysis in MEGA. I tried various phylogenetic methods in MEGA. The results obtained using either the Neighbor-Joining or the Maximum Likelihood method appeared equivalent. In the tree shown below, the data samples are labeled with the sample ID number followed by the Haplogroup and Terminal SNP.

Phylogenetic Analysis by Neighbor-Joining

Tree Using Neighbor Joining

2. Genetic Distance and Time to Most Recent Common Ancestor (TMRCA). The program provided by Dean McGee was used for estimating the Genetic Distance and TMRCA. http://www.mymcgee.com/tools/yutility111.html

The diagonal elements of the table indicate the number of alleles available for that haplotype. The calculations can be of two types:

Hybrid Mutation Model. The target of this model is to match that used by Ysearch and FTDNA. It uses the stepwise mutation model for all alleles except DYS464 and YCA, which use the infinite allele model. The stepwise model says that each mutation changes the allele value by exactly one, so a difference of two means that two mutations occurred and a difference of three means that three mutations occurred.

Infinite Allele Mutation Model. The infinite allele model says that the entire difference between allele values, no matter how large, is the result of one mutation.

Both models were run and found to give very similar genetic distance results. The genetic distance calculations show that men within each identified clade are related, and that men within sub-clade R-A2073 are more closely related to each other than to those in clade R-A2076 who are not R-A2073.
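The difference between the mutation models can be illustrated with a small function. This is a simplified sketch, not the code of the McGee utility: each marker is treated as a single integer value (real DYS464 and YCA results are multi-copy), and the haplotype values below are hypothetical.

```python
# Markers scored with the infinite allele model under the hybrid scheme.
# Treating them as single-valued is a simplification for illustration.
INFINITE_ALLELE_MARKERS = {"DYS464", "YCA"}


def genetic_distance(hap_a, hap_b, model="hybrid"):
    """Sum per-marker distances over the markers both haplotypes share."""
    if model not in ("stepwise", "infinite", "hybrid"):
        raise ValueError(f"unknown model: {model}")
    distance = 0
    for marker in hap_a.keys() & hap_b.keys():
        diff = abs(hap_a[marker] - hap_b[marker])
        if diff == 0:
            continue
        one_step = model == "infinite" or (
            model == "hybrid" and marker in INFINITE_ALLELE_MARKERS
        )
        # Infinite allele: any difference is one mutation.
        # Stepwise: each unit of difference is one mutation.
        distance += 1 if one_step else diff
    return distance


# Hypothetical haplotypes.
hap_1 = {"DYS393": 13, "DYS390": 24, "DYS464": 15}
hap_2 = {"DYS393": 13, "DYS390": 26, "DYS464": 17}
```

For this pair the pure stepwise distance is 4, the hybrid distance is 3, and the infinite allele distance is 2. When most differences between two men are single steps, as is typical for close relatives, the models give nearly the same totals.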

| Based on Genetic Distance | 12 Markers | 25 Markers | 37 Markers | 67 Markers | 111 Markers |
|---|---|---|---|---|---|
| Very Tightly Related | N/A | N/A | 0 | 0 | 0 |
| Tightly Related | N/A | N/A | 1 | 1-2 | 1-2 |
| Related | 0 | 0-1 | 2-3 | 3-4 | 3-5 |
| Probably Related (within 15 generations) | 1 | 2 | 4 | 5-6 | 6-7 |
| Possibly Related (maybe over 15 generations) | 2 | 3 | 5 | 7 | 8-10 |
| Not Related | >2 | >3 | >5 | >7 | >10 |

Source: https://www.familytreedna.com/learn/y-dna-testing/y-str/expected-relationship-match/
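The table above can be applied mechanically as a lookup. The sketch below encodes the upper bound of each range from the table; the function and variable names are hypothetical, and categories marked N/A for a marker level are simply omitted for that level.

```python
# Upper genetic-distance bound for each category, by number of markers tested.
THRESHOLDS = {
    12: [("Related", 0), ("Probably Related", 1), ("Possibly Related", 2)],
    25: [("Related", 1), ("Probably Related", 2), ("Possibly Related", 3)],
    37: [("Very Tightly Related", 0), ("Tightly Related", 1), ("Related", 3),
         ("Probably Related", 4), ("Possibly Related", 5)],
    67: [("Very Tightly Related", 0), ("Tightly Related", 2), ("Related", 4),
         ("Probably Related", 6), ("Possibly Related", 7)],
    111: [("Very Tightly Related", 0), ("Tightly Related", 2), ("Related", 5),
          ("Probably Related", 7), ("Possibly Related", 10)],
}


def classify(markers_tested, genetic_distance):
    """Return the relationship category for a genetic distance at a marker level."""
    for label, max_distance in THRESHOLDS[markers_tested]:
        if genetic_distance <= max_distance:
            return label
    return "Not Related"
```

For example, `classify(111, 6)` falls in the "Probably Related" row, while `classify(67, 8)` falls outside every range and is reported as "Not Related".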

The above table gives, for each pair of samples, the number of generations within which the MRCA lived at 90% probability. The algorithm was taken from Bruce Walsh's paper, https://www.ncbi.nlm.nih.gov/pubmed/11404350 Values on the diagonal indicate the number of markers tested. Generations assume 30 years on average per generation. Calculations use the average mutation rate for all the markers common to the pair of haplotypes being compared. Assumed mutation rates are shown in the following table.

It is not unexpected that some samples have a larger TMRCA value and some a smaller one. This may be caused by variability of mutations and differences in average generation length between lineages. Also, the assumed mutation rates are based on population studies, and actual mutation rates for this group may be somewhat different.
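To make the 90% figure concrete, the sketch below computes an upper bound on TMRCA under stated assumptions: the infinite allele likelihood, a flat prior on the number of generations, and a single average mutation rate. It is an illustration in the spirit of the Walsh approach, not a reproduction of the McGee utility, and the function name and grid settings are hypothetical.

```python
import math


def tmrca_upper_bound(n_markers, n_mismatches, mu, prob=0.90,
                      t_max=1000.0, step=0.05):
    """Smallest t (generations) with posterior P(TMRCA <= t) >= prob."""
    if not 0 <= n_mismatches < n_markers:
        raise ValueError("need 0 <= n_mismatches < n_markers")
    n_matches = n_markers - n_mismatches
    grid = [i * step for i in range(int(t_max / step) + 1)]
    density = []
    for t in grid:
        # Chance that one marker is unchanged on both lines after t generations.
        q = math.exp(-2.0 * mu * t)
        # Unnormalized posterior under a flat prior on t.
        density.append(q ** n_matches * (1.0 - q) ** n_mismatches)
    total = sum(density)
    running = 0.0
    for t, d in zip(grid, density):
        running += d
        if running / total >= prob:
            return t
    return t_max


generations = tmrca_upper_bound(111, 0, 0.0024)
years = generations * 30  # 30 years per generation, as assumed in the text
```

Under these assumptions, even a perfect 111-marker match at a rate of 0.0024 gives a 90% bound of roughly four generations, so bounds of several generations for very close relatives are a property of the calculation as well as of the assumed rates.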

It is apparent that the TMRCA results for sub-clade R-A2073 are not highly accurate. This can be confirmed by examining known relationships of samples within the data. For example, sample 141217, the author, is brother to samples B5592 and B7315, nephew to sample 197959, and 1st cousin to B7314. The estimated TMRCA values for these relationships are 6, 7, 7, and 9 generations. The actual TMRCA values are much lower: for brothers the MRCA is their father, one generation back, and for an uncle and nephew or for paternal first cousins it is no more than two generations back. True relationships are known and confirmed by autosomal and mtDNA tests. The true relationships are known for all men in sub-clade R-A2073. Generally, the true TMRCA values are much lower in sub-clade R-A2073 than the estimated values. The TMRCA values are probably also overstated in the overall larger clade R-A2076. However, the relative magnitude of the TMRCA results appears reasonable.

This over-estimation of TMRCA values is most likely due to the assumed mutation rates used. For most of the STRs in this study, no published mutation rates are known. For these STRs a mutation rate of 0.0024 was used, as explained below. Based on the results, we can conclude that the true mutation rates for clade R-A2076 are different from the rates used here.

Mutation Rates Used

The default constant mutation rate is 0.0024 mutations/allele/generation, which represents the 60 total mutations during 24870 total allele meioses as given in Y-Chromosomal Microsatellite Mutation Rates: Differences in Mutation Rate Between and Within Loci, by B. Myhre Dupuy, M. Stenersen, T. Egeland, and B. Olaisen; Human Mutation 23:117-124 (2004). The second mutation rate selection uses the FTDNA-derived mutation rates. This includes a rate of 0.00399 for the first 12 markers, 0.00481 for markers 13 through 25, and 0.00748 for markers 26 through 37. Markers not covered by the FTDNA rates use the default rate of 0.0024.
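The arithmetic behind these rates can be checked with a short script. It assumes that "markers 13 through 25" and "26 through 37" refer to positions in the FTDNA panel order and that the pair being compared shares all markers up to the given count; the names are hypothetical.

```python
# Default rate: 60 mutations observed in 24870 allele meioses.
DEFAULT_RATE = 60 / 24870

# (number of markers in tier, rate) for the FTDNA-derived rates, in panel order.
FTDNA_TIERS = [(12, 0.00399), (13, 0.00481), (12, 0.00748)]


def average_rate(n_markers, tiers=FTDNA_TIERS, default=0.0024):
    """Average per-marker rate over the first n_markers positions."""
    total = 0.0
    remaining = n_markers
    for size, rate in tiers:
        used = min(size, remaining)
        total += used * rate
        remaining -= used
    # Markers beyond the tiers fall back to the default rate.
    total += remaining * default
    return total / n_markers


avg_37 = average_rate(37)
avg_111 = average_rate(111)
```

With these inputs the default rate works out to about 0.00241, and the average over 111 markers is about 0.0034, because 74 of the 111 markers fall back to the default rate.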

F. Conclusions.

Based on the analysis discussed above, I conclude that The Coker DNA Project has identified several clades of men with different direct male ancestors. Furthermore, these clades do not share a common direct male ancestor in genealogical time (less than 800 years). In addition, clade R-A2076 has an identified sub-clade, R-A2073. Based on genealogies, it seems likely that this sub-clade separated from the main clade about 250-400 years before present. This time estimate appears consistent with the results shown here.

To improve and further this work, I would suggest the following.
- Discover the nucleotide repeat motifs for all 111 markers. I was unable to find some of these values.
- Find better estimates of mutation rates for the clades in this study.
- Include analysis of SNP formation dates.

G. References.

Coker DNA Project. https://www.familytreedna.com/groups/Coker/dna-results

Y-DNA Comparison Utility, by Dean McGee. http://www.mymcgee.com/tools/yutility111.html

Summary List of Y Chromosome STR Loci and Available Fact Sheets. http://www.cstl.nist.gov/biotech/strbase/ystr_fact.htm

Family Tree DNA. https://www.familytreedna.com/learn/y-dna-testing/y-str/expected-relationship-match/

Y-chromosomal microsatellite mutation rates: differences in mutation rate between and within loci, by B. Myhre Dupuy, M. Stenersen, T. Egeland, and B. Olaisen; Human Mutation 23:117-124 (2004). https://www.ncbi.nlm.nih.gov/pubmed/14722915

Estimating the time to the most recent common ancestor for the Y chromosome or mitochondrial DNA for a pair of individuals, by Bruce Walsh, Ph.D. https://www.ncbi.nlm.nih.gov/pubmed/11404350