7/30/2019 Phylogenetic and Phylogeographic Ancestral Inference of the Brahmin Population in Uttar Pradesh
Phylogenetic and Phylogeographic Ancestral
Inference of the Brahmin Population in Uttar
Pradesh
Rohit Chandra
Dept. of CSE IIIT Delhi
Suhanth Boddu
Dept. of CSE, IIIT Delhi
Divyanshu Bansal
Dept. of CSE, IIIT Delhi
Abstract: India is one of the most ethnically diverse regions of the world and is considered to be a major paleolithic migration route. India is home to several autochthonous subhaplogroups of the haplogroup N, which is one of the signature haplogroups that define out of Africa migrations. Here we study the phylogeny of the Brahmin population of the north Indian state of Uttar Pradesh using the theory of coalescence, haplotype trees and the infinite alleles model on our mtDNA dataset. We use some well defined and established methods with new assumptions and modifications to suit our dataset. Based on the observed results we also discuss the possible phylogeography of the population. We estimate the time to the most recent common maternal ancestor of the population under study and also comment on the possible place of origin and migration paths of this ancestor. This study uses mitochondrial DNA control region sequence data for analysis.
Key words: Mitochondrial DNA (mtDNA), coalescent, haplogroup,
haplotype, phylogeny, Brahmin, phylogeny tree, D-Loop,
hypervariable regions (HVR), control region, single
nucleotide polymorphism (SNP).
I. INTRODUCTION
Uttar Pradesh is the largest and fifth largest state of India in
terms of population and area respectively. With a population
of 200 million it is one of the most diverse regions in terms
of mtDNA haplogroups, as is evident from the fact that we were
able to identify 25 different haplogroups in our small dataset
of 50 sequences. Roughly 12% of UP's population consists
of Brahmins, and this is the population of interest for
our study.
Mitochondrial DNA in humans contains about 16500 base
pairs and codes for 13 proteins, 22 tRNA genes and 2 rRNA
genes. Mitochondrial DNA is inherited exclusively from the
mother which makes it ideal for studying maternal lineage
far back in time. With respect to phylogeny studies a part
of the mtDNA called the D-Loop is of vital importance.
A D-loop is a DNA structure where the two strands of a
double-stranded DNA molecule are separated for a stretch
and held apart by a third strand of DNA. The D-Loop forms a
part of the non-coding control region of mtDNA and consists
of 1100 base pairs. It sits between bp 16001-574 of the
circular genome. Contained within this D-Loop are 2 highly
polymorphic regions known as hypervariable regions. HVR1
is located between bp 16001-16560 while HVR2 is located
in bp 1-574. Hypervariable regions are ideal for studying
ancestral relationships among organisms as they are highly
polymorphic. We will be focussing on HVR1 region for our
study.
Mutations in the mitochondrial DNA result in substitution of
one of the bases A, C, G or T by another one. These mutations
are of two types: transversions, which are substitutions between
a purine (A, G) and a pyrimidine (C, T), and transitions, which
are substitutions within the purines or within the pyrimidines.
With regard to ancestral inference it makes sense to study the
mutation patterns among different individuals and use these
patterns to obtain results. While making the distance matrices
for further analysis we only include the sites of mutations among
the individuals and remove the rest of the sequence to reduce
complexity. These sites are called segregation sites,
and individuals having the same mutation patterns make a
lineage. It is safe to assume that people belonging to the same
lineage based on HVR regions will have the same ancestral
history regardless of mutations in other regions of mtDNA.
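The distinction between the two mutation types can be made concrete with a short Python sketch. It is only an illustration: the function names `classify_substitution` and `count_substitutions` are hypothetical, and the sketch assumes the two sequences are already aligned and of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def classify_substitution(ref_base, alt_base):
    """Return 'transition', 'transversion', or None when the bases match."""
    pair = {ref_base.upper(), alt_base.upper()}
    if len(pair) == 1:
        return None
    # A transition stays within the purines or within the pyrimidines.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    # Any purine <-> pyrimidine change is a transversion.
    return "transversion"


def count_substitutions(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a, seq_b):
        # Skip gaps and ambiguity codes (an assumption of this sketch).
        if a.upper() not in NUCLEOTIDES or b.upper() not in NUCLEOTIDES:
            continue
        kind = classify_substitution(a, b)
        if kind is not None:
            counts[kind] += 1
    return counts
```

For example, `count_substitutions("ACGT", "GCTT")` sees one A/G change (a transition) and one G/T change (a transversion).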
II. METHODOLOGY
A. Dataset
The dataset presented here corresponds to a subset of the
data collected by Palanichamy et al. [2] from the states of Uttar
Pradesh, West Bengal and Andhra Pradesh. From this dataset
47 sequences were identified that pertained to the Brahmin
population of Uttar Pradesh. The dataset is publicly available
on the NCBI website with the accession numbers AY713976
- AY714050. The infinitely many sites model [1] assumes each
polymorphism to be a unique mutation. Since the HVR regions
of the individuals varied in extent, we took the region lying
between bp 16024 - 16569 in order to maintain uniformity of the
data. This gave a constant length of 546 bases for all the sequences.
It was ensured that no mutation site was left unaccounted for
by this adjustment. These filtered sequences were then aligned
using ClustalW multiple sequence alignment in order to get the
segregation sites. The same was done for the HVR2 sequences
as well. Sequences differing by at most 2 transitions
were grouped together under a lineage.
However, sequences differing by even a single transversion
were considered distinct. 36 lineages were identified and this
narrowed down the data to 36 sequences of 546 base pairs.
After the identification of the segregation sites the sequences
were aligned to the revised Cambridge reference sequence
(rCRS) to get the mutation patterns. These patterns were
graphed and 65 different site patterns emerged. These 36
sequences of 65 letters each made the final version of the
dataset. The mtDNA haplogroups in the dataset are H, W, I,
J3, R, T2, K1, U3, U7, HV, V and their subhaplogroups. The
data is further divided into Bhargava, Chaturvedi and other
Brahmins on the basis of surnames.
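The lineage rule described in this subsection (at most 2 transition differences, no transversion differences) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure actually used for the dataset: the greedy comparison against the first member of each lineage, and the names `substitution_counts` and `group_lineages`, are illustrative choices, and the sequences are assumed to be aligned and to contain only A, C, G and T.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_counts(seq_a, seq_b):
    """Return (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


def group_lineages(sequences, max_transitions=2):
    """Greedily group sequences; the first member represents each lineage.

    sequences: dict mapping a sequence name to its aligned sequence.
    Returns a list of lineages, each a list of sequence names.
    """
    lineages = []
    for name, seq in sequences.items():
        for lineage in lineages:
            representative = sequences[lineage[0]]
            ts, tv = substitution_counts(seq, representative)
            # Join only when there is no transversion and few transitions.
            if tv == 0 and ts <= max_transitions:
                lineage.append(name)
                break
        else:
            lineages.append([name])
    return lineages
```

Because the grouping is greedy, the result can depend on the order in which sequences are supplied; a full treatment would need to decide how to handle a sequence that is close to two different lineages.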
B. Making sense of the data - The Tree
The Analyses of Phylogenetics and Evolution (APE) package [3]
in R provides convenient methods for creating and
analysing genealogical trees. Phangorn and Phyclust packages
are useful in creating distance matrices and unrooted trees.
Using the unrooted tree as reference various rooted trees can
be created and analysed. In order to arrive at one tree for
analysis we select the most parsimonious construction for the
unrooted tree.
Fig. 1. Unrooted genealogical tree for the mtDNA data. Label 21 refers to sequence AY714021.
The sequence data is used for creating a Euclidean distance
matrix representing the measure of genetic distance between
pairs of sequences. The UPGMA (Unweighted Pair Group
Method with Arithmetic Mean) method clusters the two closest
sequences, then computes a new distance matrix in which the
distance from the new cluster to every other cluster is the
arithmetic mean of the member distances. This process is repeated
until all the sequences are grouped. A related distance-based
method, the neighbour joining method [5], pulls out a
pair of sequences at each iteration so that the total length of
the branches on the tree is minimized. After a pair of nodes
is pulled out, it forms a cluster in the tree and is included in
further rounds of iteration. Here also a new distance matrix is
generated at each iteration. These methods help us in arriving
at the tree with optimal parsimony which can be used for
MRCA studies.
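The UPGMA clustering loop described above can be sketched in a few lines of Python. The study itself relies on R packages for this step, so the function `upgma` below is a hypothetical illustration only; it returns the tree topology as nested tuples and does not compute branch lengths.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Cluster labels with UPGMA and return the topology as nested tuples.

    labels: list of sequence names.
    matrix: symmetric list of lists with pairwise distances.
    """
    # Each active cluster id maps to (subtree, number of leaves).
    clusters = {i: (label, 1) for i, label in enumerate(labels)}
    dist = {
        frozenset((i, j)): matrix[i][j]
        for i, j in combinations(range(len(labels)), 2)
    }
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        i, j = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: dist[frozenset(pair)],
        )
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance to the merged cluster is the size-weighted mean,
        # which equals the arithmetic mean over all member pairs.
        for k in clusters:
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree
```

With four sequences where A and B are closest, C is next and D is the most distant, the returned topology nests `("A", "B")` innermost and joins D last.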
C. The Most Recent Common Ancestor
The infinite alleles model suggests that each mutation
creates a new allele and that allele types are all equally different
from each other. This model can be applied while calculating
the time to most recent common ancestor for two individuals
(two genes to be accurate). This model takes into account
mutation rate, time, sample size and the number of matching
markers [14]. Two sequences differing by a single mutation
will have a match on 1050 markers [546 of HVR1 and 504
of HVR2] for our dataset. The chi-squared test comes in
handy here as it can be used to verify the tmrca and get
a confidence interval for our estimates. The probability that
two lineages coalesce in the immediately preceding generation
is the probability that they share a parental DNA sequence.
In a diploid population with a constant effective population
size with $2N_e$ copies of each locus, there are $2N_e$ potential
parents in the previous generation, so the probability that
two alleles share a parent is $1/(2N_e)$ and correspondingly, the
probability that they do not coalesce is $1 - 1/(2N_e)$. The time
to coalescence is therefore geometrically distributed, and the
probability that coalescence occurs exactly $t$ generations back
is given by the formula
$$P(t) = \left(1 - \frac{1}{2N_e}\right)^{t-1} \left(\frac{1}{2N_e}\right)$$
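The geometric coalescence probability above is easy to evaluate numerically. The following Python sketch is illustrative only; the function names are hypothetical and the effective population size passed in is an arbitrary example value, not an estimate for this dataset.

```python
def coalescence_probability(t, effective_size):
    """Probability that two lineages coalesce exactly t generations back.

    Implements P(t) = (1 - 1/(2Ne))^(t-1) * (1/(2Ne)).
    """
    if t < 1:
        raise ValueError("t must be a positive number of generations")
    p = 1.0 / (2 * effective_size)
    return (1 - p) ** (t - 1) * p


def expected_coalescence_time(effective_size):
    """Mean of the geometric distribution: 2Ne generations."""
    return 2 * effective_size


if __name__ == "__main__":
    ne = 50  # arbitrary example value
    for t in (1, 10, 100):
        print(t, coalescence_probability(t, ne))
    print("expected generations:", expected_coalescence_time(ne))
```

The mean of this distribution is $2N_e$ generations, which is why pairwise coalescence times scale with the effective population size.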
Ewens sampling formula gives us an estimate of how many
different alleles are observed a given number of times in the
sample. This can come in handy while trying to gain an
estimate of the tmrca in different regions of the cladogram.
We can take the substitutions in different branches to follow
a Poisson process with a rate of $\theta/2$ at each branch. While
scoring for the tmrca we give a higher score to a transversion
and a lower one to a transition, since a transversion signifies a
greater mutation while a transition means a closer ancestral
relationship. The p-value obtained by the chi-squared
test helps in getting a confidence interval estimate for our
tmrca calculations.
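For reference, the standard form of the Ewens sampling formula can be evaluated directly. The sketch below is an illustration in Python under the usual infinite alleles assumptions; the function names are hypothetical, and the value of the scaled mutation parameter `theta` must be supplied by the user since none is reported here.

```python
from math import factorial


def rising_factorial(x, n):
    """Compute x * (x + 1) * ... * (x + n - 1)."""
    result = 1.0
    for i in range(n):
        result *= x + i
    return result


def ewens_probability(allele_counts, theta):
    """Probability of an allele configuration under the Ewens formula.

    allele_counts[j - 1] is the number of allele types that are
    observed exactly j times in the sample.
    """
    # Sample size implied by the configuration.
    n = sum(j * a for j, a in enumerate(allele_counts, start=1))
    prob = factorial(n) / rising_factorial(theta, n)
    for j, a in enumerate(allele_counts, start=1):
        prob *= theta ** a / (j ** a * factorial(a))
    return prob
```

For a sample of two genes, `ewens_probability([0, 1], theta)` is the chance that both carry the same allele, which equals 1/(1 + theta).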
Parsons et al. [4] suggest methods for mapping substitutions
and suggest a mutation rate value of $3 \times 10^{-5}$ per nucleotide
transmission event for D-loop HVR regions. Two of these
transmission events correspond to one generation and this is
the value used for getting the time to most recent common
ancestor. For our study we have fixed a period of 15 years
for one transmission event. As we will see this gives a
50% confidence interval on our calculations. The average mtDNA
haplotype mutation rate of $3 \times 10^{-5}$ gives us the period
for which a region of the control region remains unchanged:
$3 \times 10^{-5} \times 1050 = 0.0315$ per transmission event. This
suggests that a combined HVR1 + HVR2 region of mtDNA
can survive for 32 generations or 960 years (based on our
fixed transmission event time). A similar estimate for HVR1
region gives a figure of 62 generations. Thus the common
female ancestor (MRCA) of two people who randomly match
exactly for the combined HVR1+HVR2 mtDNA haplotype can
go back to a period of almost 1000 years.
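The arithmetic behind these survival estimates can be reproduced with a short Python sketch. The constant and function names are hypothetical; the inputs are the rate and marker counts quoted above, and the reciprocal of the per-event rate is taken as the expected number of events before a change, which is the reading this subsection appears to use.

```python
MUTATION_RATE = 3e-5  # per nucleotide per transmission event, from [4]
HVR1_SITES = 546
HVR2_SITES = 504


def haplotype_rate(n_sites, rate=MUTATION_RATE):
    """Mutation rate of the whole haplotype per transmission event."""
    return n_sites * rate


def expected_events_unchanged(n_sites, rate=MUTATION_RATE):
    """Expected number of events before the haplotype changes."""
    return 1.0 / haplotype_rate(n_sites, rate)


if __name__ == "__main__":
    combined = HVR1_SITES + HVR2_SITES  # 1050 markers
    print("combined rate:", haplotype_rate(combined))
    print("combined survival:", expected_events_unchanged(combined))
    print("HVR1 survival:", expected_events_unchanged(HVR1_SITES))
```

The combined figure comes out just under 32 and the HVR1 figure just over 61, close to the rounded values of 32 and 62 used in the text.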
D. Resources
R Packages - APE [3], Phangorn, Phyclust, rgl, igraph
ClustalW multiple sequence alignment
Python
Snapgene Viewer
NCBI genome browser
mtDB Human Mitochondrial Genome Database [13]
(www.mtdb.igp.uu.se)
III. RESULTS AND DISCUSSION
With the unrooted tree for the 36 sequences we can arrive
at multiple rooted graphs. However, as described above, we
prefer the most parsimonious reconstruction and use distance
based neighbour joining to create the tree. The first step towards
this tree is performing the clustering, whose result can be fed as
input for the optimal parsimonious reconstruction.
Fig. 2. Cluster dendrogram for the dataset
The MPR function [10] can now provide the
phylogeny tree with optimal parsimony, which facilitates
our MRCA analysis. We magnify subregions in this tree to
perform the MRCA calculations. To exemplify the method
we take the sequences lying on the two extreme
ends of the tree and perform the MRCA analysis. The tree gives
us the sequence AY714025 to be the closest relative of the
common ancestor of the individuals included in the dataset.
As we move rightwards the number of mutations increases,
but stronger intra-region relationships are observed,
suggesting similar mutation patterns among individuals. Since
we are analysing a group of autochthonous haplogroups
this observation also suggests common geographical origins
and migrations. We know haplogroup L3 in Africa to be
the ancestor of the macrohaplogroups M and N, which
further branched out into J, K, I, W, R and U in Europe
and South-East Asia about 40000-70000 years ago. In the
TMRCA calculation a perfect match on all the markers
suggests that the common maternal ancestor of these
individuals may go back as far as 32 generations. Since our dataset
with 65 segregation sites has the mutation difference between
sequences limited to at most 8, we can rule out the origin of
the most recent common ancestor outside of the subcontinent.
The sequence AY714025, which makes the leftmost branch
of the tree, belongs to the mtDNA haplogroup U2, which is
found in Punjab in Pakistan in addition to the northern and
eastern states of India. As the maximum number of transmission
events to the most recent common ancestor in our study reaches a
figure of 500, we can safely assume the origins of the MRCA
to be in one of the above mentioned regions. For the MRCA
calculations the markers include mutations in the HVR2 region
in addition to the HVR1 mutations.
Fig. 3. Optimal parsimony tree for the dataset. Optimal parsimonious reconstruction returns a parsimony value of 107.
The data obtained from the right branch of the tree pertains
to the haplogroups U7, K1, K8, H13, HV2 and V. This dataset
has 11 segregation sites. A distance matrix for all
the sequences pertaining to these haplogroups was constructed
using the K80 model [15], which takes into account two kinds
of mutations (transitions and transversions) and allots separate
probabilities to them, hence giving a better bound than the
default Jukes and Cantor model.
Fig. 4. Distance matrix created using the K80 model.
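The K80 (Kimura two-parameter) distance used for this matrix has a closed form in terms of the proportions of transitions (P) and transversions (Q) between two aligned sequences: d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The Python sketch below illustrates that formula; the study computed its matrix in R, so the function name `k80_distance` and the handling of gaps here are assumptions made for illustration.

```python
from math import log

TRANSITION_PAIRS = ({"A", "G"}, {"C", "T"})


def k80_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip gaps and ambiguity codes (an assumption of this sketch).
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} in TRANSITION_PAIRS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    # math.log raises ValueError when the sequences are too divergent
    # for the model (1 - 2p - q <= 0 or 1 - 2q <= 0).
    return -0.5 * log(1 - 2 * p - q) - 0.25 * log(1 - 2 * q)
```

Applying `k80_distance` to every pair of sequences fills a symmetric matrix of the kind shown in Fig. 4.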
A normal distribution is plotted to gain an estimate of
the transmission events between 2 sequences by taking
the sequences 2 at a time from each branch of the tree.
As described above, 2 transmission events are taken to be
one generation and each transmission event adds
15 years to the time to the most recent common ancestor.
Fig. 5 shows the normal distribution obtained by taking
the sequences of a Brahmin of haplogroup K1 lying at the
rightmost edge of the tree and a Chaturvedi of haplogroup
V, corresponding to accession numbers AY714004 and
AY713979 respectively. The normal distribution does not
show any rise until reaching the value of 87. This is expected,
since the calculations performed on our model showed that
the control region can survive up to 62 transmission events
without any mutations. The normal distribution reaches a
maximum at 472, suggesting a gap of 236 generations between
the two individuals and their most recent common ancestor.
A similar calculation done for the whole tree returns
a result of 536 transmission events for the MRCA. Taking
into account our assumption of 15 years per transmission
event, this returns a value of 8040 years. This enables us to
conclude with a 50% confidence interval that the most
recent common maternal ancestor of the population under study
existed about 8000 years ago. It has been established that
the split from haplogroup L3 and expansion of haplogroups
M and N started occurring about 40000-70000 years ago.
Among the haplogroups studied, haplogroup R has the oldest
origins, going back to 65,000 years, while T is the youngest;
it is estimated to have lived in the region around the
Mediterranean Sea around 17,000 years ago, which suggests
a western migration from the subcontinent through the Middle
East [11]. Keeping these figures in mind it is highly unlikely
that the MRCA of the population under study, with origins
going back to 8000 years, came from outside the subcontinent.
Fig. 5. The normal distribution for estimation of TMRCA of AY714004 and AY713979
Fig. 6. Magnified sub-portion of the rightmost branch depicting mutations and the number of transmission events.
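The conversion from transmission events to generations and years used in this section is simple arithmetic, sketched here in Python. The constant and function names are hypothetical; the values of 15 years per transmission event and 2 events per generation are the assumptions stated in the methodology.

```python
YEARS_PER_TRANSMISSION_EVENT = 15
EVENTS_PER_GENERATION = 2


def events_to_generations(events):
    """Convert a count of transmission events to generations."""
    return events / EVENTS_PER_GENERATION


def events_to_years(events):
    """Convert a count of transmission events to years."""
    return events * YEARS_PER_TRANSMISSION_EVENT


if __name__ == "__main__":
    # Peak of the pairwise distribution for AY714004 and AY713979.
    print(events_to_generations(472))  # 236.0
    # Estimate for the whole tree.
    print(events_to_years(536))  # 8040
```

Changing either assumed constant rescales every date in the results proportionally, which is worth keeping in mind when comparing the 8000 year figure with other studies.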
It is important to note that most of the studied haplogroups
are subgroups of macrohaplogroup N, which is found scarcely
in the Dravidian and North Eastern populations in India.
Haplogroup M is the dominant haplogroup among these
populations. Out of Africa migrations suggest the origins
of the split of haplogroup N to be in the North Western part
of the Indian subcontinent. Mitochondrial DNA migration
patterns suggest that the descendants of haplogroup N had
expanded into Europe and developed new subhaplogroups
about 8000-17000 years ago [11]. So we can conclude that the
most recent maternal ancestor of the population under study
originated not more than 8000 years ago in North-Western or
Central India.
The descendants of haplogroup R in India, namely the
haplogroups J, T and group R0, show high haplogroup
diversity, suggesting their autochthonous status [6]. South Asia
lies in the way of the earliest dispersals out of Africa and is
vital for phylogeography studies in the Western European
population [11]. Although the spread of haplogroup R is
very wide, it is unlikely that any subhaplogroup in Europe or
Central Asia (migration routes out of India) has descended
from the most recent common ancestor of the population
under study.
IV. FUTURE WORK
The present study has been undertaken using the infinite
alleles model on only a subset of the sequences sequenced
by Palanichamy et al. from Uttar Pradesh, Andhra Pradesh,
West Bengal and North Eastern States. We would like to
undertake a similar study on the whole dataset using the
stepwise mutation model, which tries to better account for
the actual mutational process that occurs at microsatellite
markers by scoring the marker lengths. The stepwise mutation
model looks at the frequency spectrum of the mismatches,
namely how many loci show no mismatches, 1 mismatch, 2
mismatches and so on. We would focus on the relationships
between the different autochthonous haplogroups and
subhaplogroups in different parts of India. We also want to
study the subcontinent specific phylogeography and migration
patterns of the autochthonous subhaplogroups.
During the course of the project we developed a Python
program for identifying the mutation sites in the dataset.
It takes as input a file containing sequences in PHYLIP or
FASTA format, loads the data into a matrix and searches for
dissimilar bases at each index. The indexes are numbered
16024-16569 for the HVR1 region. It aligns the mutation sites to
the revised Cambridge Reference Sequence (rCRS) and returns the
number and indexes of the segregation sites as its output. We
plan to come up with a GUI based version of this program.
We plan to call it MutaSeg.
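The core of such a program can be sketched in Python as follows. This is not the MutaSeg code itself, which is not reproduced here: the functions `parse_fasta` and `segregating_sites` are hypothetical, only FASTA input is handled, and the comparison against the rCRS is left out. The sequences are assumed to be aligned and trimmed so that the first column corresponds to position 16024.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def segregating_sites(sequences, first_position=16024):
    """Return genome positions where the aligned sequences differ."""
    seqs = list(sequences.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return [
        first_position + i
        for i in range(length)
        if len({seq[i] for seq in seqs}) > 1
    ]


if __name__ == "__main__":
    example = ">a\nACGT\n>b\nACTT\n>c\nGCGT\n"
    sites = segregating_sites(parse_fasta(example))
    print(len(sites), sites)
```

Keeping the parsing and the site search in separate functions would let a PHYLIP reader or a GUI front end reuse the same site-finding logic.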
REFERENCES
[1] Griffiths, R. C., and Simon Tavare. Ancestral inference in population genetics. Statistical Science (1994): 307-319.
[2] Palanichamy, et al. Phylogeny of Mitochondrial DNA Macrohaplogroup N in India, Based on Complete Sequencing: Implications for the Peopling of South Asia. The American Journal of Human Genetics, 75 (2004), 966-978.
[3] Paradis E., Claude J. and Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289-290.
[4] Parsons, Thomas J., et al. A high observed substitution rate in the human mitochondrial DNA control region. Nature genetics 15.4 (1997): 363-368.
[5] Saitou, Naruya, and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution 4.4 (1987): 406-425.
[6] Karmin, Monika. Human mitochondrial DNA haplogroup R in India: dissecting the phylogenetic tree of South Asian-specific lineages. Diss. 2005.
[7] Rosenberg, Noah A., and Magnus Nordborg. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics 3.5 (2002): 380-390.
[8] van Oven M, Kayser M. 2009. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat 30(2): E386-E394.
[9] Metspalu, Mait, et al. Most of the extant mtDNA boundaries in south and southwest Asia were likely shaped during the initial settlement of Eurasia by anatomically modern humans. BMC genetics 5.1 (2004): 26.
[10] Narushima, H. and Hanazawa, M. A more efficient algorithm for MPR problems in phylogeny. Discrete Applied Mathematics, 80 (1997), 231-238.
[11] Richards, Martin B., et al. Phylogeography of mitochondrial DNA in western Europe. Annals of human genetics 62.3 (1998): 241-260.
[12] Maji, Suvendu, S. Krithika, and T. S. Vasulu. Phylogeographic distribution of mitochondrial DNA macrohaplogroup M in India. Journal of genetics 88.1 (2009): 127-139.
[13] Ingman, M. and Gyllensten, U. Human Mitochondrial Genome Database, a resource for population genetics and medical sciences. Nucleic Acids Res 34, D749-D751 (2006).
[14] Walsh, Bruce. Estimating the time to the most recent common ancestor for the Y chromosome or mitochondrial DNA for a pair of individuals. Genetics 158.2 (2001): 897-912.
[15] Kimura, Motoo. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of molecular evolution 16.2 (1980): 111-120.