Phylogenetic and Phylogeographic Ancestral Inference of the Brahmin Population in Uttar Pradesh

7/30/2019 Phylogenetic and Phylogeographic Ancestral Inference of the Brahmin Population in Uttar Pradesh


Rohit Chandra

Dept. of CSE, IIIT Delhi

[email protected]

Suhanth Boddu

Dept. of CSE, IIIT Delhi

[email protected]

Divyanshu Bansal

Dept. of CSE, IIIT Delhi

[email protected]

Abstract: India is one of the most ethnically diverse regions of the world and is considered to have been a major Paleolithic migration route. India is home to several autochthonous subhaplogroups of haplogroup N, one of the signature haplogroups that define the out-of-Africa migrations. Here we study the phylogeny of the Brahmin population of the north Indian state of Uttar Pradesh using coalescent theory, haplotype trees and the infinite alleles model on our mtDNA dataset. We use well-established methods, with new assumptions and modifications to suit our dataset. Based on the observed results we also discuss the possible phylogeography of the population. We estimate the time to the most recent common maternal ancestor of the population under study and comment on the possible place of origin and migration paths of this ancestor. This study uses mitochondrial DNA control region sequence data for analysis.

Key words: Mitochondrial DNA (mtDNA), coalescent, haplogroup, haplotype, phylogeny, Brahmin, phylogeny tree, D-Loop, hypervariable regions (HVR), control region, single nucleotide polymorphism (SNP).

I. INTRODUCTION

Uttar Pradesh is the largest and fifth largest state of India in terms of population and area respectively. With a population of 200 million it is one of the most diverse regions in terms of mtDNA haplogroups, as is evident from the fact that we were able to identify 25 different haplogroups in our small dataset of 50 sequences. Roughly 12% of UP's population consists of Brahmins, and this is the population of interest for our study.

Mitochondrial DNA in humans contains about 16500 base pairs and encodes 13 proteins, 22 tRNAs and 2 rRNAs. Mitochondrial DNA is inherited exclusively from the mother, which makes it ideal for studying maternal lineage far back in time. With respect to phylogeny studies, a part of the mtDNA called the D-Loop is of vital importance. A D-loop is a DNA structure where the two strands of a double-stranded DNA molecule are separated for a stretch and held apart by a third strand of DNA. The D-Loop forms a part of the non-coding control region of mtDNA and consists of about 1100 base pairs. It spans bp 16001-574 of the circular genome, wrapping around the origin of the numbering. Contained within this D-Loop are 2 highly polymorphic regions known as hypervariable regions. HVR1 is located between bp 16001-16560 while HVR2 is located in bp 1-574. Hypervariable regions are ideal for studying ancestral relationships among organisms as they are highly polymorphic. We focus on the HVR1 region for our study.

Mutations in the mitochondrial DNA result in the substitution of one of the bases A, C, G or T by another. These mutations are of two types: transversions, which are substitutions between a purine (A, G) and a pyrimidine (C, T), and transitions, which are substitutions within the purines or within the pyrimidines. For ancestral inference it makes sense to study the mutation patterns among different individuals and use these patterns to obtain results. While making the distance matrices for further analysis we include only the sites at which the individuals differ and remove the rest of the sequence to reduce complexity. These sites are called segregation sites, and individuals having the same mutation patterns make up a lineage. We assume that people belonging to the same lineage based on HVR regions will have the same ancestral history regardless of mutations in other regions of mtDNA.
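The distinction between transitions and transversions can be made concrete with a short sketch. The function names `classify_substitution` and `count_substitutions` are illustrative assumptions and are not taken from the program used in this study; the sketch assumes the two sequences are already aligned and skips any position that is not one of A, C, G or T.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref_base, alt_base):
    """Return 'transition', 'transversion', or None if the bases are identical."""
    ref_base, alt_base = ref_base.upper(), alt_base.upper()
    if ref_base == alt_base:
        return None
    pair = {ref_base, alt_base}
    # Both bases in the same chemical class means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_substitutions(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in BASES or b not in BASES:
            continue  # assumption: gaps and ambiguity codes are ignored
        kind = classify_substitution(a, b)
        if kind is not None:
            counts[kind] += 1
    return counts
```

For example, comparing `ACGT` with `GCTT` finds one transition (A to G) and one transversion (G to T).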

II. METHODOLOGY

A. Dataset

The dataset presented here corresponds to a subset of the data collected by Palanichamy et al. [2] from the states of Uttar Pradesh, West Bengal and Andhra Pradesh. From this dataset 47 sequences were identified that pertained to the Brahmin population of Uttar Pradesh. The dataset is publicly available on the NCBI website with the accession numbers AY713976-AY714050. The infinitely many sites model [1] assumes each polymorphism to be a unique mutation. Since the extent of the HVR regions varied among individuals, we took the region lying between bp 16024-16569 in order to maintain uniformity of the data. This gave a constant length of 546 bases for all the sequences. It was ensured that no mutation site was left unaccounted for by this adjustment. These filtered sequences were then aligned using ClustalW multiple sequence alignment in order to get the segregation sites. The same was done for the HVR2 sequences as well. Sequences differing by at most 2 transitions were grouped together under a lineage. However, sequences differing by even a single transversion were considered distinct. 36 lineages were identified and this



narrowed down the data to 36 sequences of 546 base pairs. After the identification of the segregation sites the sequences were aligned to the revised Cambridge Reference Sequence (rCRS) to get the mutation patterns. These patterns were graphed and 65 different site patterns emerged. These 36 sequences of 65 letters each made up the final version of the dataset. The mtDNA haplogroups in the dataset are H, W, I, J3, R, T2, K1, U3, U7, HV, V and their subhaplogroups. The data is further divided into Bhargava, Chaturvedi and other Brahmins on the basis of surnames.
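The lineage rule described above (at most 2 transition differences, no transversion differences) can be sketched as follows. The text does not state how groups were formed when several sequences are close to one another, so this sketch assumes a simple greedy scheme that compares each sequence with the first member of each existing lineage; `diff_counts` and `group_lineages` are hypothetical names, and the input is assumed to contain only the bases A, C, G and T.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def diff_counts(seq_a, seq_b):
    """Return (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


def group_lineages(sequences, max_transitions=2):
    """Greedily group sequences (dict of id -> aligned sequence) into lineages.

    Assumption: a sequence joins the first lineage whose representative
    (its first member) differs by no transversion and by at most
    `max_transitions` transitions; otherwise it starts a new lineage.
    """
    lineages = []
    for seq_id, seq in sequences.items():
        for members in lineages:
            ts, tv = diff_counts(seq, sequences[members[0]])
            if tv == 0 and ts <= max_transitions:
                members.append(seq_id)
                break
        else:
            lineages.append([seq_id])
    return lineages
```

A different grouping scheme (for instance single linkage over all members) could give a different number of lineages, so the count of 36 reported here should not be expected to follow from this sketch alone.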

B. Making sense of the data - The Tree

The Analyses of Phylogenetics and Evolution (APE) package [3] in R provides convenient methods for creating and analysing genealogical trees. The Phangorn and Phyclust packages are useful for creating distance matrices and unrooted trees. Using the unrooted tree as a reference, various rooted trees can be created and analysed. In order to arrive at one tree for analysis we select the most parsimonious construction for the unrooted tree.

Fig. 1. Unrooted genealogical tree for the mtDNA data

21 refers to sequence AY714021

The sequence data is used to create a Euclidean distance matrix representing the genetic distance between pairs of sequences. The UPGMA (unweighted pair group method with arithmetic mean) method clusters the two closest sequences and then computes a new distance matrix in which the distance from the new cluster to every other sequence is the arithmetic mean of the distances to its members. This process is repeated until all the sequences are grouped. A variation on this clustering approach, the neighbour joining method [5], pulls out a pair of sequences at each iteration so that the total length of the branches on the tree is minimized. After a pair of nodes is pulled out, it forms a cluster in the tree and is included in further rounds of iteration. Here also a new distance matrix is generated at each iteration. These methods help us arrive at the tree with optimal parsimony, which can be used for MRCA studies.
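The UPGMA procedure described above can be illustrated with a small, self-contained sketch. It is not the APE or Phangorn implementation used for the figures; the function name `upgma` and the nested-tuple tree representation are assumptions made for illustration only.

```python
def upgma(labels, matrix):
    """Cluster labels with UPGMA given a symmetric distance matrix.

    Returns (tree, merges): tree is a nested tuple of labels, and merges
    lists (left, right, height) for each join in the order performed.
    """
    clusters = [(label, 1) for label in labels]  # (subtree, number of leaves)
    dist = [row[:] for row in matrix]            # working copy
    merges = []
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of current clusters.
        i, j = min(
            ((a, b) for a in range(n) for b in range(a + 1, n)),
            key=lambda ab: dist[ab[0]][ab[1]],
        )
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        merges.append((tree_i, tree_j, dist[i][j] / 2.0))
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted arithmetic mean of the two old distances.
        new_row = [
            (size_i * dist[i][k] + size_j * dist[j][k]) / (size_i + size_j)
            for k in keep
        ]
        dist = [[dist[a][b] for b in keep] for a in keep]
        for row, value in zip(dist, new_row):
            row.append(value)
        dist.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep]
        clusters.append(((tree_i, tree_j), size_i + size_j))
    return clusters[0][0], merges
```

With three sequences A, B and C where A and B are 2 apart and each is 6 from C, the sketch first joins A and B at height 1 and then joins that cluster with C at height 3.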

C. The Most Recent Common Ancestor

The infinite alleles model assumes that each mutation creates a new allele and that all allele types are equally different from each other. This model can be applied while calculating the time to the most recent common ancestor (TMRCA) for two individuals (two genes, to be accurate). It takes into account the mutation rate, time, sample size and the number of matching markers [14]. Two sequences differing by a single mutation will have a match on 1050 markers (546 of HVR1 and 504 of HVR2) for our dataset. The chi-squared test comes in handy here as it can be used to verify the TMRCA and get a confidence interval for our estimates. The probability that two lineages coalesce in the immediately preceding generation is the probability that they share a parental DNA sequence. In a diploid population with a constant effective population size $N_e$ there are $2N_e$ copies of each locus and hence $2N_e$ potential parents in the previous generation, so the probability that two alleles share a parent is $1/(2N_e)$ and, correspondingly, the probability that they do not coalesce is $1 - 1/(2N_e)$. The number of generations back to coalescence is therefore geometrically distributed, and the probability that coalescence occurs exactly $t$ generations back is given by the formula

$$P(t) = \left(1 - \frac{1}{2N_e}\right)^{t-1} \frac{1}{2N_e}$$
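The geometric form of the coalescence probability can be checked numerically. The sketch below is illustrative only; the function names and the example value of the effective population size are assumptions and do not come from the dataset.

```python
def coalescence_probability(t, ne):
    """Probability that two lineages first coalesce exactly t generations back.

    ne is the effective population size, so there are 2 * ne gene copies.
    """
    if t < 1 or ne <= 0:
        raise ValueError("t must be >= 1 and ne must be positive")
    p = 1.0 / (2 * ne)                    # chance of sharing a parent in one generation
    return (1.0 - p) ** (t - 1) * p       # t - 1 failures followed by one success


def expected_coalescence_time(ne):
    """Mean of the geometric distribution with success probability 1 / (2 * ne)."""
    return 2.0 * ne


# Hypothetical example: ne = 50 gives 100 gene copies.
total = sum(coalescence_probability(t, 50) for t in range(1, 5001))
```

Summing the probabilities over a long enough span of generations gives a total very close to 1, and the expected waiting time is $2N_e$ generations.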

Ewens' sampling formula gives us an estimate of how many different alleles are observed a given number of times in the sample. This can come in handy while trying to estimate the TMRCA in different regions of the cladogram. We can take the substitutions in different branches to follow a Poisson process with rate $\theta/2$ on each branch, where $\theta$ is the scaled mutation rate. While scoring for the TMRCA we give a higher score to a transversion and a lower one to a transition, since a transversion signifies a greater change and therefore a more distant ancestral relationship. The p-value obtained by the chi-squared test helps in getting a confidence interval estimate for our TMRCA calculations.
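One summary that follows from Ewens' sampling formula is the expected number of distinct alleles in a sample of size $n$, which is $\sum_{i=0}^{n-1} \theta/(\theta+i)$. The sketch below computes it; the function name `expected_allele_count` and the example values of $\theta$ are assumptions chosen for illustration, not estimates from our data.

```python
def expected_allele_count(sample_size, theta):
    """Expected number of distinct alleles under the Ewens sampling formula.

    theta is the scaled mutation rate (hypothetical value supplied by the caller).
    """
    if sample_size < 1 or theta <= 0:
        raise ValueError("sample_size must be >= 1 and theta must be positive")
    # The i-th sampled gene (counting from 0) is a new allele with
    # probability theta / (theta + i).
    return sum(theta / (theta + i) for i in range(sample_size))
```

For a fixed sample size the expected count grows with $\theta$, so a sample with many distinct lineages points towards a larger scaled mutation rate.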

Parsons et al. [4] suggest methods for mapping substitutions and suggest a mutation rate of $3 \times 10^{-5}$ per nucleotide transmission event for the D-loop HVR regions. Two of these transmission events correspond to one generation, and this is the value used for getting the time to the most recent common ancestor. For our study we have fixed a period of 15 years for one transmission event. As we will see, this gives a 50% confidence interval on our calculations. The average mtDNA haplotype mutation rate of $3 \times 10^{-5}$ gives us the period for which a region of the control region remains unchanged: $3 \times 10^{-5} \times 1050 = 0.0315$ per transmission event. This suggests that a combined HVR1 + HVR2 region of mtDNA can survive unchanged for about $1/0.0315 \approx 32$ generations or 960 years (based on our



fixed transmission event time). A similar estimate for the HVR1 region gives a figure of 62 generations. Thus the common female ancestor (MRCA) of two people who randomly match exactly for the combined HVR1+HVR2 mtDNA haplotype can go back to a period of almost 1000 years.
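The arithmetic behind these figures can be reproduced with a few lines. The sketch follows the text in reading the reciprocal of the regional rate as a number of generations and in taking one generation as two transmission events of 15 years each; the constant and function names are assumptions for illustration.

```python
MUTATION_RATE_PER_SITE = 3e-5          # Parsons et al. [4], as quoted in the text
HVR1_SITES = 546
HVR2_SITES = 504
YEARS_PER_TRANSMISSION_EVENT = 15      # fixed by assumption in this study
TRANSMISSION_EVENTS_PER_GENERATION = 2


def region_rate(n_sites, rate_per_site=MUTATION_RATE_PER_SITE):
    """Mutation rate for a whole region of n_sites markers."""
    return rate_per_site * n_sites


def expected_unchanged_generations(n_sites, rate_per_site=MUTATION_RATE_PER_SITE):
    """Reciprocal of the regional rate, read as generations without change."""
    return 1.0 / region_rate(n_sites, rate_per_site)


def generations_to_years(generations):
    """Convert generations to years under the 2 x 15-year assumption."""
    return (generations
            * TRANSMISSION_EVENTS_PER_GENERATION
            * YEARS_PER_TRANSMISSION_EVENT)


combined_sites = HVR1_SITES + HVR2_SITES
combined_generations = expected_unchanged_generations(combined_sites)
hvr1_generations = expected_unchanged_generations(HVR1_SITES)
```

Under these assumptions the combined region gives roughly 31.7 generations, which rounds to the 32 generations (960 years) quoted above. The same calculation for HVR1 alone gives a value slightly above 61, close to the figure of 62 quoted in the text.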

D. Resources

R Packages - APE [3], Phangorn, Phyclust, rgl, igraph

ClustalW multiple sequence alignment

Python

Snapgene Viewer

NCBI genome browser

mtDB Human Mitochondrial Genome Database [13] (www.mtdb.igp.uu.se)

III. RESULTS AND DISCUSSION

With the unrooted tree for the 36 sequences we can arrive at multiple rooted trees. However, as described above, we prefer the most parsimonious reconstruction and use distance-based neighbour joining to create the tree. The first step towards this tree is performing the clustering, whose result is fed as input to the optimal parsimonious reconstruction.

Fig. 2. Cluster dendrogram for the dataset

The MPR function [10] can now provide the phylogeny tree with optimal parsimony, which facilitates our MRCA analysis. We magnify subregions of this tree to perform the MRCA calculations. To exemplify the method we take the sequences lying at the two extreme ends of the tree and perform the MRCA analysis. The tree gives the sequence AY714025 as the closest relative of the common ancestor of the individuals included in the dataset. As we move rightwards the number of mutations increases, but stronger intra-region relationships are observed, suggesting similar mutation patterns among individuals. Since we are analysing a group of autochthonous haplogroups, this observation also suggests common geographical origins and migrations. We know haplogroup L3 in Africa to be the ancestor of the macrohaplogroups M and N, which further branched out into J, K, I, W, R and U in Europe and South-East Asia about 40000-70000 years ago. In the TMRCA calculation a perfect match on all the markers suggests that the common maternal ancestor of these individuals may go back as far as 32 generations. Since our dataset with 65 segregation sites has the mutation difference between sequences limited to at most 8, we can rule out an origin of the most recent common ancestor outside the subcontinent. The sequence AY714025, which makes up the leftmost branch of the tree, belongs to the mtDNA haplogroup U2, which is found in Punjab in Pakistan in addition to the northern and eastern states of India. As the maximum number of transmission events to the most recent common ancestor in our study reaches a figure of 500, we can assume the origins of the MRCA to be in one of the above-mentioned regions. For the MRCA calculations the markers include mutations in the HVR2 region in addition to the HVR1 mutations.

Fig. 3. Optimal parsimony tree for the dataset. Optimal parsimonious reconstruction returns a parsimony value of 107



The data obtained from the right branch of the tree pertains to the haplogroups U7, K1, K8, H13, HV2 and V. This subset has 11 segregation sites. A distance matrix for all the sequences pertaining to these haplogroups was constructed using the K80 model [15], which distinguishes two kinds of mutations (transitions and transversions) and assigns separate probabilities to them, hence giving a better bound than the default Jukes and Cantor model.
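For reference, the K80 (Kimura two-parameter) distance between two aligned sequences is $d = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$, where $P$ and $Q$ are the proportions of sites showing a transition and a transversion respectively. The sketch below computes it directly; it is a standalone illustration rather than the R routine used to produce the distance matrix, and the function name `k80_distance` is an assumption.

```python
import math

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def k80_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in BASES or b not in BASES:
            continue  # assumption: gaps and ambiguity codes are skipped
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared      # proportion of transitional differences
    q = transversions / compared    # proportion of transversional differences
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K80 correction")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)
```

Because transitions and transversions enter the formula separately, two pairs of sequences with the same raw number of differences can receive different distances, which is what distinguishes K80 from the Jukes and Cantor correction.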

Fig. 4. Distance matrix created using the K80 model.

A normal distribution is plotted to estimate the number of transmission events between 2 sequences, taking the sequences 2 at a time from each branch of the tree. As described above, 2 transmission events are taken to be one generation and each transmission event adds 15 years to the time to the most recent common ancestor. Given below is the normal distribution obtained by taking the sequences of a Brahmin of haplogroup K1 lying at the rightmost edge of the tree and a Chaturvedi of haplogroup V, corresponding to accession numbers AY714004 and AY713979 respectively. The normal distribution does not show any rise until reaching the value of 87. This is expected, since the calculations performed on our model showed that the control region can survive up to 62 transmission events without any mutations. The normal distribution reaches a maximum at 472, suggesting a gap of 236 generations between the two individuals and their most recent common ancestor.

A similar calculation done for the whole tree returns a result of 536 transmission events for the MRCA. Taking into account our assumption of 15 years per transmission event, this returns a value of 8040 years. This enables us to conclude with a 50% confidence interval that the most recent maternal ancestor of the population under study existed about 8000 years ago. It has been established that the split from haplogroup L3 and expansion of haplogroups M and N started occurring about 40000-70000 years ago. Among the haplogroups studied, haplogroup R has the oldest origins, going back 65,000 years, while T is the youngest; it is estimated to have arisen in the region around the Mediterranean Sea around 17,000 years ago, which suggests a westward migration from the subcontinent through the Middle East [11]. Keeping these figures in mind, it is highly unlikely that the MRCA of the population under study, with origins going back 8000 years, came from outside the subcontinent.
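The conversions used in this section are simple enough to state as code. The sketch assumes, as in the text, 15 years per transmission event and 2 transmission events per generation; the function names are illustrative.

```python
YEARS_PER_TRANSMISSION_EVENT = 15      # assumption fixed in this study
TRANSMISSION_EVENTS_PER_GENERATION = 2


def events_to_generations(events):
    """Convert a count of transmission events to generations."""
    return events / TRANSMISSION_EVENTS_PER_GENERATION


def events_to_years(events):
    """Convert a count of transmission events to years."""
    return events * YEARS_PER_TRANSMISSION_EVENT


pair_generations = events_to_generations(472)   # K1 and V pair discussed above
tree_years = events_to_years(536)               # whole-tree MRCA estimate
```

The peak at 472 transmission events corresponds to 236 generations, and 536 transmission events correspond to 8040 years. Both figures scale linearly with the assumed 15-year interval, so a different choice of interval would shift the dates proportionally.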

Fig. 5. The normal distribution for estimation of TMRCA of AY714004 and AY713979

Fig. 6. Magnified sub-portion of the rightmost branch depicting mutations and the number of transmission events.

It is important to note that most of the studied haplogroups are subgroups of macrohaplogroup N, which is scarce in the Dravidian and North Eastern populations of India. Haplogroup M is the dominant haplogroup among these populations. Out-of-Africa migrations suggest the origins of the split of haplogroup N to be in the North-Western part of the Indian subcontinent. Mitochondrial DNA migration patterns suggest that the descendants of haplogroup N had expanded into Europe and developed new subhaplogroups about 8000-17000 years ago [11]. So we can conclude that the most recent maternal ancestor of the population under study originated not more than 8000 years ago in North-Western or Central India.



The descendants of haplogroup R in India, namely the haplogroups J, T and group R0, show high haplogroup diversity, suggesting their autochthonous status [6]. South Asia lies in the way of the earliest dispersals out of Africa and is vital for phylogeography studies in the Western European population [11]. Although the spread of haplogroup R is very wide, it is unlikely that any subhaplogroup in Europe or Central Asia (migration routes out of India) has descended from the most recent common ancestor of the population under study.

IV. FUTURE WORK

The present study has been undertaken using the infinite alleles model on only a subset of the sequences sequenced by Palanichamy et al. from Uttar Pradesh, Andhra Pradesh, West Bengal and the North Eastern States. We would like to undertake a similar study on the whole dataset using the stepwise mutation model, which tries to better account for the actual mutational process that occurs at microsatellite markers by scoring the marker lengths. The stepwise mutation model looks at the frequency spectrum of the mismatches, namely how many loci show no mismatches, 1 mismatch, 2 mismatches and so on. We would focus on the relationships between the different autochthonous haplogroups and subhaplogroups in different parts of India. We also want to study the subcontinent-specific phylogeography and migration patterns of the autochthonous subhaplogroups.

During the course of the project we developed a Python program for identifying the mutation sites in the dataset. It takes as input a file containing sequences in PHYLIP or FASTA format, loads the data into a matrix and searches for dissimilar bases at each index. The indexes were numbered 16024-16569 for the HVR1 region. It aligns the mutation sites to the revised Cambridge Reference Sequence (rCRS) and returns the number and indexes of the segregation sites as its output. We plan to come up with a GUI-based version of this program, which we plan to call MutaSeg.
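The core of such a program can be sketched in a few lines. This is not the program described above, whose source is not reproduced here; `read_fasta` and `segregating_sites` are hypothetical names, only FASTA input is handled, and the sequences are assumed to be already aligned and of equal length.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of id -> sequence."""
    sequences = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]   # id is the first word of the header
            sequences[current] = ""
        elif current is not None:
            sequences[current] += line.upper()
    return sequences


def segregating_sites(sequences, start_position=16024):
    """Return genome positions at which the aligned sequences are not all identical.

    start_position is the genome coordinate of the first column
    (16024 for the HVR1 window used in this study).
    """
    seqs = list(sequences.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for i in range(length):
        column = {s[i] for s in seqs}
        if len(column) > 1:                 # more than one base seen in this column
            sites.append(start_position + i)
    return sites
```

Given three toy sequences `ACGT`, `ACGA` and `GCGT`, the sketch reports positions 16024 and 16027 as segregation sites. Comparing each site against the rCRS base, as the text describes, would be a further step on top of this.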

REFERENCES

[1] Griffiths, R. C., and Simon Tavare. Ancestral inference in population genetics. Statistical Science (1994): 307-319.

[2] Palanichamy, et al. Phylogeny of Mitochondrial DNA Macrohaplogroup N in India, Based on Complete Sequencing: Implications for the Peopling of South Asia. The American Journal of Human Genetics, 75 (2004), 966-978.

[3] Paradis E., Claude J. and Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289-290.

[4] Parsons, Thomas J., et al. A high observed substitution rate in the human mitochondrial DNA control region. Nature genetics 15.4 (1997): 363-368.

[5] Saitou, Naruya, and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution 4.4 (1987): 406-425.

[6] Karmin, Monika. Human mitochondrial DNA haplogroup R in India: dissecting the phylogenetic tree of South Asian-specific lineages. Diss. 2005.

[7] Rosenberg, Noah A., and Magnus Nordborg. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics 3.5 (2002): 380-390.

[8] van Oven M, Kayser M. 2009. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat 30(2): E386-E394.

[9] Metspalu, Mait, et al. Most of the extant mtDNA boundaries in south and southwest Asia were likely shaped during the initial settlement of Eurasia by anatomically modern humans. BMC genetics 5.1 (2004): 26.

[10] Narushima, H. and Hanazawa, M. A more efficient algorithm for MPR problems in phylogeny. Discrete Applied Mathematics, 80 (1997), 231-238.

[11] Richards, Martin B., et al. Phylogeography of mitochondrial DNA in western Europe. Annals of human genetics 62.3 (1998): 241-260.

[12] Maji, Suvendu, S. Krithika, and T. S. Vasulu. Phylogeographic distribution of mitochondrial DNA macrohaplogroup M in India. Journal of genetics 88.1 (2009): 127-139.

[13] Ingman, M. and Gyllensten, U. Human Mitochondrial Genome Database, a resource for population genetics and medical sciences. Nucleic Acids Res 34, D749-D751 (2006).

[14] Walsh, Bruce. Estimating the time to the most recent common ancestor for the Y chromosome or mitochondrial DNA for a pair of individuals. Genetics 158.2 (2001): 897-912.

[15] Kimura, Motoo. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of molecular evolution 16.2 (1980): 111-120.
