Phylogenetic and Phylogeographic Ancestral Inference of the Brahmin Population in Uttar Pradesh


  • 7/30/2019 Phylogenetic and Phylogeographic Ancestral Inference of the Brahmin Population in Uttar Pradesh


    Phylogenetic and Phylogeographic Ancestral

    Inference of the Brahmin Population in Uttar

    Pradesh

    Rohit Chandra

    Dept. of CSE, IIIT Delhi


    Suhanth Boddu

    Dept. of CSE, IIIT Delhi


    Divyanshu Bansal

    Dept. of CSE, IIIT Delhi


    Abstract: India is one of the most ethnically diverse regions of the world and is considered to be a major Paleolithic migration route. India is home to several autochthonous subhaplogroups of the haplogroup N, which is one of the signature haplogroups that define out-of-Africa migrations. Here we study the phylogeny of the Brahmin population of the north Indian state of Uttar Pradesh using the theory of coalescence, haplotype trees and the infinite alleles model on our mtDNA dataset. We use well-defined and established methods with new assumptions and modifications to suit our dataset. Based on the observed results we also discuss the possible phylogeography of the population. We estimate the time to the most recent common maternal ancestor of the population under study and comment on the possible place of origin and migration paths of this ancestor. This study uses mitochondrial DNA control region sequence data for analysis.

    Key words: Mitochondrial DNA (mtDNA), coalescent, haplogroup, haplotype, phylogeny, Brahmin, phylogeny tree, D-Loop, hypervariable regions (HVR), control region, single nucleotide polymorphism (SNP).

    I. INTRODUCTION

    Uttar Pradesh is the largest and the fifth largest state of India in terms of population and area respectively. With a population of 200 million it is one of the most diverse regions in terms of mtDNA haplogroups, as is evident from the fact that we were able to identify 25 different haplogroups in our small dataset of 50 sequences. Roughly 12% of UP's population consists of Brahmins, and this is the population of interest for our study.

    Mitochondrial DNA in humans contains about 16500 base

    pairs and codes for 13 proteins, 22 tRNA genes and 2 rRNA

    genes. Mitochondrial DNA is inherited exclusively from the

    mother which makes it ideal for studying maternal lineage

    far back in time. With respect to phylogeny studies, a part of the mtDNA called the D-Loop is of vital importance.

    D-loop is a DNA structure where the two strands of a

    double-stranded DNA molecule are separated for a stretch

    and held apart by a third strand of DNA. The D-Loop forms a

    part of the non-coding control region of mtDNA and consists

    of 1100 base pairs. It sits between bp 16001-574 of the

    circular genome. Contained within this D-Loop are 2 highly

    polymorphic regions known as hypervariable regions. HVR1

    is located between bp 16001-16560 while HVR2 is located

    in bp 1-574. Hypervariable regions are ideal for studying

    ancestral relationships among organisms as they are highly

    polymorphic. We will focus on the HVR1 region for our study.

    Mutations in the mitochondrial DNA result in the substitution of one of the bases A, C, G or T by another one. These mutations are of two types: transversions, which are substitutions between a purine (A, G) and a pyrimidine (C, T), and transitions, which are substitutions within the purines or within the pyrimidines. With regard to

    ancestral inference it makes sense to study the mutation

    patterns among different individuals and use these patterns to

    obtain results. While making the distance matrices for further

    analysis we only include the sites of mutations among the

    individuals and remove the rest of the sequence just to make

    things less complex. These sites are called segregation sites

    and individuals having the same mutation patterns make a

    lineage. It is safe to assume that people belonging to the same

    lineage based on HVR regions will have the same ancestral

    history regardless of mutations in other regions of mtDNA.
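
The distinction between transitions and transversions described above can be illustrated with a short Python sketch. It is not the code used in the study; the function names are hypothetical, and it assumes the two sequences are already aligned and contain only the bases A, C, G and T at the compared positions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'transition', 'transversion', or None when the bases match
    or when either symbol is not one of A, C, G, T."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a == b or a not in valid or b not in valid:
        return None
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) -> transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


def count_substitutions(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind is not None:
            counts[kind] += 1
    return counts
```

For example, comparing the toy strings `ACGT` and `GCTT` gives one transition (A to G) and one transversion (G to T).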

    II. METHODOLOGY

    A. Dataset

    The dataset presented here corresponds to a subset of the

    data collected by Palanichamy et al. [2] from the states of Uttar

    Pradesh, West Bengal and Andhra Pradesh. From this dataset

    47 sequences were identified that pertained to the Brahmin

    population of Uttar Pradesh. The dataset is publicly available on the NCBI website with the accession numbers AY713976 - AY714050. The infinitely many sites model [1] assumes each polymorphism to be a unique mutation. Since the HVR regions of the individuals varied in length, we took the region lying between bp 16024 - 16569 in order to maintain uniformity of the data. This gave a constant figure of 546 bases for all the sequences.

    It was ensured that no mutation site was left unaccounted for

    by this adjustment. These filtered sequences were then aligned

    using ClustalW multiple sequence alignment in order to get the

    segregation sites. The same was done for the HVR2 sequences

    as well. The sequences with the number of transition difference

    less than or equal to 2 were grouped together under a lineage.

    However the sequences with a single transversion difference

    were considered distinct. 36 lineages were identified and this


    narrowed down the data to 36 sequences of 546 base pairs.

    After the identification of the segregation sites the sequences

    were aligned to the revised Cambridge reference sequence

    (rCRS) to get the mutation patterns. These patterns were

    graphed and 65 different site patterns emerged. These 36

    sequences of 65 letters each made the final version of the

    dataset. The mtDNA haplogroups in the dataset are H, W, I,

    J3, R, T2, K1, U3, U7, HV, V and their subhaplogroups. The

    data is further divided into Bhargava, Chaturvedi and other

    Brahmins on the basis of surnames.
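
The lineage rule stated above (at most two transition differences, and no transversion, for two sequences to share a lineage) can be sketched in Python as follows. This is only an illustration under stated assumptions: the paper does not say how ties or chains of near-identical sequences were resolved, so the greedy strategy of comparing each sequence with the first member of every existing lineage is an assumption, as are the function names. The sketch also assumes aligned sequences containing only A, C, G and T.

```python
PURINES = {"A", "G"}


def differences(seq1, seq2):
    """Return (transitions, transversions) between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ts, tv


def same_lineage(seq1, seq2, max_transitions=2):
    """Rule from the text: no transversion and at most two transitions."""
    ts, tv = differences(seq1, seq2)
    return tv == 0 and ts <= max_transitions


def group_lineages(sequences, max_transitions=2):
    """Greedy grouping (an assumption): each sequence joins the first lineage
    whose first member satisfies the rule, otherwise it starts a new lineage."""
    lineages = []
    for name, seq in sequences.items():
        for members in lineages:
            if same_lineage(sequences[members[0]], seq, max_transitions):
                members.append(name)
                break
        else:
            lineages.append([name])
    return lineages
```

With the hypothetical input `{"s1": "AAAA", "s2": "GAAA", "s3": "GGGA", "s4": "CAAA"}`, `s2` joins `s1` (one transition), while `s3` (three transitions) and `s4` (one transversion) each start their own lineage.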

    B. Making sense of the data - The Tree

    The Analyses of Phylogenetics and Evolution (APE) package [3] in R provides convenient methods for creating and

    analysing genealogical trees. Phangorn and Phyclust packages

    are useful in creating distance matrices and unrooted trees.

    Using the unrooted tree as reference various rooted trees can

    be created and analysed. In order to arrive at one tree for

    analysis we select the most parsimonious construction for the

    unrooted tree.

    Fig. 1. Unrooted genealogical tree for the mtDNA data. The label 21 refers to sequence AY714021.

    The sequence data is used for creating a Euclidean distance matrix representing the measure of genetic distance between pairs of sequences. The UPGMA (unweighted pair group method with arithmetic mean) method clusters the two closest sequences and then computes a new distance matrix in which the distance from the new cluster to every other sequence is the arithmetic mean of the distances from its members. This process is repeated until all the sequences are grouped. A related method called the neighbour joining method [5] pulls out a pair of sequences at each iteration so that the total length of the branches on the tree is minimized. After a pair of nodes is pulled out, it forms a cluster in the tree and is included in further rounds of iteration. Here also a new distance matrix is generated at each iteration. These methods help us in arriving at the tree with optimal parsimony which can be used for MRCA studies.
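
The UPGMA procedure described above can be sketched in a few lines of Python. The study itself used R packages, so this is only an illustration of the clustering step: the function name and the toy matrix in the usage note are hypothetical, and the input is assumed to be a symmetric matrix of pairwise distances.

```python
def upgma(labels, matrix):
    """Cluster with UPGMA.

    Returns the tree as nested tuples and a list of (merged pair, height),
    where height is half the distance at which the pair was joined.
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    merges = []
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])  # closest pair
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        new_d = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Size-weighted mean = arithmetic mean over all member sequences.
            new_d[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        # Drop every distance that involved the two merged clusters.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((ti, tj), ni + nj)
        merges.append(((ti, tj), dij / 2))
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges
```

On a toy matrix where A and B are 2 apart, C is 6 from both, and D is 10 from everything, the sketch joins A with B first, then adds C, then D.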

    C. The Most Recent Common Ancestor

    The infinite alleles model suggests that each mutation creates a new allele and that allele types are all equally different

    from each other. This model can be applied while calculating

    the time to most recent common ancestor for two individuals

    (two genes to be accurate). This model takes into account

    mutation rate, time, sample size and the number of matching

    markers [14]. Two sequences differing by a single mutation

    will have a match on 1050 markers [546 of HVR1 and 504

    of HVR2] for our dataset. The chi-squared test comes in handy here as it can be used to verify the TMRCA and get a confidence interval for our estimates. The probability that

    two lineages coalesce in the immediately preceding generation

    is the probability that they share a parental DNA sequence.

    In a diploid population with a constant effective population

    size with 2Ne copies of each locus, there are 2Ne potential

    parents in the previous generation, so the probability that

    two alleles share a parent is $1/(2N_e)$ and, correspondingly, the probability that they do not coalesce is $1 - 1/(2N_e)$. So the number of generations back to the coalescence is geometrically distributed, and the probability that the two lineages coalesce exactly $t$ generations back is given by the formula

    $$P(t) = \left(1 - \frac{1}{2N_e}\right)^{t-1} \frac{1}{2N_e}$$
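
As a numerical check on the geometric distribution above, the following Python sketch evaluates P(t) and its mean, which for a geometric distribution with success probability 1/(2Ne) is 2Ne generations. The function names and the value Ne = 50 used in the usage note are hypothetical and chosen only for illustration.

```python
def coalescence_probability(t, n_e):
    """Probability that two lineages first share a parent exactly t generations back."""
    p = 1.0 / (2 * n_e)  # chance of coalescing in any single generation
    return (1 - p) ** (t - 1) * p


def expected_coalescence_time(n_e):
    """Mean of the geometric distribution: 1/p = 2 * Ne generations."""
    return 2 * n_e
```

With Ne = 50 the per-generation probability is 0.01, the probabilities over all t sum to 1, and the mean waiting time is 100 generations.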

    The Ewens sampling formula gives us an estimate of how many different alleles are observed a given number of times in the sample. This can come in handy while trying to gain an estimate of the TMRCA in different regions of the cladogram. We can take the substitutions in different branches to follow a Poisson process with a rate of $\theta/2$ on each branch. While scoring for the TMRCA we give a higher score to a transversion and a lower one to a transition, since a transition indicates a closer ancestral relationship while a transversion signifies a greater mutation. The p-value obtained by the chi-squared test helps in getting a confidence interval estimate for our TMRCA calculations.

    Parsons et al. [4] suggest methods for mapping substitutions and suggest a mutation rate value of $3 \times 10^{-5}$ per nucleotide per transmission event for the D-loop HVR regions. Two of these transmission events correspond to one generation and this is the value used for getting the time to most recent common ancestor. For our study we have fixed a period of 15 years for one transmission event. As we will see this gives a 50% confidence interval on our calculations. The average mtDNA haplotype mutation rate of $3 \times 10^{-5}$ gives us the period for which a region of the control region remains unchanged: $3 \times 10^{-5} \times 1050 = 0.0315$ per transmission event. This suggests that a combined HVR1 + HVR2 region of mtDNA can survive for 32 generations or 960 years (based on our


    fixed transmission event time). A similar estimate for HVR1

    region gives a figure of 62 generations. Thus the common

    female ancestor (MRCA) of two people who randomly match

    exactly for the combined HVR1+HVR2 mtDNA haplotype can

    go back to a period of almost 1000 years.
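
The arithmetic behind these figures can be reproduced with a short Python sketch. The constants are taken from the text; the function names are hypothetical. The sketch treats the reciprocal of the per-region rate as the expected wait before the first mutation, and it reproduces the 960-year figure by counting each unit of that wait as two 15-year transmission events, which is an interpretation of the text rather than a stated rule.

```python
MUTATION_RATE = 3e-5   # per nucleotide per transmission event, Parsons et al. [4]
HVR1_SITES = 546       # bp 16024-16569
HVR2_SITES = 504
YEARS_PER_EVENT = 15   # fixed period for one transmission event


def region_rate(n_sites, rate=MUTATION_RATE):
    """Expected number of mutations in a region of n_sites per transmission event."""
    return rate * n_sites


def expected_wait(n_sites, rate=MUTATION_RATE):
    """Mean number of steps before the first mutation appears in the region."""
    return 1.0 / region_rate(n_sites, rate)


combined = expected_wait(HVR1_SITES + HVR2_SITES)  # about 31.7, rounded to 32
hvr1_only = expected_wait(HVR1_SITES)              # about 61
years = round(combined) * 2 * YEARS_PER_EVENT      # 32 * 30 = 960
```

The combined HVR1 + HVR2 rate evaluates to 0.0315 and its reciprocal rounds to 32. For HVR1 alone the same calculation gives about 61, close to the figure quoted above.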

    D. Resources

    R Packages - APE [3], Phangorn, Phyclust, rgl, igraph

    ClustalW multiple sequence alignment

    Python

    Snapgene Viewer

    NCBI genome browser

    mtDB Human Mitochondrial Genome Database [13]

    (www.mtdb.igp.uu.se)

    III. RESULTS AND DISCUSSION

    With the unrooted tree for the 36 sequences we can arrive

    at multiple rooted graphs. However as described above we

    prefer the most parsimonious reconstruction and use distance

    based neighbour joining to create the tree. The first step towards this tree is performing the clustering, which can be fed as input to the optimal parsimonious reconstruction.

    Fig. 2. Cluster dendrogram for the dataset

    The MPR function [10] can now provide the phylogeny tree with optimal parsimony, which facilitates our MRCA analysis. We magnify subregions in this tree to perform the MRCA calculations. To exemplify the method we take the sequences lying on the two extreme ends of the tree and perform the MRCA analysis. The tree gives us the sequence AY714025 to be the closest relative of the common ancestor of the individuals included in the dataset. As we move rightwards the number of mutations increases, but stronger intra-region relationships are observed, suggesting similar mutation patterns among individuals. Since

    we are analysing a group of autochthonous haplogroups

    this observation also suggests common geographical origins

    and migrations. We know haplogroup L3 in Africa to be

    the ancestor of the macrohaplogroups M and N which

    further branched out into J, K, I, W, R and U in Europe

    and South-East Asia about 40000-70000 years ago. In the TMRCA calculation a perfect match on all the markers suggests that the common maternal ancestor of these individuals may go back as far as 32 generations. Since our dataset

    with 65 segregation sites has the mutation difference between

    sequences limited to at most 8 we can rule out the origin of

    the most recent common ancestor outside of the subcontinent.

    The sequence AY714025, which makes the leftmost branch of the tree, belongs to the mtDNA haplogroup U2, which is found in Punjab in Pakistan in addition to the northern and eastern states of India. As the maximum number of transmission events to the most recent common ancestor in our study reaches a figure of 500, we can safely assume the origins of the MRCA to be in one of the above-mentioned regions. For the MRCA

    calculations the markers include mutations in HVR2 region

    in addition to the HVR1 mutations.

    Fig. 3. Optimal parsimony tree for the dataset. Optimal parsimonious reconstruction returns a parsimony value of 107.


    The data obtained from the right branch of the tree pertains

    to the haplogroups U7, K1, K8, H13, HV2 and V. This dataset

    has 11 segregation sites. A distance matrix created out of all

    the sequences pertaining to these haplogroups was constructed

    using the K80 model [15] which takes into account two kinds

    of mutations and allots separate probabilities to them hence

    giving a better bound than the default Jukes and Cantor

    model.
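
For reference, the K80 (Kimura two-parameter) distance between two aligned sequences has the closed form $d = -\frac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$, where $P$ and $Q$ are the observed proportions of transitions and transversions. The Python sketch below is a minimal illustration of that formula, not the R routine used in the study; the function name is hypothetical, and positions containing anything other than A, C, G or T are skipped as an assumption.

```python
import math

PURINES = {"A", "G"}


def k80_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    n = ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumption)
        n += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            ts += 1
        else:
            tv += 1
    if n == 0:
        raise ValueError("no comparable sites")
    p, q = ts / n, tv / n
    # math.log raises ValueError when the sequences are too divergent
    # for the correction to be defined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Because the model corrects for multiple substitutions at the same site, the returned distance is slightly larger than the raw proportion of differing sites.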

    Fig. 4. Distance matrix created using the K80 model.

    A normal distribution is plotted to gain an estimate of the transmission events between 2 sequences by taking the sequences 2 at a time from each branch of the tree. As described above, 2 transmission events are taken to be one generation and each transmission event adds 15 years to the time to most recent common ancestor. Fig. 5 shows the normal distribution obtained by taking the sequences of a Brahmin of haplogroup K1 lying at the rightmost edge of the tree and a Chaturvedi of haplogroup V, corresponding to accession numbers AY714004 and AY713979 respectively. The normal distribution does not show any rise until reaching the value of 87. This is expected, since the calculations performed on our model showed that the control region can survive up to 62 transmission events without any mutations. The normal distribution reaches a maximum at 472, suggesting a gap of 236 generations between the two individuals and their most recent common ancestor.

    A similar kind of calculation done for the whole tree returns

    a result of 536 transmission events for the MRCA. Taking into account our assumptions regarding the transmission events this returns a value of 8040 years. This enables us to conclude with a 50% confidence interval that the most recent maternal ancestor of the population under study existed about 8000 years ago. It has been established that the split from haplogroup L3 and expansion of haplogroups M and N started occurring about 40000-70000 years ago. Among the haplogroups studied, haplogroup R has the oldest

    origins going back to 65,000 years while T is the youngest

    which is estimated to have lived in the region around the

    Mediterranean Sea around 17,000 years ago which suggests

    a western migration from the subcontinent through the Middle East [11]. Keeping these figures in mind it is highly unlikely

    that the MRCA of the population under study with origins

    going back to 8000 years came from outside the subcontinent.
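
The unit conversions used in this section follow directly from the two assumptions stated earlier (15 years per transmission event, two transmission events per generation). A minimal Python sketch, with hypothetical function names:

```python
YEARS_PER_EVENT = 15        # fixed period for one transmission event
EVENTS_PER_GENERATION = 2   # two transmission events make one generation


def events_to_years(events):
    """Convert a number of transmission events to years."""
    return events * YEARS_PER_EVENT


def events_to_generations(events):
    """Convert a number of transmission events to generations."""
    return events / EVENTS_PER_GENERATION
```

With these conventions, 536 transmission events correspond to 8040 years, and the peak at 472 transmission events corresponds to 236 generations.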

    Fig. 5. The normal distribution for estimation of TMRCA of AY714004 and AY713979

    Fig. 6. Magnified sub-portion of the rightmost branch depicting mutations and the number of transmission events.

    It is important to note that most of the studied haplogroups

    are subgroups of macrohaplogroup N which is found scarcely

    in the Dravidian and North Eastern population in India.

    Haplogroup M is the dominant haplogroup among these

    populations. Out of Africa migrations suggest the origins of the split of haplogroup N to be in the North Western part

    of the Indian subcontinent. Mitochondrial DNA migration

    patterns suggest that the descendants of haplogroup N had

    expanded into Europe and developed new subhaplogroups

    about 8000-17000 years ago [11]. So we can conclude that the

    most recent maternal ancestor of the population under study

    originated not more than 8000 years ago in North-Western or

    Central India.


    The descendants of haplogroup R in India namely the

    haplogroups J, T and group R0 show high haplogroup

    diversity suggesting their autochthonous status[6]. South Asia

    lies in the way of earliest dispersals out of Africa and is

    vital for phylogeography studies in the Western European

    population [11]. Although the spread of Haplogroup R is

    very wide it is unlikely that any subhaplogroup in Europe or

    Central Asia (migration routes out of India) has descended

    from the most recent common ancestor of the population

    under study.

    IV. FUTURE WORK

    The present study has been undertaken using the infinite

    alleles model using only a subset of the sequences sequenced

    by Palanichamy et al. from Uttar Pradesh, Andhra Pradesh,

    West Bengal and North Eastern States. We would like to

    undertake a similar study on the whole dataset using the

    stepwise mutation model which tries to better account for

    the actual mutational process that occurs at microsatellite

    markers scoring the marker lengths. The stepwise mutation model looks at the frequency spectrum of the mismatches,

    namely how many loci show no mismatches, 1 mismatch, 2

    mismatches and so on. We would focus on the relationships

    between the different autochthonous haplogroups and

    subhaplogroups in different parts of India. We also want to

    study the subcontinent specific phylogeography and migration

    patterns of the autochthonous subhaplogroups.

    During the course of the project we developed a Python

    program for identifying the mutation sites in the dataset.

    It takes as input a file containing sequences in phylip or

    fasta format, loads the data into a matrix and searches for

    dissimilar bases at indexes. The indexes were numbered

    16024-16569 for HVR1 region. It aligns the mutation sites to the revised Cambridge Reference Sequence (rCRS) and returns the

    number and indexes of the segregation sites as its output. We

    plan to come up with a GUI based version of this program.

    We plan to call it MutaSeg.
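
A program of the kind described above could look roughly like the following Python sketch. This is not the authors' program: the function names are hypothetical, only FASTA input is handled, the sequences are assumed to be already aligned to the same length, and the comparison with the rCRS is left out.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping record name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first token of the header line
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def segregation_sites(records, first_position=16024):
    """Return the positions at which the aligned sequences are not all identical."""
    seqs = list(records.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for offset, column in enumerate(zip(*seqs)):
        if len(set(column)) > 1:  # more than one base seen in this column
            sites.append(first_position + offset)
    return sites


def site_patterns(records, sites, first_position=16024):
    """Reduce each sequence to its bases at the segregation sites only."""
    return {
        name: "".join(seq[pos - first_position] for pos in sites)
        for name, seq in records.items()
    }
```

Given three toy records `ACGT`, `ACTT` and `GCGT`, the sketch reports positions 16024 and 16026 as segregation sites and reduces the records to the two-letter patterns `AG`, `AT` and `GG`.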

    REFERENCES

    [1] Griffiths, R. C., and Simon Tavare. Ancestral inference in population genetics. Statistical Science (1994): 307-319.

    [2] Palanichamy, et al. Phylogeny of Mitochondrial DNA Macrohaplogroup N in India, Based on Complete Sequencing: Implications for the Peopling of South Asia. The American Journal of Human Genetics, 75 (2004), 966-978.

    [3] Paradis E., Claude J. and Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289-290.

    [4] Parsons, Thomas J., et al. A high observed substitution rate in the human mitochondrial DNA control region. Nature genetics 15.4 (1997): 363-368.

    [5] Saitou, Naruya, and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution 4.4 (1987): 406-425.

    [6] Karmin, Monika. Human mitochondrial DNA haplogroup R in India: dissecting the phylogenetic tree of South Asian-specific lineages. Diss. 2005.

    [7] Rosenberg, Noah A., and Magnus Nordborg. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics 3.5 (2002): 380-390.

    [8] van Oven M, Kayser M. 2009. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat 30(2): E386-E394.

    [9] Metspalu, Mait, et al. Most of the extant mtDNA boundaries in south and southwest Asia were likely shaped during the initial settlement of Eurasia by anatomically modern humans. BMC genetics 5.1 (2004): 26.

    [10] Narushima, H. and Hanazawa, M. A more efficient algorithm for MPR problems in phylogeny. Discrete Applied Mathematics, 80 (1997), 231-238.

    [11] Richards, Martin B., et al. Phylogeography of mitochondrial DNA in western Europe. Annals of human genetics 62.3 (1998): 241-260.

    [12] Maji, Suvendu, S. Krithika, and T. S. Vasulu. Phylogeographic distribution of mitochondrial DNA macrohaplogroup M in India. Journal of genetics 88.1 (2009): 127-139.

    [13] Ingman, M. and Gyllensten, U. Human Mitochondrial Genome Database, a resource for population genetics and medical sciences. Nucleic Acids Res 34, D749-D751 (2006).

    [14] Walsh, Bruce. Estimating the time to the most recent common ancestor for the Y chromosome or mitochondrial DNA for a pair of individuals. Genetics 158.2 (2001): 897-912.

    [15] Kimura, Motoo. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of molecular evolution 16.2 (1980): 111-120.