Upload
ngothuan
View
222
Download
0
Embed Size (px)
Citation preview
Evolutionary Genetics: Part 1
Polymorphism in DNA
S. peruvianum
S. chilense
Winter Semester 2012-2013
Prof Aurélien TellierFG Populationsgenetik
Color code
Color code:
Red = Important result or definition
Purple: exercise to do
Green: some bits of maths
Population genetics
Why study the genetics of populations?
� the population is the main unit at which selection acts !!!!!!!!!
Evolution = changes between generations of frequency of
characters, traits or alleles
Useful definitions
� DNA has 4 bases Adenine, Guanine, Cytosine, Thymine
� Haploid= organisms with one set of non-paired chromosomes
� Diploids, Polyploids
� Tetraploid for Maize,
� Up to Hexaploid for Wheat
� Decaploid for strawberries
� Pentapoid to Duodecaploid for sugarcane
� Chromosomal location of a gene is a locus
� Several alleles can be observed at a locus (one from mother and one from father)
� The complete set of alleles in a species or population = gene pool
� The occurrence of one allele in proportion to total in gene pool = allele frequency
Population genetics
Populations and species show variability:
� what type and how much genetic variation exist within populations/species?
� what are the forces that influence the amount of variation within
populations?
� First question: what is the variability at the genetic (DNA) level?
Definition: population genetics is the study of the frequencies of alleles in
populations as well as their temporal and spatial changes
Some definitions: point mutations
Point mutation = change in the DNA base (e.g. T becomes G)
Insertions – deletions = removal or insertion of bases
exercise 1.1 and 1.2
use DNASP and the file 055twolines.fas
Some definitions: point mutations
Consequences on the protein:
Synonymous mutations: do not change the codon and the Amino Acid
Non-synonymous mutations: change the Amino Acid
Silent sites = non-coding regions + synonymous sites
- Frameshift mutation: change reading frame (due to indels)
- Nonsense mutation: stop codon is introduced
Positions in the codon:
-A position is fourfold degenerate if any nucleotide change specifies the same AA (only 3rd position of a codon) ex: Glycine codons
-Twofold degenerate if two our of 4 changes specifies the same AA (ex 3rd position of Glutamic Acid)
-Threefold degenerate? 3rd position Isoleucine
-Non-degenerate: any change specifies a different AA
Practical exercise
Using database to find info on sequences
Exercise: Download the file: data-plants.fas
1) Open it with DNASP. What do you see?
2) Go to www.ncbi.nih.gov to the BLAST tools
3) Look at the options, these are plant sequences. Can you retrieve where the
sequences are from?
4) You will be directed to the results of the BLAST: lets look at them by moving the
cursor on the lines. Scores for aligments.
5) What are the best hits? Then you will be directed to the GenBank directory of
sequences. What information are there for these sequences?
Practical exercise
Then we will insert information from the database into the sequences in DNASP.
Place the coding region using the: data -> assign coding regions
Then you see changes in the way DNASP shows the sequences (see example)
Open also in Mesquite
Can you find out how many changes are silent, synonymous or non-synonymous?
How many SNPs are there? (Single Nucleotide Polymorphisms = mutational changes)
Different types of data
� Patterns of diversity can be observed in populations, species or among species.
� Phylogenetic trees fall in the between species comparison class
� Type of data: DNA sequences (SNPs), proteines sequences, microsatellites
� Microsatellites are short stretches of repeated DNA: TATATATATATATA
� What matters is the number of motif repeats
� One can look at their size using electrophoresis gel, but they contain less
information than SNP data.
� Their mutation rate is also higher due to the ripping of the Polymerase on them.
Questions to ask when looking at data
-Are the sequences already aligned?
-Are the data from one population or more than one? Or different species?
-Are the data from sexually or asexually reproducing organisms?
-Are the sequences from coding or non-coding, or both DNA?
-Are the data from one locus or several loci?
-Do we see all sites or only the variable ones (SNP, indels or both)?
-Do we see all sequences or only the different ones?
-Are the data from microsatellites or SNPs?
=> Go to Exercise 1.3 for also the Solanum data
Population genetics: 4 evolutionary forces
random genomic processes
(mutation, duplication, recombination, gene conversion)
natural
selection
random demographic
process (drift)
random spatial
process (migration)
molecular diversity
Population genetics investigates the laws governing the genetic structure of
populations, and changes in allele frequencies over time
Population genetics: 4 evolutionary forces
random genomic processes
(mutation, duplication, recombination, gene conversion)
natural
selection
random demographic
process (drift)
random spatial
process (migration)
molecular diversity
We want to infer the role of the evolutionary forces from sequence data
(very useful tool is the coalescent theory)
phenotypic
variability
Molecular clock
� when DNA is passed from one generation to the next there is a constant
probability called µ that a mutation occurs
� because the polymerase is not error free
� we assume the rate is constant (though over long periods of time this may not be
true)
� Probability of mutation of what?
� at a site (per site mutation rate)
� at an entire locus (locus mutation rate)
� genome wide mutation rate
� µ is the probability per generation of a mutation, and (1- µ) is the probability that
no mutation occurs
Molecular clock
� thus P[no mutation for t generations] = (1- µ)t
� as long as µ << t we can use an approximation:
1 + x ≈ e x and 1 - x ≈ e -x
� result: P[no mutation for t generations] = e –µt graph of the exponential?
Molecular clock
� T is the time to the next mutation
� P[no mutation for t generations] = P[T ≥ t ] = e –µt has
� Separated by at least 2t generations, and split between Drosophila populations
= 100 000 generations (10 000 years)
2
1 1[ ] and [ ]E T Var T
µ µ= =
Molecular clock
� D is the number of polymorphic sites between the sequences, also called the
divergence
� K mutations have appeared along the branches of the two descendants populations,
which have length of 2t
Molecular clock
� Maths 2: estimator
� assuming K = D
� thus E[D] = E[K] = 2µt
� so we obtain a first estimator for µ:
� So we can calculate the mutation rate if we know the time of divergence and
the divergence from molecular data!!!!
� Exercise 1.4: calculate the mutation rate in Drosophila
ˆ2
D
tµ =