Evolutionary Genetics: Part 1 Polymorphism in DNA · Evolutionary Genetics: Part 1 Polymorphism in DNA S. peruvianum S. chilense ... Separated by at least 2t generations, and split

Evolutionary Genetics: Part 1

Polymorphism in DNA

S. peruvianum

S. chilense

Winter Semester 2012-2013

Prof Aurélien TellierFG Populationsgenetik

Color code

Color code:

Red = Important result or definition

Purple: exercise to do

Green: some bits of maths

Population genetics

Why study the genetics of populations?

� the population is the main unit at which selection acts !!!!!!!!!

Evolution = changes between generations of frequency of

characters, traits or alleles

Useful definitions

� DNA has 4 bases Adenine, Guanine, Cytosine, Thymine

� Haploid= organisms with one set of non-paired chromosomes

� Diploids, Polyploids

� Tetraploid for Maize,

� Up to Hexaploid for Wheat

� Decaploid for strawberries

� Pentapoid to Duodecaploid for sugarcane

� Chromosomal location of a gene is a locus

� Several alleles can be observed at a locus (one from mother and one from father)

� The complete set of alleles in a species or population = gene pool

� The occurrence of one allele in proportion to total in gene pool = allele frequency

Population genetics

Populations and species show variability:

� what type and how much genetic variation exist within populations/species?

� what are the forces that influence the amount of variation within

populations?

� First question: what is the variability at the genetic (DNA) level?

Definition: population genetics is the study of the frequencies of alleles in

populations as well as their temporal and spatial changes

Polymorphism in DNA: Sanger sequencing

Polymorphism in DNA: Illumina NGS sequencing

www.seqanswers.com





www.seqanswers.com


Polymorphism in DNA: new new generation…

Polymorphism in DNA: how it looks like

Some definitions

Coding / non-coding DNA

Some definitions

For the coding DNA: start codon, stop codon

Some definitions: point mutations

Point mutation = change in the DNA base (e.g. T becomes G)

Insertions – deletions = removal or insertion of bases

exercise 1.1 and 1.2

use DNASP and the file 055twolines.fas

Some definitions: point mutations

Consequences on the protein:

Synonymous mutations: do not change the codon and the Amino Acid

Non-synonymous mutations: change the Amino Acid

Silent sites = non-coding regions + synonymous sites

- Frameshift mutation: change reading frame (due to indels)

- Nonsense mutation: stop codon is introduced

Positions in the codon:

-A position is fourfold degenerate if any nucleotide change specifies the same AA (only 3rd position of a codon) ex: Glycine codons

-Twofold degenerate if two our of 4 changes specifies the same AA (ex 3rd position of Glutamic Acid)

-Threefold degenerate? 3rd position Isoleucine

-Non-degenerate: any change specifies a different AA

Practical exercise

Using database to find info on sequences

Exercise: Download the file: data-plants.fas

1) Open it with DNASP. What do you see?

2) Go to www.ncbi.nih.gov to the BLAST tools

3) Look at the options, these are plant sequences. Can you retrieve where the

sequences are from?

4) You will be directed to the results of the BLAST: lets look at them by moving the

cursor on the lines. Scores for aligments.

5) What are the best hits? Then you will be directed to the GenBank directory of

sequences. What information are there for these sequences?

Practical exercise

Practical exercise

Practical exercise

Then we will insert information from the database into the sequences in DNASP.

Place the coding region using the: data -> assign coding regions

Then you see changes in the way DNASP shows the sequences (see example)

Open also in Mesquite

Can you find out how many changes are silent, synonymous or non-synonymous?

How many SNPs are there? (Single Nucleotide Polymorphisms = mutational changes)

Different types of data

� Patterns of diversity can be observed in populations, species or among species.

� Phylogenetic trees fall in the between species comparison class

� Type of data: DNA sequences (SNPs), proteines sequences, microsatellites

� Microsatellites are short stretches of repeated DNA: TATATATATATATA

� What matters is the number of motif repeats

� One can look at their size using electrophoresis gel, but they contain less

information than SNP data.

� Their mutation rate is also higher due to the ripping of the Polymerase on them.

Questions to ask when looking at data

-Are the sequences already aligned?

-Are the data from one population or more than one? Or different species?

-Are the data from sexually or asexually reproducing organisms?

-Are the sequences from coding or non-coding, or both DNA?

-Are the data from one locus or several loci?

-Do we see all sites or only the variable ones (SNP, indels or both)?

-Do we see all sequences or only the different ones?

-Are the data from microsatellites or SNPs?

=> Go to Exercise 1.3 for also the Solanum data

Population genetics: 4 evolutionary forces

random genomic processes

(mutation, duplication, recombination, gene conversion)

natural

selection

random demographic

process (drift)

random spatial

process (migration)

molecular diversity

Population genetics investigates the laws governing the genetic structure of

populations, and changes in allele frequencies over time

Population genetics: 4 evolutionary forces

random genomic processes

(mutation, duplication, recombination, gene conversion)

natural

selection

random demographic

process (drift)

random spatial

process (migration)

molecular diversity

We want to infer the role of the evolutionary forces from sequence data

(very useful tool is the coalescent theory)

phenotypic

variability

Divergence and mutation rate

Molecular clock

� when DNA is passed from one generation to the next there is a constant

probability called µ that a mutation occurs

� because the polymerase is not error free

� we assume the rate is constant (though over long periods of time this may not be

true)

� Probability of mutation of what?

� at a site (per site mutation rate)

� at an entire locus (locus mutation rate)

� genome wide mutation rate

� µ is the probability per generation of a mutation, and (1- µ) is the probability that

no mutation occurs

Molecular clock

� thus P[no mutation for t generations] = (1- µ)t

� as long as µ << t we can use an approximation:

1 + x ≈ e x and 1 - x ≈ e -x

� result: P[no mutation for t generations] = e –µt graph of the exponential?

Molecular clock

� Maths 1 : Probability, Expectation and Variance,

exponential distribution

Molecular clock

� T is the time to the next mutation

� P[no mutation for t generations] = P[T ≥ t ] = e –µt has

� Separated by at least 2t generations, and split between Drosophila populations

= 100 000 generations (10 000 years)

2

1 1[ ] and [ ]E T Var T

µ µ= =

Molecular clock

� D is the number of polymorphic sites between the sequences, also called the

divergence

� K mutations have appeared along the branches of the two descendants populations,

which have length of 2t

Molecular clock

� Maths 2: estimator

� assuming K = D

� thus E[D] = E[K] = 2µt

� so we obtain a first estimator for µ:

� So we can calculate the mutation rate if we know the time of divergence and

the divergence from molecular data!!!!

� Exercise 1.4: calculate the mutation rate in Drosophila

ˆ2

D

tµ =

Documents

Evolutionary Genetics: Part 1 Polymorphism in DNA · Evolutionary Genetics: Part 1 Polymorphism in DNA S. peruvianum S. chilense ... Separated by at least 2t generations, and split