Chap. 6. Molecular Phylogeny

Charles Darwin, 1859 Natural selection

Evolution Change in frequency of genes in a population

Heritable changes in a population over many generations

Process of mutation with selection

Two essential factors that define evolution

Error-prone self-replication

Variation in success at self-replication

Evolution

Self-replication: Whatever is evolving must have the ability to make copies of itself

Typical development, aging, etc. are not evolution

Genes can self-replicate in the context of the cells that they reside in

A “replicator” can self-replicate

Asexual organisms like bacteria can self-replicate

Sexual organisms can replicate, but offspring inherit from both parents

Dawkins focused on genes rather than organisms as the fundamental replicators

Error-prone Self-replication

Error-prone: Copies are not always identical to the originals

Perfect copies will not foster evolution

In fact, current genes arose through gradual changes from previous versions, each with slight errors

Errors are essential for evolution, provided they do not occur too frequently

Error-prone Self-replication

Cell replication

One double-stranded DNA is copied into two identical double-stranded DNAs

Each of the two daughter DNAs contains one mother strand (semi-conservative replication)

Replication step 1: Separate the two DNA strands at the origin of replication

Replication step 2: Synthesize DNA on both template strands at the same time

DNA polymerase catalyzes synthesis only in the 5’ to 3’ direction of the new chain

The original 3’-5’ strand is copied continuously (leading strand)

The original 5’-3’ strand is copied discontinuously, in pieces of 1000-2000 bp (lagging strand, Okazaki fragments)

Replication step 3: Proofread and repair

Detect mismatches, which occur about once in $10^{4}$ to $10^{5}$ bases

Mismatch repair in E. coli

(a) Newly synthesized DNA (red) has a mismatch (G-T).

(b) MutH, MutS, and MutL link the mismatch with the nearest methylation site (blue).

(c) An exonuclease removes the mismatched segment from the red strand.

(d) DNA polymerases replace it.

How to find the origin/termination site?

Chargaff parity rules (CPR), 1951

CPR I – double strands of DNA: # of A = # of T; # of C = # of G

Obvious from the complementary relationship

CPR II – single strand of DNA: the same equalities hold approximately

Cause is not known yet

Violation is called ‘skew’

GC skew: (G-C)/(G+C)

GC skew

The GC skew changes sign at the ori and ter sites, so the cumulative GC skew reaches its maximum or minimum there
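A minimal Python sketch of the GC skew calculation above, assuming the genome is available as a plain string of A/C/G/T. The function names, the window size, and the toy sequence are illustrative assumptions and do not come from the slides. The running sum is included because its turning points are easier to spot than sign changes in the raw skew.

```python
def gc_skew(segment):
    """GC skew of one segment: (G - C) / (G + C); 0.0 if the segment has no G or C."""
    segment = segment.upper()
    g = segment.count("G")
    c = segment.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(sequence, window):
    """GC skew in consecutive, non-overlapping windows of the given length."""
    return [
        gc_skew(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, window)
    ]


def cumulative_skew(skews):
    """Running sum of the windowed skews."""
    total = 0.0
    running = []
    for value in skews:
        total += value
        running.append(total)
    return running


# Toy sequence (hypothetical): a C-rich half followed by a G-rich half.
toy = "C" * 8 + "G" * 8
skews = windowed_gc_skew(toy, window=4)
running = cumulative_skew(skews)
turning_point = min(range(len(running)), key=running.__getitem__)
print(skews, running, turning_point)
```

On a real genome the window would be thousands of bases long, and the candidate ori/ter positions would be read off the extremes of the running sum.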

Oligomer skew

$f_i$: # of occurrences of oligomer $i$ in a segment

$OA_i = \ln(f_i / f_{i'})$

Most organisms can increase exponentially

If all organisms survived and multiplied at the same rate, there would be no change in the frequency of the variants, and thus no evolution

Growth is limited by food, space, predators, etc.

When population size is limited, not all variants survive

This opens the possibility of natural selection

Also, chance effects exist

Two variants that start at equal population sizes will not stay equal, even with the same degree of fitness

This chance effect is called random drift; eventually one variant takes over the whole population

This implies that evolution can occur even without natural selection, referred to as neutral evolution

Variation

Any change in a gene sequence that is passed on to offspring

Caused by

Damage to the DNA molecule (from radiation, etc.)

Errors in replication

Point mutation – the simplest form of mutation; occurs all over DNA sequences

Transition – mutation within purines (A, G) or within pyrimidines (C, T/U)

Transversion – mutation between the two nucleotide groups (purine to pyrimidine or the reverse)

Effects depend on where mutations occur

Non-coding region – no effect on proteins, and neutral

But may have significant effects if occurring in a control region

Coding region

Synonymous substitution – the mutation does not change the amino acid (AA)

Non-synonymous substitution – the AA is replaced by another, or a stop codon is introduced
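The two classifications above (transition vs. transversion, synonymous vs. non-synonymous) can be sketched in Python as follows. This is an illustration under stated assumptions: codons are written as DNA (T, not U), the standard genetic code is used, the two codons differ at exactly one position, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
PURINES = {"A", "G"}


def substitution_type(old, new):
    """Return 'transition' (within purines or within pyrimidines) or 'transversion'."""
    if old == new:
        raise ValueError("bases are identical")
    same_group = (old in PURINES) == (new in PURINES)
    return "transition" if same_group else "transversion"


def classify_codon_change(codon, mutant):
    """Classify a single-base change between two DNA codons."""
    differences = [(a, b) for a, b in zip(codon, mutant) if a != b]
    if len(codon) != 3 or len(mutant) != 3 or len(differences) != 1:
        raise ValueError("expected two codons that differ at exactly one position")
    old_aa, new_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    if old_aa == new_aa:
        effect = "synonymous"
    elif new_aa == "*":
        effect = "non-synonymous (stop codon introduced)"
    else:
        effect = "non-synonymous (AA replaced)"
    return substitution_type(*differences[0]), effect


print(classify_codon_change("GAA", "GAG"))  # third-position A to G, Glu stays Glu
print(classify_codon_change("GAA", "GTA"))  # second-position A to T, Glu becomes Val
print(classify_codon_change("TGG", "TGA"))  # Trp codon becomes a stop codon
```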

Mutation

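Models of nucleotide substitution

[Figure: the four nucleotides A, G, T, C; A↔G and T↔C are transitions, and changes between a purine and a pyrimidine are transversions]

Jukes and Cantor one-parameter model of nucleotide substitution (α = β, where α is the transition rate and β the transversion rate)

Kimura two-parameter model of nucleotide substitution (assumes α ≠ β)

Models: Jukes-Cantor (JC), Kimura 2P, Tamura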
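As a hedged illustration of how the first two models are used in practice, the sketch below computes the standard Jukes-Cantor and Kimura two-parameter distance corrections from two aligned sequences. The function names and the toy sequences are assumptions made for this example; the Tamura model is not covered, and the alignment is assumed to be gap-free and to contain only A, C, G, T.

```python
import math

PURINES = {"A", "G"}


def transition_transversion_fractions(seq1, seq2):
    """Fractions of aligned sites showing a transition (P) and a transversion (Q)."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    return transitions / len(seq1), transversions / len(seq1)


def jc69_distance(p):
    """Jukes-Cantor distance from the observed fraction p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(P, Q):
    """Kimura two-parameter distance from transition (P) and transversion (Q) fractions."""
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


# Toy alignment (hypothetical): one transition (A/G) and one transversion (A/C) in 10 sites.
P, Q = transition_transversion_fractions("AAAAAAAAAA", "GAAAAAAAAC")
print(P, Q, jc69_distance(P + Q), k2p_distance(P, Q))
```

Both corrections return values slightly larger than the raw fraction of differing sites, because they account for multiple substitutions at the same site.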

Indel mutation: Small indels of a single base or of a few bases are frequent

Caused by slippage during DNA replication

Particularly frequent with repeated sequences

GCGC…: slight slippage causes insertion of an extra GC or a deletion

CAG repeated region in huntingtin protein can expand, causing Huntington’s disease

Indels can cause a frame shift if their length is not a multiple of three

Gene inversion: Whole genes are copied to offspring in reverse direction

Translocation: Whole genes can be deleted from one genome and inserted into another

Mutation

Orthologs: members of a gene (protein) family in various organisms. This tree shows globin orthologs.

Mutation Example

Paralogs: members of a gene (protein) family within a species. This tree shows human globin paralogs.

Globin phylogeny by Dayhoff (1972)

Globin phylogeny by Dayhoff in evolutionary time (1972)

Mature insulin consists of an A chain and B chain heterodimer connected by disulfide bridges

The signal peptide and C peptide are cleaved, and their sequences display fewer functional constraints.

Note the sequence divergence in the disulfide loop region of the A chain

Historical background: insulin

By the 1950s, it became clear that amino acid substitutions occur nonrandomly.

For example, Sanger and colleagues noted that most amino acid changes in the insulin A chain are restricted to a disulfide loop region.

Such differences are called “neutral” changes (Kimura, 1968; Jukes and Cantor, 1969)

Subsequent studies at the DNA level showed that the rate of nucleotide (and of amino acid) substitution is about six- to ten-fold higher in the C peptide, relative to the A and B chains.

[Figure: number of nucleotide substitutions/site/year for regions of the insulin gene; labeled values $0.1 \times 10^{-9}$, $0.1 \times 10^{-9}$, and $1 \times 10^{-9}$]

Surprisingly, insulin from the guinea pig (and from the related coypu) evolves seven times faster than insulin from other species. Why?

The answer is that guinea pig and coypu insulin do not bind two zinc ions, while insulin molecules from most other species do. There was a relaxation of the structural constraints on these molecules, and so the genes diverged rapidly.

Historical background: insulin

Guinea pig and coypu insulin have undergone an extremely rapid rate of evolutionary change

Arrows indicate positions at which guinea pig insulin (A chain and B chain) differs from both human and mouse

In the 1960s, sequence data were accumulated for small, abundant proteins such as globins, cytochromes c, and fibrinopeptides. Some proteins appeared to evolve slowly, while others evolved rapidly.

Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock:

Molecular clock hypothesis

For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages

[Figure: corrected amino acid changes per 100 residues (m) plotted against millions of years since divergence. Dickerson (1971)]

If protein sequences evolve at constant rates, they can be used to estimate the times that sequences diverged. This is analogous to dating geological specimens by radioactive decay.
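The dating idea can be sketched in a few lines of Python. The sketch assumes a strict clock in which two sequences that split t years ago accumulate changes along both lineages, so that the corrected distance is d = 2rt; that relation is a standard assumption and is not spelled out on the slide. The function names and the distance value are hypothetical; the rate of 1 x 10^-9 substitutions/site/year is one of the values quoted for insulin earlier in this chapter.

```python
def divergence_time(distance_per_site, rate_per_site_per_year):
    """Years since two sequences diverged, assuming d = 2 * r * t (both lineages evolve)."""
    return distance_per_site / (2.0 * rate_per_site_per_year)


def rate_from_calibration(distance_per_site, years_since_divergence):
    """Substitution rate implied by a pair whose divergence time is known (e.g. from fossils)."""
    return distance_per_site / (2.0 * years_since_divergence)


# Hypothetical example: a corrected distance of 0.02 substitutions/site
# at a rate of 1e-9 substitutions/site/year.
years = divergence_time(0.02, 1e-9)
print(f"estimated divergence: {years:.3g} years ago")
```

In practice the rate is first calibrated on a pair with a known divergence time and then applied to pairs whose divergence time is unknown.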

Molecular clock hypothesis: implications

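[Figure: two example trees. The first has nodes labeled A to I with branch lengths (1, 2, or 6) drawn along a time axis; the second has tips labeled A to E with branch lengths and a scale bar marking one unit]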

Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.

Population Genetics

Genealogical Tree

Evolutionary tree of a gene without recombination (mtDNA, Y chromosome)

Given the current generation, one can trace back to a single copy of the gene – the coalescence process

Example: Human mtDNA is traced back to an African woman 200,000 years ago (1996)

Coalescence Model

Assumptions

Constant population of N throughout time

Each individual is equally fit (same expected number of offspring) – equally likely to have any of the individuals in the previous generation as mother

Pick two individuals in the present generation

Prob. of having the same mother = 1/N

Prob. that their most recent common ancestor lived T generations ago

$P(T) = (1 - 1/N)^{T-1}\,(1/N) \approx e^{-T/N}/N$

The time to coalescence of the lines of descent of any two individuals is approximately exponentially distributed, with a mean time until coalescence of $N$ generations
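A short numerical check of the formula above, written as a Python sketch. The function names and the values of N and T are arbitrary choices for illustration. It compares the exact probability with the exponential approximation and confirms that the mean coalescence time is N generations.

```python
import math


def coalescence_prob(T, N):
    """Exact P(T): no shared mother for T-1 generations, then a shared mother."""
    return (1.0 - 1.0 / N) ** (T - 1) * (1.0 / N)


def coalescence_prob_approx(T, N):
    """Exponential approximation exp(-T/N) / N, valid for large N."""
    return math.exp(-T / N) / N


def mean_coalescence_time(N, max_T):
    """Mean of T under the exact distribution, summed up to max_T generations."""
    return sum(T * coalescence_prob(T, N) for T in range(1, max_T + 1))


N = 100
print(coalescence_prob(50, N), coalescence_prob_approx(50, N))
print(mean_coalescence_time(N, max_T=10_000))  # close to N generations
```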

Coalescence

Mitochondrial Eve

Used the highly variable non-coding part, called the D-loop

The average # of sites that differ: 61.1 out of 16,553 bases

The average pairwise difference is 76.7 between Africans, and 38.5 between non-Africans

Divergent populations have existed in Africa for much longer

A relatively small population left Africa and spread through the rest of the world

The earliest branch point – 170,000 ± 50,000 years ago

Non-African migration – 52,000 ± 27,000 years ago

Purple/Green – all Africans

Yellow/blue – non-Africans
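The "average pairwise difference" quoted for the mtDNA data is the mean number of differing sites over all pairs of sequences in a group. A minimal Python sketch of that calculation follows; the function names and the three toy sequences are invented for illustration and are not data from the study.

```python
from itertools import combinations


def count_differences(seq1, seq2):
    """Number of aligned sites at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def mean_pairwise_difference(sequences):
    """Average of count_differences over every unordered pair of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(count_differences(a, b) for a, b in pairs) / len(pairs)


# Toy aligned sequences (hypothetical).
group = ["AAAA", "AAAT", "AATT"]
print(mean_pairwise_difference(group))
```

Computing this separately for each group gives the kind of comparison made above, where a larger average indicates lineages that have been diverging for longer.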

Fixation in Neutral Model

Mutation 1 does not survive to the present generation

Mutation 2 has a chance to spread to the entire population (become fixed)

Most mutations die out

If a mutation is neutral, what is the prob. of becoming fixed, $p_{fix}$?

Assume N copies of a gene and that each one is equally likely to mutate

Prob. that the mutation occurred in the gene copy that is the ancestor of the present generation is $p_{fix} = 1/N$

A new mutation takes place with prob. $u$ per gene copy

The rate of fixation of new mutations is the rate at which mutations occur, multiplied by the prob. that each mutation is fixed:

$u_{fix} = (N u)\, p_{fix} = u$

This shows that the rate of fixation of neutral mutations is equal to the underlying mutation rate and is independent of the population size

Fixation in Neutral Model

The number of copies of a mutation in the population changes on a random basis

If there are $m$ copies of a neutral mutant sequence in one generation, the number of copies in the next generation is $n \approx m$

Wright-Fisher model: each copy of the gene in the next generation is randomly selected from the genes in the previous generation

Prob. that a selected copy is the mutant: $a = m/N$; prob. that it is not: $1 - a$

Prob. of $n$ mutant copies in the next generation: $p(n) = \binom{N}{n} a^{n} (1-a)^{N-n}$

The mean value: $Na = m$

[Figure: simulation with $N = 200$ over 2,000 generations]
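A simulation of this kind can be sketched in Python as follows. The function names, the random seed, and the starting count of 100 mutant copies are assumptions made for the example; only N = 200 and the 2,000-generation horizon come from the slide. Each generation draws the new mutant count from the binomial distribution given above by sampling N copies one at a time.

```python
import random


def wright_fisher_step(m, N, rng):
    """Next-generation mutant count: each of N copies is a mutant with prob. a = m/N."""
    a = m / N
    return sum(1 for _ in range(N) if rng.random() < a)


def simulate(m0, N, generations, rng):
    """Mutant counts per generation, stopping early at loss (0) or fixation (N)."""
    trajectory = [m0]
    for _ in range(generations):
        m = trajectory[-1]
        if m == 0 or m == N:
            break
        trajectory.append(wright_fisher_step(m, N, rng))
    return trajectory


def fixation_fraction(m0, N, runs, rng, max_generations=10_000):
    """Fraction of independent runs in which the mutant reaches all N copies."""
    fixed = sum(
        1 for _ in range(runs) if simulate(m0, N, max_generations, rng)[-1] == N
    )
    return fixed / runs


rng = random.Random(1)
trajectory = simulate(m0=100, N=200, generations=2000, rng=rng)
print(len(trajectory) - 1, "generations simulated; final mutant count:", trajectory[-1])
```

Under neutrality the fraction of runs ending in fixation should be close to the starting frequency m0/N, which reduces to the 1/N result for a single new mutation.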
