

rSNP's: Evolution Caught in the Act

Michael J.T. O'Kelly
6.085 Final Project Presentation

In this talk:
● First, some background science: yeast ribosomal DNA and the repeat Single Nucleotide Polymorphism model
● How to get rSNP data from a shotgun DNA database
● Improving the data with quality scores
● Inferring recombination dynamics from rSNP statistics

Background: Ribosomal DNA
● Yeast rDNA consists of ~150 identical* copies of a 9.1 kbp sequence encoding several ribosomal RNAs.
● Mutation strikes only one repeat at a time. Recombination either duplicates or eliminates neutral mutations, homogenizing the rDNA.
*as far as anyone knows or cares, so far

● Repeats are gained or lost about every 30 generations, through several recombinatory mechanisms (illustrated at left).
● Mutation in the rDNA array occurs about every 1,000 generations.
● A repeat Single Nucleotide Polymorphism (rSNP) is a mutation shared by only a fraction of the rDNA repeats in a particular yeast strain.
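To make the homogenizing effect concrete, here is a minimal sketch under an assumed toy model: the array stays at 150 repeats, and each recombination event overwrites one randomly chosen repeat with a copy of another. That model and the function name `simulate_rsnp_fate` are illustrative assumptions, not the mechanisms shown on the slide.

```python
import random

def simulate_rsnp_fate(n_repeats=150, max_events=1_000_000, seed=0):
    """Follow one neutral mutation until it is lost or fills the array.

    Assumed toy model (not from the talk): each recombination event
    overwrites one randomly chosen repeat with a copy of another.
    Returns (final_copies, events_used).
    """
    rng = random.Random(seed)
    copies = 1  # a mutation strikes only one repeat at a time
    for event in range(1, max_events + 1):
        frac = copies / n_repeats
        donor_is_mutant = rng.random() < frac
        target_is_mutant = rng.random() < frac
        if donor_is_mutant and not target_is_mutant:
            copies += 1  # recombination duplicated the mutation
        elif target_is_mutant and not donor_is_mutant:
            copies -= 1  # recombination eliminated one copy
        if copies == 0 or copies == n_repeats:
            return copies, event
    return copies, max_events

if __name__ == "__main__":
    outcomes = [simulate_rsnp_fate(seed=s)[0] for s in range(200)]
    print("lost:", outcomes.count(0), "fixed:", outcomes.count(150))
```

In this sketch most new mutations vanish after a handful of events and a rare one spreads to every repeat. While a mutation is in transit between those two end states, it would be observed as an rSNP.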

Finding rSNP's
Example: GATACATGTCTTGATAATGT

Let's use BLAST to align shotgun fragments, with a sliding window along the entire consensus rDNA sequence.

● The shotgun DNA library's coverage is ~170x for rDNA.
● Align all shotgun sequences that agree (mostly) with the target.
● Base pairs that deviate from the target in every aligned read are conventional Single Nucleotide Polymorphisms.
● Base pairs that deviate in only some of the aligned reads are probably repeat Single Nucleotide Polymorphisms.

ttttctggctcattgatagattgttGATACATTTCTTGATAATGTtgcatatcagtaacg
                    ttgttGATACATTTCTTGATAATGTtgcatatcagtaacgtaacc
                        tGATACATTTCTTGATAATGTtgcatatcagtaa
 tttctggctcattgatagattgttGATACATTTCTTGATCATGT
                       ttGATACATTTCTTGATAATGTtgcatatcagtaac
                 agattgttGATACATTTCTTGATAATGTtgcatatcagt
        ctcattgatagattgttGATACATTTCTTGATAATGTtgcatatcagtaac
               atagattgttGATACATTTCTTGATAATGTtgcatatcagtaacgtaaccctt
  ttctggctcattgatagattgttGATACATTTCTTGATAATGTtgcatatcagtaac
        ctcattgatagattgttGATACATTTCTTGATAATGTtgcata
 tttctggctcattgatagattgttGATACATTTCTTGATAATGTtgcatatcagtaacgtaac
           attgatagattgttGATACATTTCTTGATAATGTtgcatatcagtaacgtaaccc
                  gattgttGATACATTTCTTGATCATGTtgcatatcagtaacgtaaccc
                       ttGATACATTTCTTGATAATGTtgcatatcagtaacgt
           attgatagattgttGATACATTTCTTGATAATGTt
       gctcattgatagattgttGATACATTTCTTGATAATGTtgcatat
                       ttGATACATTTCTTGATAATGTtgcatatcagtaacgtaaccctt
         tcattgatagattgttGATACATTTCTTGATAATGTtgcatatcagtaacgtaaccctt
   tctggctcattgatagattgttGATACATTTCTTGATAATGTtgcatatcag
                  gattgttGATACATTTCTTGATAATGTtgcatatcagtaacgtaacccttg
                       ttGATACATTTCTTGATAATGTtgcatatcag
              gatagattgttGATACATTTCTTGATCATGTtgcat
     tggctcattgatagattgttGATACATTTCTTGATAATGTtgcatatcagt
        ctcattgatagattgttGATACATTTCTTGATCATGTtgcatatcagtaa
                    ttgttGATACATTTCTTGATAATGTtgcatatcagt
     tggctcattgatagattgttGATACATTTCTTGATAATGTtgcatatcagt
ttttctggctcattgatagattgttGATACATTTCTTGATAATGTtgcat
              gatagattgttGATACATTTCTTGATAATGTtgcatatcagtaacgtaaccctt
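The tally behind the alignment above can be sketched as follows. It assumes the 20 bp window has already been cut out of each aligned read (for example from BLAST output); the read counts are transcribed from the slide. The function names and the `high` cutoff are illustrative assumptions.

```python
from collections import Counter

TARGET = "GATACATGTCTTGATAATGT"  # example window from the consensus rDNA

# Windows as transcribed from the slide's alignment:
# 24 reads carry A at the variable site, 4 reads carry C.
WINDOWS = ["GATACATTTCTTGATAATGT"] * 24 + ["GATACATTTCTTGATCATGT"] * 4

def disagreement_ratios(target, windows):
    """Fraction of aligned reads that differ from the target at each position."""
    ratios = []
    for ref_base, column in zip(target, zip(*windows)):
        counts = Counter(column)
        ratios.append(1 - counts[ref_base] / len(column))
    return ratios

def classify(ratio, high=0.95):
    """Label a position; the 0.95 cutoff is an assumed, tunable value."""
    if ratio == 0:
        return "match"
    if ratio >= high:
        return "SNP"
    return "candidate rSNP"

if __name__ == "__main__":
    for pos, ratio in enumerate(disagreement_ratios(TARGET, WINDOWS), start=1):
        label = classify(ratio)
        if label != "match":
            print(pos, f"{ratio:.2f}", label)
```

With these windows, position 8 disagrees in every read (a conventional SNP) and position 16 disagrees in 4 of 28 reads (a candidate rSNP). Positions with very low disagreement may only reflect sequencing error, which is what the quality-score filter addresses.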

SNP & rSNP map for one yeast strain

The disagreement ratio shows some base pairs with 100% disagreement (conventional SNPs), some with moderate disagreement (candidate rSNPs), and many points of low disagreement that are probably spurious.

Total coverage varies from 25x to 150x.

Using Quality Scores to evaluate correctness of disagreement

Quality score: n = 0 to 60 represents the reliability of each nucleotide determination (higher is more reliable).

Let's reject all base calls with a quality score below 30.

Then C (quality 60) is accepted as a probable rSNP, but G (quality 26) is rejected.

G  A  T  A  C  A  T  T  T  C  T  T  G  A  T  A  G  T  G  T
56 58 52 56 31 60 56 60 33 31 42 58 55 35 32 57 26 38 60 60

G  A  T  A  C  A  T  T  T  C  T  T  G  A  T  A  A  T  G  T
44 51 50 58 28 60 40 50 46 42 50 60 53 41 55 36 52 30 60 60

G  A  T  A  C  A  T  T  T  C  T  T  G  A  T  A  A  T  G  T
47 33 59 47 58 39 60 60 58 38 51 58 43 51 42 33 53 58 51 59

G  A  T  A  C  A  T  T  T  C  T  T  G  A  T  C  A  T  G  T
57 48 48 27 36 30 59 52 58 60 56 58 59 34 40 60 37 36 60 59

G  A  T  A  C  A  T  T  T  C  T  T  G  A  T  A  A  T  G  T
59 55 56 60 45 43 55 45 31 59 59 60 44 60 60 44 59 59 52 60

$P_{\text{error}} = 10^{-n/10}$
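The filter can be sketched in a few lines, using the slide's rule (reject scores worse than 30) and the error formula above. The two reads and their scores are transcribed from the slide, and the consensus is taken as the majority window from the alignment. The function names are illustrative assumptions.

```python
CONSENSUS = "GATACATTTCTTGATAATGT"  # majority window among the aligned reads

# Two reads from the slide, each with one base that disagrees.
READ_G = ("GATACATTTCTTGATAGTGT",
          [56, 58, 52, 56, 31, 60, 56, 60, 33, 31,
           42, 58, 55, 35, 32, 57, 26, 38, 60, 60])
READ_C = ("GATACATTTCTTGATCATGT",
          [57, 48, 48, 27, 36, 30, 59, 52, 58, 60,
           56, 58, 59, 34, 40, 60, 37, 36, 60, 59])

def error_probability(n):
    """Probability that a base call with quality score n is wrong."""
    return 10 ** (-n / 10)

def disagreeing_calls(bases, scores, consensus, min_quality=30):
    """List (index, base, score, accepted) for bases that differ from consensus."""
    calls = []
    for index, (base, score, ref) in enumerate(zip(bases, scores, consensus)):
        if base != ref:
            calls.append((index, base, score, score >= min_quality))
    return calls

if __name__ == "__main__":
    for bases, scores in (READ_G, READ_C):
        for index, base, score, accepted in disagreeing_calls(bases, scores, CONSENSUS):
            verdict = "accepted" if accepted else "rejected"
            print(base, "at index", index, "quality", score,
                  f"P_error={error_probability(score):.1e}", verdict)
```

A cutoff of 30 corresponds to an error probability of at most one in a thousand per accepted base call.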

SNP & rSNP map after Quality Score filter

Quality-filtered coverage is nearly as high as total coverage. Most base pairs that disagreed in only one alignment had low quality scores.

rSNP fingerprints of all yeast strains

(Peak heights exaggerated for visibility.) Partial and full peaks line up for many strains.

Aggregate rSNP distribution observed in shotgun database

●For all rSNPs we estimate the fraction of repeats containing each variant letter

●rSNPs tend to be observed in a small minority of repeats, rather than in 50/50 ratios.

●The number of base-pairs having rSNPs, and the fraction of repeats containing each rSNP, are influenced by the underlying dynamics of recombination.

What models best predict the observed distribution?

Uniform vs. Non-Uniform Recombination Models

● Non-uniform recombination results in a more peaked rSNP distribution than the uniform model.
● Posterior probability of the non-uniform model is higher by far.
● Shotgun analysis and labwork agree: recombination in rDNA is non-uniform!

[Figure panels: observed rSNP distribution; expected distributions under each model]
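One way to compare models by posterior probability is sketched below. It assumes the rSNP fractions are binned into a histogram and each model supplies an expected probability per bin. The multinomial likelihood, the equal priors, and every number in the example are placeholders invented for illustration. None of them are the talk's data or necessarily its method.

```python
import math

def log_multinomial_likelihood(counts, probs):
    """Log-likelihood of bin counts given bin probabilities.

    The multinomial coefficient is dropped because it is identical for
    every model. Assumes every bin with a nonzero count has probs > 0.
    """
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

def posterior_probabilities(counts, models, priors=None):
    """Posterior probability of each model (equal priors unless given)."""
    names = list(models)
    if priors is None:
        priors = {name: 1 / len(names) for name in names}
    log_post = {
        name: math.log(priors[name]) + log_multinomial_likelihood(counts, models[name])
        for name in names
    }
    peak = max(log_post.values())  # subtract the max for numerical stability
    weights = {name: math.exp(value - peak) for name, value in log_post.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Placeholder inputs, invented for illustration only.
TOY_COUNTS = [40, 10, 5, 3, 2]  # rSNPs per bin of minority fraction
TOY_MODELS = {
    "uniform": [0.30, 0.25, 0.20, 0.15, 0.10],
    "non-uniform": [0.65, 0.18, 0.09, 0.05, 0.03],
}

if __name__ == "__main__":
    print(posterior_probabilities(TOY_COUNTS, TOY_MODELS))
```

In practice the per-bin expectations would come from simulating each recombination model, and the counts from the quality-filtered rSNP map.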