An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA


Page 1: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

An Introduction to Sequence Variation

Chris Lee

Dept. of Chemistry & Biochemistry

UCLA

Page 2: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Types of Polymorphism

• Single nucleotide polymorphisms (SNP) constitute about 90% of polymorphisms.

• Insertions, deletions.

• Microsatellite repeats: a locus where different numbers of copies of a short repeat sequence are found in different people.

• Gross genetic losses or rearrangements.

Page 3: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Large-scale Polymorphism

Page 4: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Single Nucleotide Polymorphisms

• Any two people differ at about 1 in 1,000 letters (nucleotides).

• SNPs responsible for human individuality!

• Some SNPs cause human diseases (e.g. cancer, cystic fibrosis, Alzheimer’s).

• Enormous efforts have been made to identify specific mutations that cause disease.

Page 5: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Single Nucleotide Polymorphism

Page 6: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

A mutation can arise from something as simple as the loss of a single chemical group from one nucleotide base, e.g. the amino group of cytosine (deamination, which converts cytosine to uracil).

Page 8: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Creating a Mutation

Page 9: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Genomic Density of SNPs

• Comparing two random chromosomes, expect one SNP per 1000 bp.

• Comparing 40 people (2 chromosomes each), expect 17 million SNPs in the complete human genome (3 billion bp).

• In coding region (5% of genome) expect 500,000 cSNPs, perhaps 6 per gene.
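The step from one SNP per 1000 bp between two chromosomes to millions of SNPs among 80 chromosomes can be roughly checked with Watterson's formula for the expected number of segregating sites, S = θ · L · Σ 1/i (i = 1 … n-1). The slide does not say how its figure was derived, so this is only a sketch under the standard neutral model; the function name and parameters are illustrative. It lands in the same range as the 17 million quoted above rather than reproducing it exactly.

```python
def expected_snps(theta_per_bp, genome_bp, n_chromosomes):
    """Watterson's expectation for the number of segregating sites."""
    # Harmonic sum over 1 .. n-1 grows only logarithmically with sample size.
    harmonic = sum(1.0 / i for i in range(1, n_chromosomes))
    return theta_per_bp * genome_bp * harmonic

pairwise = expected_snps(1 / 1000, 3e9, 2)   # two random chromosomes
sample = expected_snps(1 / 1000, 3e9, 80)    # 40 people, 2 chromosomes each
print(f"pairwise: {pairwise:.2e}  sample of 80: {sample:.2e}")
```

Because the harmonic sum grows slowly, adding more people mostly uncovers rarer SNPs rather than multiplying the count.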

Page 10: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNPs: a detailed record of human genetic history

• Each SNP is typically a single mutation event that occurred in a context of certain pre-existing SNPs.

• As time passes this context is gradually lost due to recombination.

[Figure: along a time axis, SNP C is initially created linked to SNPs A, B, D, E, F...; the “island” of linkage then shrinks, leaving only B, C, D.]
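The shrinking island can be put in rough numbers. A minimal sketch, assuming a constant recombination rate of 1e-8 per bp per generation (a commonly quoted genome-wide average, not a value from the slides) and no other forces: the chance that two sites remain unseparated decays exponentially with distance times elapsed generations. Both function names are hypothetical.

```python
import math

def linkage_retained(distance_bp, generations, recomb_per_bp=1e-8):
    """Probability that no recombination has separated two sites."""
    return math.exp(-recomb_per_bp * distance_bp * generations)

def island_half_width(generations, recomb_per_bp=1e-8, threshold=0.5):
    """Distance from the SNP at which retained linkage falls to `threshold`."""
    return -math.log(threshold) / (recomb_per_bp * generations)

for generations in (100, 1000, 10000):
    print(generations, round(island_half_width(generations)), "bp")
```

Under these assumptions the island width is inversely proportional to age, which is why a small island suggests an old SNP.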

Page 11: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

A record of the origins, migrations, and mixing of the world’s peoples

• The size of the “island” of strong linkage around a SNP indicates its age (small = old).

• The SNPs it is linked to give a “genetic fingerprint” of the person in whom it originated.

• In principle each SNP can be used to track all of that person’s descendants.

• Each person has 300,000 common SNPs: a very rich record of their genetic history.

Page 12: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNPs in lipoprotein lipase (LPL) gene.

Page 13: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP genotypes in 71 individuals in the LPL gene

heterozygote (X/y)

homozygote (y/y)

Page 14: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP Allele Frequency

Page 15: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP Haplotypes reconstructed from LPL genotype data

Page 16: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP Linkage Disequilibrium

Page 17: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

The Hunt for Disease Genes

• Currently: finding a disease gene can take years, because there are very few markers, forcing researchers to search dozens of genes.

• SNPs are a powerful tool for discovering genes that cause disease: with a SNP in every gene, one could map a disease directly to a single gene.

Page 18: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Mapping Disease Genes

[Figure: a chromosome with its genes, widely spaced microsatellite markers, densely spaced SNPs, and the disease gene.]

• Look for genetic linkage of disease to marker.

• Microsatellite markers are too widely spaced to get to the individual gene level.

• There are common SNPs in every gene.

Page 19: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Identification of SNPs

• In 1998, Wang et al. reported ~ 3000 SNPs.

• Currently about 200,000 SNPs have been identified in total by experiment (in public databases).

• A pharmaceutical industry SNP consortium has been formed to fund identification of 300,000 SNPs to be shared publicly.

Page 20: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNPs for Pharmacogenomics

• Differences in efficacy and side effects from person to person can be a big problem for drug clinical trials / approval.

• If SNPs that correlate with these differences can be identified, the clinical trial could be limited to patients where efficacy is likely to be best, with least side effects.

• Prospective patients would then also have to be tested for these SNPs before receiving the drug.

Page 21: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Single-nucleotide polymorphism in the human mu opioid receptor gene alters β-endorphin binding and activity: Possible implications for opiate addiction. Bond et al. PNAS 95:9608

The mu opioid receptor is the primary site of action for the most commonly used opioids, including morphine, heroin, fentanyl, and methadone.

The A118G variant receptor binds β-endorphin, an endogenous opioid that activates the mu opioid receptor, approximately three times more tightly than the most common allelic form of the receptor. Furthermore, β-endorphin is approximately three times more potent at the A118G variant receptor than at the most common allelic form in agonist-induced activation of G protein-coupled potassium channels.

Page 22: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Comprehensive EST Analysis of Single Nucleotide Polymorphism in the Human Genome

Chris Lee

Dept. of Chemistry & Biochemistry

UCLA

Page 23: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Targeting Functional Polymorphism via Expressed Sequences

• Only 5% of the human genome corresponds to “genes” that code for functional protein.

• Look for functional SNPs by targeting these gene sequence regions.

• Genes are “expressed” by transcription into mRNA, which is spliced, polyadenylated, and translated.

• Purify polyA-mRNA, make cDNA, sequence.

Page 24: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP Detection from ESTs

• 1.4 million Expressed Sequence Tag (EST) sequences, 300-500 bp, from 950 people.

• How to put together all the ESTs from the same gene, without mixing up related genes?

• How to distinguish sequencing errors (very common) from genuine Single Nucleotide Polymorphisms?

Page 25: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP Detection Approaches

• Experimentally: random sampling of DNA. Very expensive, slow.

• Computationally: find SNPs from existing experimental data. Sort out real SNPs from experimental sequencing errors. Difficult statistical and computational problems.

• This experimental data was sitting around for years...

Page 26: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Distinguishing SNPs from Sequencing Errors

The frequency and pattern in which a polymorphism is observed must rise above the rate of background, random error. Single-pass read sequences contain many errors, which complicate the reliable detection of SNPs. There are miscalls (N) and frequent letter duplications or losses in runs (repeats of a single letter). These non-uniform error rates are critical in assessing the statistical significance of candidate SNPs such as A (not in a run) vs. T (problematic because it involves a GG run).

[Figure: aligned reads with the two candidate SNPs, A and T, marked.]

Page 27: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

How to address this?

• Adopt rigorous statistical approach based on measured frequencies from very large data.

• Bayesian inference: carefully separate observations from hidden states you want to make inferences about.

• “Integrate out” all assumptions by considering all possible values of the assumptions.

• Explicitly measure degree of uncertainty in the predictions due to poor data, ambiguity.

Page 28: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Odds ratio: SNP model vs. sequencing error model

Consider both models: are the observations more consistent with a SNP or sequencing error?

$$\text{SNP score} = \frac{p(obs \mid SNP)}{p(obs \mid err)}$$

Page 29: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Error Model: treat True gene sequence as unknown

• Treat all sequences T as equally likely (before you consider the actual observations, i.e. the chromatograms).

• Sum error model probability over all possible T.

$$p(obs \mid err) = \sum_T p(obs \mid T)\, p(T)$$

Page 30: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP Model

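$$p(obs \mid SNP) = \sum_T \sum_{T^*} p(obs \mid T, T^*)\, p(T^* \mid T)\, p(T)$$

• Rather than summing SNP model probability over all possible T, T*, calculate the probability for a specific SNP T* in a specific consensus T, with p(T* | T) = 1/3:

$$p(obs \mid SNP) \approx \tfrac{1}{3}\, p(obs \mid T, T^*)\, p(T)$$

$$\text{SNP score} = \frac{\tfrac{1}{3}\, p(obs \mid T, T^*)}{\sum_T p(obs \mid T)}$$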
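To make the score on this slide concrete, here is a toy version for a single alignment column. It is a sketch only: it ignores alignment uncertainty and context-dependent error rates, uses one flat miscall rate, and assumes each read samples the variant allele with probability `variant_freq` (both parameters are hypothetical, as are the function names).

```python
def column_likelihood(reads, allele_probs, error_rate):
    """p(reads | alleles): each read samples an allele, then may be miscalled."""
    likelihood = 1.0
    for base in reads:
        p_base = 0.0
        for allele, weight in allele_probs.items():
            # A miscall is assumed equally likely to give any of the 3 other bases.
            p_call = 1 - error_rate if base == allele else error_rate / 3
            p_base += weight * p_call
        likelihood *= p_base
    return likelihood

def snp_score(reads, consensus, variant, error_rate=0.01, variant_freq=0.5):
    # SNP model for one specific (T, T*) pair, weighted by p(T*|T) = 1/3.
    p_snp = column_likelihood(
        reads, {consensus: 1 - variant_freq, variant: variant_freq}, error_rate
    ) / 3
    # Error model: sum over every possible true base T (the uniform p(T) cancels).
    p_err = sum(column_likelihood(reads, {t: 1.0}, error_rate) for t in "ACGT")
    return p_snp / p_err

print(snp_score("CCCCTTTT", "C", "T"))  # four reads of each allele
print(snp_score("CCCCCCCT", "C", "T"))  # a single discordant read
```

With these settings a lone discordant read scores below 1 (better explained as an error), while a balanced split scores far above 1.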

Page 31: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Sequencing error model

Treat individual observed sequences i as independent; treat alignment A (what errors occurred) as uncertain:

$$p(obs \mid T) = \prod_i p(obs_i \mid T) = \prod_i \sum_A p(obs_i, A \mid T)$$

Treat true gene sequence T as uncertain: sum over all possible T:

$$p(obs \mid err) = \sum_T p(T) \prod_i \sum_A p(obs_i, A \mid T)$$

Page 32: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Hidden Markov Model Discrimination of SNP vs. Error

The match states (M) of the profile are the equivalent of the true population sequence; the deletion (D), insertion (I) and emission probabilities are set to the observed frequencies of sequencing errors conditioned on local sequence context. The summed probability for the SNP model, versus the summed probability for the error-only model, yields an odds ratio for the SNP.

Page 33: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

To assess putative SNP, consider all alternative possibilities

• Sequencing error: calculate odds ratio SNP vs. error. Use PHRED score, local context.

• Orientation errors: ESTs reported backwards?

• Chimeras, mixed clusters: ESTs may not be properly clustered. Some ESTs chimeric?

• Alignments: all possible ways EST could have been emitted from true sequence T.

• “True” sequence: all possible T for the gene.
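The PHRED score mentioned in the first item is a per-base quality value defined as Q = -10 · log10(p_error), so it converts directly into the error probability an error model needs. A small sketch of the conversion (the function names are illustrative):

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with PHRED quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """PHRED quality corresponding to error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```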

Page 34: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP Model: “Local” allele frequency qz in one person

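$$p(obs_i \mid T, T^*, z) = \sum_A p(obs_i, A \mid T, T^*, q_z)$$

z = 0, 1, 2, …; q_z = z/N, where N = 2 typically

$$p(obs_{i \in L} \mid T, T^*, q) = \sum_{z=0,1,2} p(z \mid q) \prod_{i \in L} p(obs_i \mid T, T^*, z)$$

$$p(z \mid q) = \binom{N}{z} q^z (1-q)^{N-z}$$

(assuming Hardy-Weinberg)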
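The Hardy-Weinberg term is a binomial: the chance that a person carries z copies of the variant allele among N chromosomes, given population frequency q. A minimal sketch (the function name is illustrative):

```python
from math import comb

def genotype_prob(z, q, n=2):
    """p(z | q): probability of carrying z variant copies out of n chromosomes."""
    return comb(n, z) * q**z * (1 - q) ** (n - z)

q = 0.3
for z in range(3):
    print(z, genotype_prob(z, q))
```

For N = 2 this gives the familiar (1-q)², 2q(1-q), q² genotype proportions.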

Page 35: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Use Library information: which sequences are from same person!

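$$p(obs \mid T, T^*) = \int_0^1 \prod_L p(obs_{i \in L} \mid T, T^*, q)\; p(q)\, dq$$

$$= \int_0^1 \prod_L \sum_{z=0,1,2} \binom{2}{z} q^z (1-q)^{2-z} \prod_{i \in L} \sum_A p(obs_i, A \mid T, T^*, q_z)\; p(q)\, dq$$

Combine observations from all libraries L, and treat population allele frequency q as uncertain (so take the integral over q = (0,1)).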

Page 36: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Posterior probability for population allele frequency q

$$p(q \mid T, T^*, obs) = \frac{\prod_L p(obs_{i \in L} \mid T, T^*, q)\; p(q)}{\int_0^1 \prod_L p(obs_{i \in L} \mid T, T^*, q)\; p(q)\, dq}$$

Gives posterior distribution for q, taking into account all error rates in the observations, amount of sequence and library availability, ambiguities in the sequence, etc.
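The effect of library information can be illustrated with a grid approximation of this posterior. This is a simplified sketch, not the method's actual implementation: reads are reduced to variant / not variant, a single flat miscall rate replaces the alignment sum, the prior on q is flat, and the read counts are invented for illustration (six variant reads plus six hypothetical reference-only libraries).

```python
from math import comb

def read_prob(is_variant, z, error_rate, n=2):
    """p(read | z): a read samples one of n chromosomes, then may be miscalled."""
    q_z = z / n
    p_variant = q_z * (1 - error_rate) + (1 - q_z) * error_rate
    return p_variant if is_variant else 1 - p_variant

def library_likelihood(reads, q, error_rate, n=2):
    """p(obs_L | q): sum over the donor's hidden genotype z."""
    total = 0.0
    for z in range(n + 1):
        p_z = comb(n, z) * q**z * (1 - q) ** (n - z)  # Hardy-Weinberg
        p_reads = 1.0
        for is_variant in reads:
            p_reads *= read_prob(is_variant, z, error_rate, n)
        total += p_z * p_reads
    return total

def posterior_mean(libraries, error_rate=0.01, steps=1000):
    """Posterior mean of q on a uniform grid, assuming a flat prior."""
    grid = [i / steps for i in range(steps + 1)]
    weights = []
    for q in grid:
        w = 1.0
        for reads in libraries:
            w *= library_likelihood(reads, q, error_rate)
        weights.append(w)
    return sum(q * w for q, w in zip(grid, weights)) / sum(weights)

reference_only = [[False]] * 6                    # six one-read libraries
one_library = [[True] * 6] + reference_only       # 6 variant reads, one person
scattered = [[True]] * 6 + reference_only         # 6 variant reads, six people
print(posterior_mean(one_library), posterior_mean(scattered))
```

Six variant reads from one person can represent at most two chromosomes, so they support a lower population frequency than six reads from six different people.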

Page 37: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

6 SNP observations from one library

[Figure: plot for this case; horizontal axis from 0 to 1.]

Page 38: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

6 SNP observations scattered over all libraries

[Figure: plot for this case; horizontal axis from 0 to 1.]

Page 39: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Alignment Accuracy Challenges

• Automatic Multiple Sequence Alignment of 1000+ sequences is problematic.

• Alignment accuracy is much more of a problem for SNP detection than for simply getting the right consensus. Consensus merely requires that the majority be aligned, whereas even a single alignment error will result in an incorrect SNP prediction.

Page 40: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Sequencing Error Analysis

• We have produced a dataset of 400,000,000 bp where we have reliable consensus, and therefore can identify all the sequencing errors. This could provide “corrected” EST sequences, or alternatively consensus, assembled gene sequences for a large fraction of human genes.

• This also provides detailed statistics on the frequency of different types of sequencing errors, which show a startling variation depending on local sequence context. Background error rates of 0.3% substitution, 0.3% insertion, 0.7% deletion, rise dramatically

Page 41: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Example SNP: GGA C/T CAA

Cluster AA702884

C vs. T polymorphism

Novel SNP, not previously identified.

Page 42: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Automated SNP Detection

Pipeline steps:

• Input Unigene: 1,400,000 human ESTs, 300-500 bp long.

• Reorient ESTs: catch reversals, place in 5’ -> 3’ orientation. Word frequency based overlap & orientation detection; try all possible orientations.

• EST alignment: accuracy; predict gene consensus & SNPs.

• Statistical assessment of candidate SNPs.

Notes:

• 10-5000 ESTs per gene, 80,000 genes, 500-5000 bp long.

• >50,000 believable SNPs hidden among >10,000,000 sequencing errors.

• Don’t trust Unigene! Many errors in the reported data, e.g. reversals, in the majority of clusters!
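The slides do not spell out the word-frequency orientation method, so the following is only one plausible reading of the idea: count the k-letter words an EST shares with a reference in each orientation and keep the orientation with more shared words. The function names, the word length k = 8 and the example sequence are all assumptions.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def kmers(seq, k=8):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def orient(est, reference, k=8):
    """Return est in whichever orientation shares more words with reference."""
    ref_words = kmers(reference, k)
    forward = len(kmers(est, k) & ref_words)
    reverse = len(kmers(reverse_complement(est), k) & ref_words)
    return est if forward >= reverse else reverse_complement(est)

reference = "GAGGGCCCCACTCCCTTGCTGGCCCCAGCCCTGCTGGGGATCCCCGCCTGGCCAGGAGCAG"
est = reference[10:40]
flipped = reverse_complement(est)   # simulates an EST reported backwards
print(orient(flipped, reference) == est)
```

Word counting avoids a full alignment, which matters when every EST in a cluster has to be tried in both orientations.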

Page 43: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Sequence Alignment

Page 45: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Current Status: Results

• 400,000,000 bp aligned w/ reliable consensus.

• 83,000 consensus gene sequences produced.

• 20,000 show significant homology to known proteins, almost all in expected + orientation.

• 75,000 SNPs above LOD score of 3.

• 30,000 SNPs above LOD score of 6.

• current estimate: 60,000 high frequency SNPs.
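For reading these thresholds: assuming the LOD score here is the base-10 logarithm of the SNP-vs-error odds ratio (the usual meaning of LOD; the slides do not state the base), a LOD of 3 corresponds to odds of 1,000:1 and a LOD of 6 to 1,000,000:1. The helper names are illustrative.

```python
import math

def lod_score(odds_ratio):
    """LOD = log10 of the odds ratio (assumed definition)."""
    return math.log10(odds_ratio)

def odds_from_lod(lod):
    return 10 ** lod

print(lod_score(1000), odds_from_lod(6))
```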

Page 46: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Hs#S785496

Hs#S1065649

Hs#S706294

Hs#S730843

Hs#S751356

Hs#S786081

Hs#S417458

Hs#S751274

Hs#S483955

Hs#S1434119

Hs#S1065241

CONSENS0 gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact

1970 1980 1990 2000 2010 2020 2030 2040 2050

gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact

gagggccc.actcccttg.ctggccccagccctgctgna.nt.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact

gagg..cccactcccttg.ctggccccagccctgctgan.at.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact

gagggccccactcccttg.ctagtgtcagccctgctggggat.ccccgcctggccaggagcagagcacgggtggtccccattccaccccaagagaact

gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtgatccccgttccaccccaagagaact

gagggccccactcccttg.ctaggac.agcc.tgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact

gagggccccactcccttg.ctggccccagccctgctgga.atancccgcctggccaggagcag.gcacgggtnatccccgttccaccccaagagaact

gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtgatccccgttccaccccaagagaact

gagggccccactccctgggcttggcccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact

gagggccc.actcccttg.ctggccccagcc.tgctgga.gt.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact

aagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaaaagaact

Megakaryocyte Potentiating Factor (Unigene Cluster Hs.155981)

Page 47: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Chromatographic Evidence

[Figure: chromatogram traces for ESTs Hs#S785496 (zu42c08.r1) and Hs#S1065649 (oz03ho7.x1*), showing G in G G T G G T C C C and A in G G T G A T C C C at the polymorphic position.]

Page 48: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

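RFLP Detection of SNPs

[Figure: gel of MboI digests, lanes 1 to 11, with bands at 86 nt, 67 nt, 35 nt and 32 nt.]

Lane:      1   2   3   4   5   6   7   8   9   10  11
Genotype:  GG  GG  GG  AA  GG  GA  GG  GA  GG  AA  GA

[Figure: fragment map. A constant MboI site (GATC) releases an 86-base fragment. The G/A SNP lies in a second, polymorphic site, [MboI] (G[G/A]TC): the remaining 67 bases stay intact with G and are cut into 32- and 35-base fragments with A.]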
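The band pattern for each genotype can be reproduced with a small in-silico digest. MboI recognizes GATC and cuts immediately 5' of the G. The amplicon below is a made-up filler sequence that only matches the fragment sizes on this slide (86, 32 and 35 bases); it is not the real sequence, and the function names are hypothetical.

```python
def digest(sequence, site="GATC"):
    """Fragment lengths after cutting just before every occurrence of site."""
    cuts, start = [], sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)
    edges = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(edges, edges[1:]) if b > a]

def amplicon(allele):
    """Hypothetical 153-base amplicon: constant GATC, then G[allele]TC."""
    return "T" * 86 + "GATC" + "T" * 28 + "G" + allele + "TC" + "T" * 31

def bands(genotype):
    """Distinct band sizes expected on the gel for a two-letter genotype."""
    sizes = set()
    for allele in genotype:
        sizes.update(digest(amplicon(allele)))
    return sorted(sizes, reverse=True)

for genotype in ("GG", "GA", "AA"):
    print(genotype, bands(genotype))
```

Heterozygotes show both the uncut 67-base band and the 35/32 pair, which is how the GA lanes are told apart from GG and AA.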

Page 49: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Verified 56 of 79 SNPs tested so far

RFLP Verification on 16-24 DNA Samples

[Figure: bar chart of % verified (axis 0% to 90%) versus score, binned as 1-6, 6-20 and >20.]

Page 50: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Verification Test: Whitehead cSNPs

• Whitehead Institute has systematically searched for SNPs in 106 genes, using 20 Europeans, 10 Africans, 10 Asians.

• On 54 genes, our predicted cSNPs (score > 3) are verified by their results at a 70% rate. The Whitehead set may be incomplete.

Page 51: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

                     cSNPs                low frequency        high frequency
Gene     #seq  #lib  predicted  verified  predicted  verified  predicted  verified
AHC      16    6     0          -         -          -         -          -
APOD     207   59    3          0         3          0         -          -
AR       18    4     0          -         -          -         -          -
AT3      55    12    2          2         0          -         2          2
BDNF     18    9     0          -         -          -         -          -
CETP     14    8     1          1         0          -         1          1
CGA      122   17    0          -         -          -         -          -
CNTF     4     1     0          -         -          -         -          -
COMT     183   67    8          2         2          0         6          2
CYP11A   64    21    3          0         2          0         1          0
CYP11B2  7     4     0          -         -          -         -          -
DRD1     4     1     0          -         -          -         -          -
F10      29    15    2          1         1          0         1          1
F13A1    141   40    2          2         0          -         2          2
F2       37    14    0          -         -          -         -          -
F5       49    14    5          5         1          1         4          4
F9       7     2     1          1         0          -         1          1
FGB      266   32    3          3         0          -         3          3

total                30         17        9          1         21         16
percentage verified             57%                  11%                  76%

Page 52: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Validation Test: HLA-A

• HLA polymorphism has been studied very extensively for the general population, providing a “gold standard” for all true positives.

• 140 distinct HLA-A allele sequences available from Anthony Nolan Foundation database.

Are any of our predicted HLA-A SNPs not independently verified by this data?

Page 53: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

T C T Tgatggccgtc atggcgcccc gaaccctcgt cctgctactc tcgggggccc tggccctgac ccagacctgg C T C T T T A

A A T AC C A T Agcgggctccc actccatgag gtatttcttc acatccgtgt cccggcccgg ccgcggggag ccccgcttca A T A AC C A A A AA G A T

A T A Atcgccgtggg ctacgtggac gacacgcagt tcgtgcggtt cgacagcgac gccgcgagcc agaggatgga A A T A T G G A G C AA T T CA A C G AA Agccgcgggcg ccgtggatag agcaggaggg gccggagtat tgggacgggg agacacggaa tgtgaaggcc A AA T T CA C A G A AA T

A A GG T C C A C T C Acactcacaga ctgaccgagt ggacctgggg accctgcgcg gctactacaa ccagagcgag gccggttctc G T C C AG CC T T C A A A

G C G T C T C Gacaccatcca gataatgtat ggctgcgacg tggggtcgga cgggcgcttc ctccgcgggt accacaggac C GG T C C T C G TGGAG T G G C C A

T T A Agcctacgacg gcaaggatta catcgccctg aacgaggacc tgcgctcttg accgcggcgg acatggcggc T T A A C A A A

C G C C A A T G CA CA T Ttcagatcacc aagcgcaagt gggaggcggc ccatgtggcg gagcagttga gagcctacct ggagggcacg T C C A A T G A CA T T T G C G

C Atgcgtggagt ggctccgcag atacctggag aacgggaagg agacgctgca gcgcacggac gcccccaaga C C G C GA

A C C C A A T A Gcgcatatgac tcaccacgct gtctctgacc atgaggccac cctgaggtgc tgggccctga gcttctaccc A C C C A A T A G

Ttgcggagatc acactgacct ggcagcggga tggggaggac cagacccagg acacggagct cgtggagacc T C T

C C A G TAT A Gaggcctgcag gggatggaac cttccagaag tgggcggctg tggtggtgcc ttctggacag gagcagagat G TAT A G A A

T C TAacacctgcca tgtgcagcat gagggtctgc ccaagcccct caccctgaga tgggagccgt cttcccagcc A T C G TA

G A C A T Gcaccatcccc atcgtgggca tcattgctgg cctggttctc tttggagctg tgatcactgg agctgtggtc G A T C A C C A T G

A G C T A T Cgctgctgtga tgtggaggag gaagagctca gatagaaaag gagggagcta ctctcaggct gcaagcagtg C G A C T T A

A C acagtgccca gggctctgat gtgtctctca cagcttgtaa agtgtga A C

Page 54: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

HLA-A: 89% Verification Rate

• Of the 108 SNPs we predicted in the coding region of HLA-A, 96 are independently validated by the known HLA-A allele sequences, and 12 are not.

• By comparison, the NCI CGAP project (based on the same EST data) predicts just 10 SNPs in HLA-A (>90% false negatives!)

Page 55: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Mass Spectrometry Validation

• SNPs change the mass of a DNA fragment.

• Sequenom Inc. has tested more than 1000 of our SNPs using mass spectrometry of pooled DNA samples.

• 80% were detectably polymorphic in samples of 90 people.

Page 56: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Bioinformatics Key to SNPs

Estimated Number of Human SNPs found

[Figure: bar chart, vertical axis from 0 to 50,000 SNPs, comparing MIT-AFFY, NCI (NIH), Wash. U. and UCLA.]

Page 57: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

EST-based SNP detection is similar in reliability to experimental methods

project                  # SNPs  verification  # people     method
Picoult-Newberg et al.   850     63%           18           sequencing & GBA
Buetow et al.            3000    82%           10 to 90     RFLP
UCLA high-LOD            30000   69%, 79%      8 to 24      RFLP
UCLA LOD 3               75000   57%           8 to 24      RFLP

Halushka et al.          874     79%           resequenced  VDA
Cargill et al.           560     55%, 60%      resequenced  VDA, DHPLC

Page 58: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Application to Disease Gene Mapping

• How do SNPs compare with traditional marker sets used for disease gene mapping projects?

• Density: how dense is the marker set, when mapped onto the human genome?

• Ideal: at least one marker per gene (strong linkage disequilibrium within 3kb)

• Ideal: high heterozygosity for good statistics
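Heterozygosity for a two-allele SNP follows from the allele frequency under Hardy-Weinberg proportions: H = 2q(1 - q), which peaks at 0.5 when both alleles are equally common. A short sketch (the function name is illustrative) shows why common SNPs make the most informative markers:

```python
def heterozygosity(q):
    """Expected fraction of heterozygous individuals for allele frequency q."""
    return 2 * q * (1 - q)

for q in (0.05, 0.2, 0.4, 0.5):
    print(q, heterozygosity(q))
```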

Page 59: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Chromosome 22

Contig NT_001454 (14.6 MB); detail: 1.4 MB from 22q13.1

[Figure: map of chromosome 22 (bands p13, p12, p11.2, p11.1, q11.1, q11.2, q12.1, q12.2, q12.3, q13.1, q13.2, q13.3; region 22q11.2 - q13.3) with labels 1 to 14 along the contig and a 1.4 MB detail scaled from 0.1 to 1.4, marking MICROSATELLITE and SNP positions.]

Microsatellite markers: AFM273vd9, AFMa046za5, AFM164ze3, AFM261ye5

Genes and Unigene clusters labeled on the map: EIF3S7 (261/106), Hs.107692 (5/1), PVALB (66/17), NCF4 (2/1), CSF2RB (29/14), TST (118/56), Hs.94810 (27/13), Hs.194750 (4/3), IL2RB (1/1), RAC2 (67/35), MFNG (123/37), MSE55 (26/16), Hs.25744 (20/13), Hs.197713 (3/3), Hs.143856 (3/2), Hs.146766 (4/2), Hs.212478 (1/1), Hs.118700 (3/2), Hs.190885 (9/6), Hs.97858 (3/2), Hs.187933 (2/1), Hs.187027 (2/2), Hs.119913 (1/1), Hs.205802 (1/1), Hs.196536 (3/2), Hs.174434 (1/1), Hs.176560 (6/6), Hs.147244 (1/1), Hs.178824 (1/1), Hs.207456 (1/1), Hs.220558 (3/2), Hs.139929 (16/12), Hs.193078 (3/2), Hs.196941 (20/15), Hs.22011 (27/14), Hs.187981 (8/3), Hs.57973 (19/12), Hs.6071 (48/32), Hs.211929 (108/51), Hs.177397 (2/2), Hs.7189 (19/14), Hs.5790 (71/31)

Page 60: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Mapping Test: positionally cloned genes

• Positionally cloned genes represent a (somewhat) random sampling of genes.

• They are examples of actual disease-gene mapping targets, that typically took years of linkage analysis and chromosome walking to find.

• How good is the coverage and heterozygosity of our SNP marker set for these genes?

Page 61: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Gene    Disease                                  SNPs  Heterozygosity
ALD     X-Linked Adrenoleukodystrophy            0
APC     Adenomatous Polyposis Coli               7     0.45, 0.42, 0.40, 0.39, 0.26
CFTR    Cystic Fibrosis                          4     0.48, 0.38, 0.35, 0.35
CHM     Choroideremia                            0
CLC1    Thomsen Disease                          0
DM      Myotonic Dystrophy                       0
DMD     Duchenne Muscular Dystrophy              3     0.47, 0.34, 0.24
FMR1    Fragile X Syndrome                       5     0.21, 0.21, 0.18, 0.18, 0.18
GK      Glycerol Kinase Deficiency               9     0.50, 0.50, 0.49, 0.49, 0.49
GLYRA2  Hyperekplexia                            0
HD      Huntington's Disease                     4     0.50, 0.46, 0.35, 0.35
KRT9    Epidermolytic Palmoplantar Keratoderma   0
MLH1    Hereditary Non-polyposis Colon Cancer    2     0.24, 0.08
MNK     Menkes Syndrome                          0
MSH2    Hereditary Non-polyposis Colon Cancer    2     0.31, 0.28
NDP     Norrie Disease                           1     0.49
NF1     Neurofibromatosis, Type 1                3     0.50, 0.50, 0.47
NF2     Neurofibromatosis, Type 2                3     0.49, 0.48, 0.26
OCRL    Lowe Syndrome                            7     0.50, 0.50, 0.50, 0.46, 0.33
PAX3    Waardenburg Syndrome                     5     0.38, 0.38, 0.38, 0.38, 0.38
PAX6    Aniridia                                 0
PKD1    Polycystic Kidney Disease                0
RB1     Retinoblastoma                           3     0.36, 0.28, 0.10
RET     Multiple Endocrine Neoplasia 2A          0
SOD1    Amyotrophic Lateral Sclerosis            11    0.48, 0.35, 0.26, 0.21, 0.16
SRY     Gonadal Dysgenesis                       0
TSC     Tuberous Sclerosis                       0
VHL     Von Hippel-Lindau Disease                13    0.50, 0.50, 0.49, 0.48, 0.48
WND     Wilson Disease                           0
WT1     Wilms Tumor                              3     0.44, 0.43, 0.41

Page 62: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP validation tests: β-globin

• β-globin polymorphism has been studied intensively, identifying hundreds of substitutions.

• Verify predicted SNPs against known mutations.

• We detect 21 SNPs in β-globin, 17 within exons.

Page 63: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP codon distribution

FEATURE     n_SNPs
cod_pos_1   2
cod_pos_2   4
cod_pos_3   11

SNPs highly biased towards third codon position

Page 64: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNP substitution type

FEATURE        n_SNPs
silent         6
conservative   9
non-conserved  2

SNPs Biased towards Silent or Conservative Substitutions
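Whether a coding SNP is silent, and which codon position it changes, can be read off the standard genetic code. The sketch below builds the 64-codon table and classifies a codon change; the function name is illustrative. It does not attempt the conservative vs. non-conserved distinction, which would need an amino acid similarity scheme that the slides do not specify.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify(codon, variant):
    """Return (changed positions, reference aa, variant aa, kind)."""
    positions = [i + 1 for i in range(3) if codon[i] != variant[i]]
    aa_from, aa_to = CODON_TABLE[codon], CODON_TABLE[variant]
    kind = "silent" if aa_from == aa_to else "amino acid change"
    return positions, aa_from, aa_to, kind

print(classify("CAC", "CAT"))   # third-position change
print(classify("GAG", "GTG"))   # second-position change
```

Third-position changes are often silent because of the code's degeneracy, which is consistent with the third-position bias reported on the previous slide.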

Page 65: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

codon                polymorphism                  protein                               disease
pos.  codon  AA      codon  AA    LOD     f (%)    pos.  location      type             association
3     CAC    HIS     CAT    HIS   83.1    17       2     surface       silent
2     GAG    GLU     GTG    VAL   240.8   5        6     αβ interface  non-conserved    sickle cell
3     GGC    GLY     GGA    GLY   48.9    9        16    surface       silent
2     AAG    LYS     AGG    ARG   52.9    11       17    surface       conservative
3     AGG    ARG     AGT    SER   13.2    8        30    αβ interface  non-conserved    hemolytic anemia
3     CTG    LEU     CTA    LEU   12.9    7        31    core          silent
3     GTG    VAL     GTC    VAL   12.1    8        33    αβ interface  silent
2     GTC    VAL     GCC    VAL   7.4     8        34    αβ interface  silent
3     CAC    HIS     CAA    GLN   14.3    2        77    surface       conservative
3     GAC    ASP     GAA    GLU   3       2        79    surface       conservative
1     AAG    LYS     GAG    GLU   2.3     2        82    surface       conservative
2     ACC    THR     AAC    ASN   23.2    3        84    surface       conservative
1     CTC    LEU     TTC    PHE   2.1     2        105   αβ interface  conservative     erythrocytosis
3     GTG    VAL     GTT    VAL   4.8     2        113   surface       silent
3     CAC    HIS     CAA    GLN   7.6     5        117   surface       conservative
3     GAA    GLU     GAT    ASP   4.2     4        121   surface       conservative
3     GTG    VAL     GCC    ALA   23.7    2        134   core          conservative

Page 66: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

SNPs detect three disease alleles

• Mutations previously identified as causing disease, catalogued by Online Mendelian Inheritance in Man.

• They include the only two non-conservative amino acid substitutions detected.

• All three are at the α-β chain interface.

Page 67: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Verified SNP: Hb Tacoma disrupts the α-β interface

Page 68: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

His 77 → Gln: exposed, unlikely to disrupt stability

Page 73: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

What are the most polymorphic genes in the Human Genome?

• Very large differences in polymorphism levels in different genes.

• Maintaining high levels of diversity (large numbers of alleles) may indicate a selective pressure.

• What can we learn from patterns of polymorphism?

• Why are some genes so polymorphic?

Page 74: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

The Most Polymorphic Genes: Five Classes

• Direct interactions with pathogens.

• Very highly expressed genes.

• Genes involved in tumorigenesis and survival/growth of tumors.

• Viral- and transposon-derived sequences.

• Large families of highly similar genes?

Page 75: An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Acknowledgements

• Christopher Lee: K. Irizarry, B. Modrek, C. Grasso

• Wing Wong (Statistics): C. Li

• Stan Nelson (Human Genetics): V. Kustanovich, N. Brown