Introduction to Haplotype Estimation Stat/Biostat 550


The Haplotype Problem

• Suppose we genotype individuals at a number of tightly linked SNPs.

A C G C C T T T G C G C

G A A C C C C C A G G C

• What do the types on the two chromosomes look like?
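To make the question concrete, here is a minimal sketch that takes the two chromosomes shown earlier, collapses them into the unphased genotype we would actually observe, and counts how many haplotype pairs are consistent with it. The function name `unphased_genotype` and the string encoding are illustrative assumptions, not part of the original slides.

```python
# The two chromosomes from the earlier slide, one letter per SNP.
hap_a = "ACGCCTTTGCGC"
hap_b = "GAACCCCCAGGC"

def unphased_genotype(h1, h2):
    """Return the genotype: an unordered (sorted) allele pair at each site."""
    return [tuple(sorted(pair)) for pair in zip(h1, h2)]

genotype = unphased_genotype(hap_a, hap_b)

# A site is heterozygous when the two alleles differ.
n_het = sum(1 for a, b in genotype if a != b)

# With k heterozygous sites there are 2**(k-1) distinct haplotype pairs
# that produce the same genotype (a single pair when k is 0 or 1).
n_pairs = 2 ** (n_het - 1) if n_het > 0 else 1

print(n_het, n_pairs)
```

Genotyping alone cannot tell us which of these pairs the individual actually carries, which is the haplotype problem.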


Haplotypes: who cares?

Many people, for many different reasons…

• LD mapping: increase power?

• LD mapping: decrease genotyping?

• Evolutionary studies: selection, recombination, gene conversion, population structure,…


The Haplotype Problem – potential solutions

• Molecular methods

• Collect family data

• Statistical methods for population data


The Simplest Case

• What do the types on the two chromosomes look like?


The Next Simplest Case

• What do the types on the two chromosomes look like?


The first difficult case…

• What do the types on the two chromosomes look like?


Clark’s Method (1990)

• Idea: use information obtained from other individuals in the population to determine the most probable haplotype pair.

Is it this configuration?

[Figure: labels 1, 2, 3]

…or this one?

[Figure: labels 1, 2, 3]

This one is more probable.

[Figure: labels 1, 2, 3]


Clark’s Method (Clark, 1990)

• Identify the unambiguous individuals (those with at most one heterozygous site).

• Make a list of “known” haplotypes from them.

• Go through the list and check whether each ambiguous individual can be made up from a “known” haplotype plus another “complementary” haplotype. If so, add the complementary haplotype to the list of “known” haplotypes.
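The three steps above can be sketched as follows. This is an illustrative sketch, not Clark's original implementation: it assumes biallelic SNPs, genotypes coded as strings over 0/1/2 (0 and 2 homozygous, 1 heterozygous), haplotypes as strings over 0/1, and the function names `complement` and `clark` are hypothetical.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None."""
    other = []
    for g, h in zip(genotype, hap):
        if g == "1":
            # Heterozygous site: the partner carries the other allele.
            other.append("1" if h == "0" else "0")
        elif (g == "0" and h == "0") or (g == "2" and h == "1"):
            # Homozygous site consistent with `hap`: the partner is identical.
            other.append(h)
        else:
            return None  # `hap` cannot be part of this genotype
    return "".join(other)

def clark(genotypes):
    """Resolve genotypes in list order; return (resolved, known)."""
    known = []      # ordered list of "known" haplotypes
    resolved = {}   # individual index -> (haplotype, haplotype)
    # Steps 1 and 2: unambiguous individuals seed the list.
    for i, g in enumerate(genotypes):
        if g.count("1") <= 1:
            h1 = "".join("1" if c == "2" else "0" for c in g)
            h2 = "".join("0" if c == "0" else "1" for c in g)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 3: repeat passes until no further individual can be resolved.
    changed = True
    while changed:
        changed = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                c = complement(g, h)
                if c is not None:
                    resolved[i] = (h, c)
                    if c not in known:
                        known.append(c)
                    changed = True
                    break
    return resolved, known

resolved, known = clark(["0000", "2200", "1100", "1221"])
print(resolved)
print(known)
```

In this toy input the third individual is resolved from the known list, while the fourth has no exact match and stays unresolved. The first matching haplotype in list order is always taken, so reordering the input can change the answer.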

Clark’s Method

[Figure: list of known haplotypes; labels 1, 2, 3]
Clark’s Method: Problem 1

[Figure: list of known haplotypes; labels 1, 2, 3]

The answer depends on the order in which the list is considered…

… and frequency information is ignored.

Clark’s Method: Problem 2

[Figure: list of known haplotypes; labels 1, 2, 3]

The algorithm can fail to resolve all haplotypes…

… because it looks only for exact matches.


Clark’s Algorithm: Summary

• Results may depend on the order in which individuals are considered.

• Frequency information is ignored.

• May fail to resolve all haplotypes.

• Fails to assess uncertainty.

• Looks only for exact matches.

• Fast and intuitive(?).


Maximum Likelihood (EM Algorithm)

• Idea: find the haplotype frequencies (f1,…,fN) that maximise the probability of the observed genotype data (g1,…,gn).

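$$\Pr(g_i \mid f_1,\dots,f_N) = \sum_{\{h_1,h_2\,:\,h_1 \oplus h_2 = g_i\}} f_{h_1} f_{h_2}$$

$$\Pr(g_1,\dots,g_n \mid f_1,\dots,f_N) = \prod_{i=1}^{n} \Pr(g_i \mid f_1,\dots,f_N)$$

Here $h_1 \oplus h_2 = g_i$ means that haplotypes $h_1$ and $h_2$ together make up genotype $g_i$.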
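A minimal EM sketch for the likelihood above, assuming biallelic SNPs with genotypes coded as strings over 0/1/2 (1 = heterozygous) and random mating. The names `compatible_pairs` and `em_haplotype_frequencies`, the uniform starting point and the fixed iteration count are illustrative assumptions. The sketch enumerates every compatible pair, so it is only practical for a small number of heterozygous sites.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [k for k, g in enumerate(genotype) if g == "1"]
    base = ["1" if g == "2" else "0" for g in genotype]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = base[:], base[:]
        for k, b in zip(het, bits):
            h1[k] = b
            h2[k] = "1" if b == "0" else "0"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=100):
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pair_lists:
            # E-step: posterior weight of each compatible pair, proportional
            # to f_h1 * f_h2 (constant factors cancel within an individual).
            weights = [freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

freq = em_haplotype_frequencies(["00", "00", "22", "11"])
for h, f in freq.items():
    print(h, round(f, 4))
```

In this toy data set the double heterozygote is compatible with both 00/11 and 01/10. Because 00 and 11 are also seen in the homozygous individuals, EM puts essentially all of the weight on 00/11.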


Bayesian version

Modify Clark’s algorithm:

• Replace the single pass through the data with an iterative scheme.

• Allow for uncertainty in resolution.

• Use frequency information.

The resulting “naïve Gibbs sampler” produces results similar to EM (Stephens, Smith and Donnelly 2001).

Example

[Figure: list of known haplotypes; labels 1, 2, 3; three candidate haplotype pairs considered in turn]

• One haplotype matches 1 known haplotype; the other does not match any. Assigned moderate probability.

• One haplotype matches 3 known haplotypes; the other does not match any. Assigned higher probability.

• Neither haplotype matches any known haplotype. Assigned low probability.
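One way to reproduce the moderate/higher/low ordering in the example is to weight each candidate pair by the product of (count among the other individuals' haplotypes + a small pseudo-count). This is a hedged sketch of a single Gibbs update, not the published sampler: the function `pair_weights`, the pseudo-count `alpha` (standing in for a prior) and the toy haplotypes are all assumptions.

```python
import random
from collections import Counter

def pair_weights(candidate_pairs, other_haplotypes, alpha=0.1):
    """Normalised probability of each candidate pair for one individual."""
    counts = Counter(other_haplotypes)
    raw = [(counts[a] + alpha) * (counts[b] + alpha) for a, b in candidate_pairs]
    total = sum(raw)
    return [w / total for w in raw]

# Haplotypes currently assigned to everyone else (hypothetical data).
others = ["0000", "1100", "1100", "1100"]

# Three pairs compatible with an all-heterozygous genotype "1111".
candidates = [
    ("0000", "1111"),  # matches 1 known, other matches none
    ("1100", "0011"),  # matches 3 known, other matches none
    ("0101", "1010"),  # neither matches any
]

probs = pair_weights(candidates, others)
print([round(p, 3) for p in probs])

# A Gibbs update would draw the individual's new pair with these probabilities.
rng = random.Random(0)
new_pair = rng.choices(candidates, weights=probs, k=1)[0]
print(new_pair)
```

Repeating this update over all ambiguous individuals, many times, gives an iterative scheme that uses frequency information and keeps track of uncertainty rather than committing to a single resolution.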


Problems with EM/naïve Gibbs

• Potentially (very) large number of parameters to estimate, leading to inaccurate estimates.

• Can be time-consuming for large problems.

• Can “converge” to poor local optima (alleviated by multiple runs).


Further modification

• Take into account “near misses”, as well as exact matches.

(PHASE v1.0: Stephens, Smith and Donnelly 2001)

Example

[Figure: list of known haplotypes; labels 1, 2, 3; three candidate haplotype pairs considered in turn]

• One haplotype matches 1 known haplotype; the other differs by 2 from 3 known haplotypes.

• One haplotype matches 3 known haplotypes; the other differs by 2 from 1 known haplotype.

• One haplotype differs by 1 from 3 known haplotypes; the other differs by 1 from 1 known haplotype.

How to balance these possibilities?


The key question

• What is the conditional distribution of the next haplotype, given a set of known haplotypes?

Example

[Figure: known haplotypes; labels 1, 2]

Given the above haplotypes, what would you expect the next haplotype to look like?


Qualitative answer

• The next haplotype will likely differ by a small number of mutations (possibly 0 mutations) from a (randomly-chosen) existing haplotype.

• Use theory (Ewens sampling formula; coalescent theory) to roughly quantify the distribution of the “small number”.
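The qualitative answer can be turned into a small simulation sketch: copy a randomly chosen existing haplotype, then apply a geometrically distributed number of mutations. The details here are assumptions made for illustration: haplotypes are 0/1 strings, a mutation flips one uniformly chosen site, each additional mutation occurs with probability theta / (n + theta) for n known haplotypes, and `theta` and `sample_next_haplotype` are hypothetical names.

```python
import random

def sample_next_haplotype(known, theta=1.0, rng=None):
    """Draw a 'next' haplotype given a list of known 0/1 haplotypes."""
    rng = rng or random.Random()
    n = len(known)
    # Choosing from the list with multiplicity favours common haplotypes.
    hap = list(rng.choice(known))
    # Geometric number of mutations: stop with probability n / (n + theta),
    # so zero mutations is the single most likely outcome when theta < n.
    while rng.random() < theta / (n + theta):
        site = rng.randrange(len(hap))
        hap[site] = "1" if hap[site] == "0" else "0"
    return "".join(hap)

known = ["0000", "1100", "1100"]
rng = random.Random(1)
print([sample_next_haplotype(known, theta=1.0, rng=rng) for _ in range(5)])
```

With `theta=0` the sampler only ever returns exact copies, which corresponds to looking for exact matches. Larger values of `theta` give more weight to near misses.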


Comparisons on simulated data


Problems

• Time-consuming for large problems.

• Can “converge” to poor local optima.

• Ignores recombination (decay of LD with distance).

• How should uncertainty in haplotype estimates be treated?


… to be continued.