Software for Merging Microsatellite Datasets

Angela Presson2

Jeanette Papp1

Eric Sobel1

Ken Lange1,2,3

1. Department of Human Genetics
2. Department of Statistics
3. Department of Biomathematics

Outline

I. How microsatellite data is generated.

II. Alignment concept and Bayesian model.

III. Algorithm and software for aligning microsatellite datasets.

IV. Alignment results

Microsatellite Markers

Microsatellite markers amplify microsatellite DNA: repeats of a 1-6 nucleotide motif within or near genes. For example, a dinucleotide repeat region of DNA: GTGTGT…

A genotype consists of 2 alleles. For example, a SNP (single nucleotide polymorphism) allele consists of only 1 base (A, C, G, or T), and possible SNP genotypes include A-A, A-C, G-C, …

Unlike SNP markers, microsatellite markers have many alleles (typically 5-15).

For example, genotypes might be 200-202, 200-206, 206-204, etc. Alleles are differentiated by their length in base pairs; for example, allele 200 is 200 bases long.

[Figure: frequency histogram for Marker X, a microsatellite marker with 7 alleles. x-axis: allele length (bp), 116 to 130; y-axis: frequency, 0 to 0.25.]

Microsatellites play causal roles in diseases, especially cancer and neurological diseases (e.g., fragile X syndrome).

Why Merge Microsatellite Data?

• Facilitates genotyping collaboration

• Allows you to combine current data with legacy datasets

• Re-genotyping is not always an option (insufficient DNA, subjects may be dead or unavailable, IRB may not approve)

• Increase power with larger sample sizes: crucial for association studies.

• Major genetics study types include linkage (for studying traits in families) and association (for studying traits across unrelated individuals).

• In linkage studies you can pool results from each family fairly easily. In association studies, you gain power from analyzing a combined data set.

Genotyping Platforms

• Old school: slab gel

• New school: capillary gel

Allele length estimates are binned

[Figure: allele frequency histogram for marker D4S123. x-axis: allele length (bp), 108 to 130; y-axis: frequency, 0 to 0.3.]

Base Pair Value Bin   Letter Code   Integer Code   Consecutive Integer
108                   A             1              1
116                   E             5              2
118                   F             6              3
120                   G             7              4
122                   H             8              5
123                   H+1           8.5            6
128                   K             11             7
130                   L             12             8

Marker X: base pair value bins

bp value: 120.0 121.9 123.8 ... 182.9
Lab 1: 120 122 124 ... 186
Lab 2: 120 121 123 ... 182

Obstacles to merging by allele length

1. Different genotyping platforms

2. Different molecular weight standards

3. Different curve fitting algorithms

4. Different primer design

5. Different binning methods

Estimates of allele sizes from different sources do not always differ by the same amount, both between and within markers.

Merge based on allele length (Base-Pair Merge) vs. merge based on allele frequencies

[Figure: "Base Pair Merge". Allele frequency histograms for dataset 1 and dataset 2 plotted against base pair value (231 to 267); y-axis: frequency, 0 to 0.35.]

[Figure: "Frequency Merge". Allele frequency histograms for dataset 1 and dataset 2 plotted against bins 1 to 17; y-axis: frequency, 0 to 0.35.]

Experimental evidence

• Weeks et al. genotyped 30 samples at two different genotyping centers (CIDR & MGS) and were unable to accurately align them by using allele length. The resulting error rate was 16.8%.

• Adding a constant number of base pairs to the allele sizes of all markers in 1 data set improved the alignment, but the merged data set still had an error rate of 14%.

• A 1.5% error rate can give false linkage conclusions… (Buetow, 1991).

Frequency information can be more useful than size information for merging genotype data.

Why Is an Automated Method Necessary?

• Manual merging can be extremely time-consuming

• Software would save time and improve accuracy

• “Expert System” gives a Quality Score and recommendations on each merge

Automated Method

• MicroMerge programmed in Fortran 90

• Aligns pairs of datasets

• Returns files for analyzing merged data

• Quality score indicates whether the results should be analyzed

Information for MicroMerge

Genotype Information     Used
Bin Size Order           Yes
Bin Frequencies*         Yes
Bin Spacing              No
Bin Base Pair Size       No
Samples in common        Yes

*For bin frequencies to be useful for merging:

1. The sample size must be reasonably large
2. The samples must have the same ethnicity

Background: Deriving Bayes’ Theorem

Events: A = go to a party this week, and B = sick this week.

Conditional probability rule:

$$\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)} \quad (1) \qquad\qquad \Pr(B \mid A) = \frac{\Pr(A, B)}{\Pr(A)} \quad (2)$$

Solving for Pr(A, B) in (2) and substituting into (1) gives the simplest form of Bayes’ theorem:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$

[Figure: sign showing Bayes’ theorem at the offices of Autonomy in Cambridge. Source: http://en.wikipedia.org/wiki/Bayes%27_theorem]

Deriving Bayes’ Theorem (cont.)

From the bottom of the previous slide we have:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)} \quad (3)$$

Now let A_i = go to a party on day i. If the events A_i are mutually exclusive and exhaustive, the law of total probability gives $\Pr(B) = \sum_j \Pr(B \mid A_j)\,\Pr(A_j)$, and using (3) we have:

$$\Pr(A_i \mid B) = \frac{\Pr(B \mid A_i)\,\Pr(A_i)}{\sum_j \Pr(B \mid A_j)\,\Pr(A_j)} \quad (4)$$

Bayesian Inference

• Goal: use Bayes’ theorem to infer model parameters θ from data D and a prior notion of θ, i.e., from the posterior distribution Pr(θ | D):

$$\Pr(\theta \mid D) = \frac{\Pr(D \mid \theta)\,\Pr(\theta)}{\int \Pr(D \mid \theta)\,\Pr(\theta)\,d\theta}$$

where Pr(D | θ) = likelihood of the data given the parameters, and Pr(θ) = prior distribution of the parameters.

• Allows for informative or uninformative priors.

• In practice most people use uninformative priors (this is what we do in MicroMerge).

• Informative priors can improve efficiency and limit bias in estimating a parameter of interest in the presence of other variables with strong but uninteresting (i.e., already known) effects.

• Informative prior example: the effect of smoking on infant mortality. See Dunson 2001, "Commentary: Practical Advantages of Bayesian Analysis of Epidemiologic Data", for a good discussion of the pros and cons of informative priors.

• Main questions for a Bayesian analysis:
  • How is θ distributed? (i.e., what is the prior?)
  • What is the likelihood equation for D?

Bayesian Inference – Likelihood Example for genotype data

• Genetics example: estimate population allele frequencies (p) from genotypes (X).

• Hardy-Weinberg (HW) principle: allele frequencies in a population remain constant, with genotype frequencies at the equilibrium distribution (p^2, 2pq, q^2), unless external forces (non-random mating, selection, etc.) are present.

• Examples of HW probabilities for a 4-genotype data set:

#   Genotype   HW probability
1   A-A        (p_A)^2
2   A-G        2 p_A p_G
3   C-C        (p_C)^2
4   A-C        2 p_A p_C

• Dataset likelihood:

$$\Pr(X \mid p) = \Pr(\text{A-A})\,\Pr(\text{A-G})\,\Pr(\text{C-C})\,\Pr(\text{A-C}) = (p_A)^2 \cdot 2 p_A p_G \cdot (p_C)^2 \cdot 2 p_A p_C$$
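The Hardy-Weinberg likelihood above can be sketched in a few lines of Python. The allele frequencies below are hypothetical values chosen only for illustration, and the function names are not part of MicroMerge.

```python
from math import prod

def hw_genotype_prob(a1, a2, freq):
    """Hardy-Weinberg probability of the unordered genotype a1-a2."""
    if a1 == a2:
        return freq[a1] ** 2          # homozygote: p^2
    return 2 * freq[a1] * freq[a2]    # heterozygote: 2pq

def dataset_likelihood(genotypes, freq):
    """Product of genotype probabilities, assuming independent samples."""
    return prod(hw_genotype_prob(a, b, freq) for a, b in genotypes)

# Hypothetical allele frequencies (assumption, not from the slides).
freq = {"A": 0.5, "C": 0.3, "G": 0.2}
# The four genotypes listed on the slide.
genotypes = [("A", "A"), ("A", "G"), ("C", "C"), ("A", "C")]
likelihood = dataset_likelihood(genotypes, freq)
```

For realistic sample sizes the product underflows quickly, so summing log-probabilities is usually the safer choice.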

MicroMerge Alignment Concept

• Datasets are merged one marker at a time.

• Consider a theoretical allele set containing all unique observed alleles. The number of theoretical alleles is bounded:
  • Minimum = number of unique alleles in the larger dataset
  • Maximum = sum of unique alleles in each dataset

• Align dataset alleles to theoretical alleles, then to each other.

• A Bayesian MCMC process (using a Metropolis-Hastings acceptance/rejection step) finds the optimal dataset-theoretical alignment, which involves:
  • Number of theoretical alleles
  • Frequencies of theoretical alleles
  • Dataset-theoretical allele alignments

Bayesian Model in MicroMerge

• Posterior distribution for the alignments z given the genotype data X (for a particular marker and ethnic group):

$$P(z \mid X) = \frac{P(X \mid z)\,P(z)}{\int P(X \mid z)\,P(z)\,dz}$$

Each alignment z is characterized by three variables:

n = number of theoretical alleles

p = theoretical allele frequencies

B = dataset partition

• The posterior distribution is sampled using MCMC

• The alignment with greatest posterior probability is used to align the datasets

Alignment Example

Dataset 1 alleles (DS1): A, B, C, D

Dataset 2 alleles (DS2): A′, B′, C′

Theoretical alleles (TA): 1, 2, 3, 4, 5, 6, 7 => n = 7

TA:   1  2 | 3  | 4 | 5  6  7
DS1:   A   | B  | C |    D
DS2:   A′  | B′ |      C′

Dataset 1 partition B1: A = {1,2}, B = {3}, C = {4}, D = {5,6,7}

Dataset 2 partition B2: A′ = {1,2}, B′ = {3}, C′ = {4,5,6,7}

• Data set alleles can align with more than one theoretical allele.

• In remaining slides, letters indicate dataset alleles or bins, and integers designate theoretical alleles.

MicroMerge Algorithm

I. Initialize variables: p ~ Dirichlet(θ) where θ_1 = θ_2 = … = θ_n = 1; B is initialized by choosing a random alignment for each dataset; n = n_min.

II. Propose a state change: (a) re-sample allele frequencies, (b) change a bin boundary, or (c) change the number of theoretical alleles.

III. Compute the likelihood of the proposed state.

IV. Accept or reject the proposed state based on the likelihood ratio.

V. Repeat II-IV: #iterations = (#observed dataset alleles) × (n_max)^2 × 1.5

VI. The amount of time spent in each configuration is proportional to the relative likelihood of that configuration
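The propose/accept loop in steps II-VI can be illustrated with a toy Metropolis-Hastings sampler. This is a minimal sketch on a small discrete state space with a symmetric proposal, not MicroMerge's sampler: the states, weights, and function names are assumptions made for illustration. It shows the property stated in step VI, that the fraction of time spent in each state approaches its relative weight.

```python
import random
from collections import Counter

def n_iterations(n_observed_alleles, n_max):
    """Iteration count from step V: alleles * n_max^2 * 1.5."""
    return int(n_observed_alleles * n_max ** 2 * 1.5)

def metropolis_hastings(weights, n_steps, seed=0):
    """Sample states 0..k-1 with probability proportional to weights."""
    rng = random.Random(seed)
    k = len(weights)
    state = rng.randrange(k)
    visits = Counter()
    for _ in range(n_steps):
        # Symmetric proposal: step to a neighbour on a ring of k states.
        proposal = (state + rng.choice((-1, 1))) % k
        # With a symmetric proposal the acceptance ratio is the weight ratio.
        ratio = weights[proposal] / weights[state]
        if rng.random() < min(1.0, ratio):
            state = proposal
        visits[state] += 1
    return [visits[i] / n_steps for i in range(k)]

weights = [1.0, 2.0, 4.0, 1.0]   # hypothetical unnormalised state weights
occupancy = metropolis_hastings(weights, n_steps=200_000)
```

Here `occupancy` should be close to the normalised weights (0.125, 0.25, 0.5, 0.125), up to Monte Carlo error.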

II.a. Change the Frequencies, p

Goal: re-sample theoretical allele frequencies.

1. Update theoretical allele counts given B, p, n and data

• For every dataset allele, compute fractional theoretical allele counts.

• Ex. Given n = 3, partition: A = {1}, B = {2,3}
  • Every occurrence of data set allele A increments the count of theoretical allele 1 by 1.
  • Every occurrence of data set allele B increments the count of theoretical allele 2 by p_2/(p_2 + p_3) and the count of theoretical allele 3 by p_3/(p_2 + p_3).
  • => each data set allele occurrence is counted once.

2. Re-sample allele frequencies, p′

Re-sample theoretical allele frequencies (p′) from a posterior Dirichlet distribution derived from the prior (θ) and fractional counts (c).
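The two sub-steps can be sketched as follows. The slide does not spell out the posterior parameters; this sketch assumes the usual conjugate form Dirichlet(θ + c), and the allele counts and function names are hypothetical. The Dirichlet draw uses normalised gamma variates from the standard library.

```python
import random

def fractional_counts(allele_counts, partition, p):
    """Spread each dataset allele's count over its theoretical alleles.

    allele_counts: dataset allele -> number of occurrences
    partition:     dataset allele -> list of theoretical alleles (1-based)
    p:             theoretical allele frequencies (p[0] is allele 1)
    """
    counts = [0.0] * len(p)
    for allele, occurrences in allele_counts.items():
        members = partition[allele]
        total = sum(p[t - 1] for t in members)
        for t in members:
            counts[t - 1] += occurrences * p[t - 1] / total
    return counts

def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution via normalised gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(draws)
    return [d / s for d in draws]

partition = {"A": [1], "B": [2, 3]}      # the slide's example, n = 3
p = [0.5, 0.3, 0.2]                      # hypothetical current frequencies
allele_counts = {"A": 10, "B": 20}       # hypothetical observed counts
c = fractional_counts(allele_counts, partition, p)
theta = [1.0, 1.0, 1.0]                  # flat prior, as in step I
# Assumed conjugate update: posterior parameters are theta + c.
p_new = sample_dirichlet([t + ci for t, ci in zip(theta, c)], random.Random(1))
```

With these inputs the 20 occurrences of B are split 12 and 8 between theoretical alleles 2 and 3, so every occurrence is still counted exactly once.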

II.b. Change a Partition, B1 or B2

1. Uniformly choose dataset i.

2. Choose from available boundary positions (arrows) according to weights: 1 = external position (red arrow), 2 = internal (blue arrow).

3. If chosen position is internal, randomly choose left or right boundary.

TA: 1 2 3 4 5 6 7 8
DS1: A | B | C | D

II.c. Change the Allele Number, n

Uniformly choose a theoretical allele m and either split it into m and m+1 or amalgamate it with m+1.

n′ = n + 1 or n′ = n – 1

• Restrictions:
  • If n = n_max, amalgamate only; if n = n_min, split only.
  • n can only be decreased if the chosen theoretical alleles m and m + 1 correspond to identical dataset alleles within each dataset.

• Ex. if dataset alleles {A,B,C,C} and {A′,B′,C′,C′} were aligned to theoretical alleles {1,2,3,4}, then theoretical alleles 3 and 4 could be amalgamated.

III. Data Likelihood

• Genotype probabilities are computed for each dataset separately according to Hardy-Weinberg equilibrium.

• Genotype probabilities depend on the partition and allele frequencies.

• Ex. if A = {1,2}, B = {3}, genotype A-B would have probability 2(p_1 + p_2)p_3, and A-A would have probability (p_1 + p_2)^2.

• The likelihood is computed by multiplying the probabilities of all dataset genotypes.
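A short sketch of this partition-aware likelihood, using the slide's example A = {1,2}, B = {3}. The frequencies are hypothetical and the helper names are illustrative rather than MicroMerge's own.

```python
from math import prod

def bin_freq(allele, partition, p):
    """Frequency of a dataset allele = sum over its theoretical alleles."""
    return sum(p[t - 1] for t in partition[allele])

def genotype_prob(a, b, partition, p):
    """Hardy-Weinberg probability of dataset genotype a-b."""
    fa, fb = bin_freq(a, partition, p), bin_freq(b, partition, p)
    return fa * fa if a == b else 2 * fa * fb

def likelihood(genotypes, partition, p):
    """Multiply the probabilities of all dataset genotypes."""
    return prod(genotype_prob(a, b, partition, p) for a, b in genotypes)

partition = {"A": [1, 2], "B": [3]}
p = [0.2, 0.3, 0.5]                               # hypothetical p1, p2, p3
pr_ab = genotype_prob("A", "B", partition, p)     # 2(p1 + p2)p3
pr_aa = genotype_prob("A", "A", partition, p)     # (p1 + p2)^2
lik = likelihood([("A", "B"), ("A", "A")], partition, p)
```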

IV. Transition Probability

• The transition probability for moving from current state y to proposed state z for changes to B, p:

[Equation not recovered from the slide.]

• Where L(·) is the likelihood, v(·) is the prior and f(·) is the proposal probability. The prior for p cancels, and v(·) * f(·) cancels for B (and n).

• When n (the number of theoretical alleles) is changed, there is a change in the dimensionality of the parameter space. Thus we implement a reversible jump step:

[Equation not recovered from the slide.]

• Where j(·) denotes jump probabilities and |dz/dy| is a dimension-matching Jacobian term. |dz/dy| = p_m (frequency of the randomly chosen allele) when an allele is split and 1/p_m when an allele is lumped.
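The two acceptance-ratio equations for this slide are not present in the extracted text. For orientation only, the standard Metropolis-Hastings and reversible-jump forms written with the symbols defined above would look like the following. The arrow notation f(y → z) for "propose z from y" is an assumption introduced here, and the slide's exact expressions may differ.

```latex
% Fixed-dimension moves (changes to B or p): standard Metropolis-Hastings
\[
\alpha(y \to z) = \min\left\{ 1,\ \frac{L(z)\, v(z)\, f(z \to y)}{L(y)\, v(y)\, f(y \to z)} \right\}
\]

% Dimension-changing moves (changes to n): reversible jump
\[
\alpha(y \to z) = \min\left\{ 1,\ \frac{L(z)\, v(z)\, j(z \to y)}{L(y)\, v(y)\, j(y \to z)} \left| \frac{dz}{dy} \right| \right\}
\]
```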

Compute Priors when n Is Changed

• When p or B changes, all priors cancel in the transition ratio.

• When n changes, compute all priors.

• p ~ Dirichlet(θ) where θ_1 = θ_2 = … = θ_n = 1; p′ ~ Dirichlet(θ′) where θ′_1 = θ′_2 = … = θ′_n′ = 1

• Partition prior comes from graph theory: Pr(B_i) = d(B_i)/(2e), Pr(B_i′) = d(B_i′)/(2e′), where d(B_i) = number of possible new partitions following a single move from the current state and e = number of edges for current n.

• The prior probability for n is calculated according to the following formula where t = 0.5. The idea is to encourage smaller values of n.

[Equation not recovered from the slide.]

Incorporating Common Samples

• If B is changed, check consistency of dataset partitions with common samples information.

• Likelihood adjusted according to extent of consistency with common samples information.

[Equation not recovered from the slide.]

j/k = theoretical genotype

l/m = current sample genotype, dataset 1

r/s = current sample genotype, dataset 2

Where Pr(l/m | j/k) = 1 – ε if the current genotype agrees with the theoretical genotype and ε/(k_i – 1) if the genotype disagrees, where k_i is the total number of possible genotypes in dataset i.

Alignment Quality Scores

MicroMerge uses 4 quality scores to indicate alignment confidence. A marker's merged data set should be analyzed only if all 4 criteria are met:

1. Posterior probability: best score, returned from the model. Keep merged data for a marker if its alignment posterior probability > 0.3.

2. Difference in top 2 alignment posterior probabilities: the top alignment's posterior probability should be ≥ 0.1 more than that of the alignment with rank 2.

3. Average overlap probability: quantifies how often each pair of data set alleles overlap in the Markov chain. Average overlap probability should be > 0.9.

4. Likelihood ratio test: compares the likelihood of the merged data to the likelihood of the unmerged data. Keep merged data for a marker if the merged model (the null hypothesis) is not rejected at the 0.01 level, i.e., p ≥ 0.01.
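The four criteria combine into a simple accept/reject gate per marker. The sketch below is illustrative: the class and field names are assumptions, and the likelihood ratio test outcome is passed in as a boolean so that the gate does not hard-code the p-value comparison.

```python
from dataclasses import dataclass

@dataclass
class AlignmentScores:
    top_posterior: float      # posterior probability of the best alignment
    second_posterior: float   # posterior probability of the rank-2 alignment
    avg_overlap: float        # average overlap probability
    lrt_passed: bool          # outcome of the likelihood ratio criterion

def accept_marker(s: AlignmentScores) -> bool:
    """Return True only if all four quality criteria are met."""
    return (
        s.top_posterior > 0.3
        and s.top_posterior - s.second_posterior >= 0.1
        and s.avg_overlap > 0.9
        and s.lrt_passed
    )

good = accept_marker(AlignmentScores(0.60, 0.20, 0.95, True))
ambiguous = accept_marker(AlignmentScores(0.60, 0.55, 0.95, True))
```

In the second call the top two alignments are nearly tied, so the marker is rejected even though the other three scores pass.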

Average overlap probability score

• If the posterior probability of the best alignment is low, we quantify where the alignment is uncertain by computing a matrix of posterior overlap probabilities.

• Consider a data set allele i defined by lab 1 and a data set allele j defined by lab 2. These data set alleles are said to overlap if they share one or more theoretical alleles.

• The posterior probability that they overlap, o_ij, is approximated by the fraction of the time they overlap in the execution of the Markov chain.

• The overlap matrix O = (o_ij) provides pair-wise probabilities between aligned alleles. For a high quality alignment these probabilities should be close to 1, and their average should be > 0.9.

Alignment example: average overlap = 0.935, and we can see the problem lies with allele 15 in DS 1-B.
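The overlap probabilities can be estimated directly from the states visited by the chain. The sketch below assumes each visited state is stored as a pair of partitions (dataset allele mapped to a set of theoretical alleles); the data layout, the toy chain, and the function names are assumptions for illustration.

```python
def overlaps(bin1, bin2):
    """Two dataset alleles overlap if they share a theoretical allele."""
    return not set(bin1).isdisjoint(bin2)

def overlap_matrix(samples, alleles1, alleles2):
    """Fraction of sampled states in which alleles i and j overlap."""
    n = len(samples)
    return {
        (i, j): sum(overlaps(b1[i], b2[j]) for b1, b2 in samples) / n
        for i in alleles1
        for j in alleles2
    }

def average_overlap(o, aligned_pairs):
    """Mean overlap probability over the aligned allele pairs."""
    return sum(o[pair] for pair in aligned_pairs) / len(aligned_pairs)

# Hypothetical chain: three visits to one state and one visit to another.
state_a = ({"A": {1, 2}, "B": {3}}, {"A'": {1, 2}, "B'": {3}})
state_b = ({"A": {1}, "B": {2, 3}}, {"A'": {1, 2}, "B'": {3}})
samples = [state_a, state_a, state_a, state_b]

o = overlap_matrix(samples, ["A", "B"], ["A'", "B'"])
avg = average_overlap(o, [("A", "A'"), ("B", "B'")])
```

Off-diagonal entries such as `o[("B", "A'")]` show where the chain occasionally placed alleles in a competing alignment.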

Likelihood ratio quality score

H0: merged model with n ≈ n_min is most likely
H1: un-merged data with n = n_max is most likely

$$-2\ln(Q_0/Q_1) \sim \chi^2_v$$

Q_0 = merged data likelihood

Q_1 = original data likelihood

v = df(H1) − df(H0) [explicit expression not recovered from the slide], where d is the number of data sets, and |B_j| = (the number of unique bins in data set j) – 1.

Useful for assessing the quality of the alignment results (about 80% or more of the markers should fail to reject the null at the 0.01 level).
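The test statistic and its p-value can be computed with the standard library alone. This sketch takes the degrees of freedom v as an input, since the slide's expression for v is not available here; it uses the textbook closed-form chi-square tail for integer degrees of freedom. The function names and log-likelihood values are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r < df/2} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return total

def lrt_p_value(loglik_merged, loglik_unmerged, df):
    """Return (-2 ln(Q0/Q1), p-value) from log-likelihoods ln Q0, ln Q1."""
    stat = -2.0 * (loglik_merged - loglik_unmerged)
    return stat, chi2_sf(stat, df)

# Hypothetical log-likelihoods and degrees of freedom.
stat, p_value = lrt_p_value(-105.0, -100.0, df=3)
```

Working with log-likelihoods avoids forming the ratio Q_0/Q_1 directly, which would underflow for realistic data sets.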

MicroMerge Output

• Merged pedigree and locus files reflecting the top alignment for each marker (common analysis files for linkage/association studies).

• Alignment file, which contains the top 10 alignments for each marker and quality scores to indicate how much confidence one should have in the alignment.

Output: Alignment file

• Example of successful alignment: [figure not recovered]

• Example of failed alignment: [figure not recovered]

Testing MicroMerge

• Simulated Datasets (known alignments):
  • 99-251 samples each
  • 7 fictitious markers
  • 3-4 alleles/dataset

• Common Samples Datasets (same samples = known alignments):
  • 30 samples each
  • 31 markers (2 chromosomes: 8 and 17)
  • Same samples typed at 2 centers

• Typical Genotype Datasets (compared with manual alignments):
  • Dataset1 = 56-92 samples, Dataset2 = 238-850 samples
  • 50 markers
  • 16 common samples

• Typical data from Familial Dyslipidemia study (unknown alignment):
  • 3 data sets total: Dutch (275 samples), Finnish1 (228 samples), Finnish2 (248 samples)
  • 10 markers
  • No common samples

Results

Data set       # Markers   # DS1    # DS2     # DS3   Common Samples
Weeks et al.   31          30       30        ---     All 30 typed in common (but don’t use them all to test MM)
Typical Data   50          56-92    238-850   ---     16

Data set       # Correct Markers / Accepted   # Correct Alleles / Accepted   % Markers Accepted
Weeks et al.   28/28 (100%)                   100%                           28/31 (90.3%)
Typical Data   41/42 (97.6%)                  631/633 (99.6%)                41/50 (82%)

• 3 Weeks et al. data markers and 7 typical data markers that could not be aligned by MicroMerge had unclear alignments from manual merging.

Results: Familial Dyslipidemia (FD)

• When markers were analyzed independently there was one significant result (at the 0.05 level), for marker D11S1998.

• Combining data sets found a p-value = 0.011 for marker D11S2002.

• Marker D11S2002 was found to have the strongest linkage to FD among these markers in a larger fine-mapping analysis.

Iterations & Run-times (on 3GHz machine)

#iterations = (#observed dataset alleles) × (n_max)^2 × 1.5

• Simulated Datasets:
  • 21,384 - 96,384 iterations
  • 7.05 - 11.86 min., 8.20 min. avg.

• Common Samples Datasets:
  • 6,048 - 88,872 iterations
  • 0.82 - 23.75 min., 12.57 min. avg.

• Typical Genotype Datasets:
  • 56,301 - 1,970,487 iterations
  • 9.67 min. - 10.65 hrs., 1.39 hrs. avg.; all but 4 runs took < 2 hrs.

MicroMerge project website

http://www.genetics.ucla.edu/software/micromerge

Recent press for MicroMerge!

Paiva SR, Mariante Ada S, Blackburn HD. Combining US and Brazilian microsatellite data for a meta-analysis of sheep (Ovis aries) breed diversity: facilitating the FAO Global Plan of Action for Conserving Animal Genetic Resources. J Hered. 2011 Nov-Dec;102(6):697-704.

Appendix

Partition Prior

• The prior on the bin boundaries, b = |B_i| − 1, comes from graph theory: d(v)/(2e), where d(v) is the number of edges connected to node v, and e is the total number of edges for all possible nodes:

[Equation for 2e not recovered from the slide. Its factors are labeled "# of ways boundaries can be positioned" and "# of moves to # possible open positions".]

• Where o is the total number of external theoretical alleles (to the left of the left-most boundary and right of the right-most boundary):

1. The two external boundaries can be positioned in o − 1 ways

2. Positions for the b − 2 internal boundaries can be chosen in $\binom{n-o-1}{b-2}$ ways

3. Each of the o − 2 open external spaces can accept one move

4. Each of the n – o – b + 1 internal spaces can accept two moves

• Summing edges over nodes double counts edges (2e)

TA: 1 2 3 4 5 6 7
DS1: A | B | C | D

In this example, n = 7, b = 3, o = 3; 2e = 90.

Note: the formula calculates the overall number of edges and does not depend on a given partition! (I just show this partition to help explain terminology.)
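The formula for 2e itself is missing from the extracted text, but the four numbered counts can be combined into a sum over o. The sketch below is a reconstruction under that assumption, not the slide's own expression; the binomial term C(n − o − 1, b − 2) is inferred from the worked examples on the following slide. It reproduces the stated value 2e = 90 for n = 7, b = 3.

```python
from math import comb

def twice_edge_count(n, b):
    """Sum, over all boundary placements, of the moves available to each.

    n: number of theoretical alleles
    b: number of bin boundaries in the dataset partition (b >= 2)
    o: number of external theoretical alleles (summation variable)
    """
    total = 0
    for o in range(2, n - b + 2):
        # Items 1 and 2: placements of external and internal boundaries.
        placements = (o - 1) * comb(n - o - 1, b - 2)
        # Items 3 and 4: moves offered by external and internal open spaces.
        moves = (o - 2) + 2 * (n - o - b + 1)
        total += placements * moves
    return total

two_e = twice_edge_count(7, 3)
```

The partition prior for a given state is then its own move count d divided by this total.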

1. Examples of external boundary positions for n = 7 and b = 2 (there are o − 1 positions):

• If o = 2: 1 | 2 3 4 5 6 | 7 → 1 position

• If o = 3: 1 | 2 3 4 5 | 6 7 or 1 2 | 3 4 5 6 | 7 → 2 positions

2. Examples of internal boundary positions for n = 7 and b = 3 (the b − 2 internal boundaries can be chosen in $\binom{n-o-1}{b-2}$ ways):

• If o = 3: the 1 internal boundary can be positioned in 3 choose 1 = 3 ways: 1 | 2 | 3 4 5 | 6 7 or 1 | 2 3 | 4 5 | 6 7 or 1 | 2 3 4 | 5 | 6 7

• If o = 4: the 1 internal boundary can be positioned in 2 choose 1 = 2 ways: 1 | 2 | 3 4 | 5 6 7 or 1 | 2 3 | 4 | 5 6 7

Deriving the Jacobian term for: n → n + 1

• In the case where n increases, an allele with probability p_m is split into q_m = u_{n+1} p_m and q_{m+1} = (1 − u_{n+1}) p_m, where u_{n+1} ~ Uniform(0,1).

• For q_m and q_{m+1} we have the map:

$$M(p_m, u_{n+1}) = (q_m, q_{m+1}) = \big(u_{n+1}\,p_m,\ (1 - u_{n+1})\,p_m\big)$$

• The Jacobian of this map is the absolute value of the determinant of the matrix of partial derivatives. Writing x = u_{n+1} p_m, y = (1 − u_{n+1}) p_m, u = p_m, v = u_{n+1}:

$$\left|\det\begin{pmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{pmatrix}\right| = \left|\det\begin{pmatrix} v & u \\ 1 - v & -u \end{pmatrix}\right| = \left|-uv - u(1 - v)\right| = u = p_m$$

• The Jacobian term p_m is a volume magnification factor that ensures the volume stays consistent in spite of the change in the parameter space.

Deriving the Jacobian term for: n → n − 1

• Here we compute the inverse map M^{-1}(q_m, q_{m+1}).

• The easiest way to do this is to compute the partial derivatives of the M^{-1}(q_m, q_{m+1}) map and take the absolute value of the determinant.

• Take partial derivatives by computing the following 2 terms for the top row: ∂(q_m + q_{m+1})/∂q_m = 1 and ∂(q_m + q_{m+1})/∂q_{m+1} = 1, and the following 2 terms for the bottom row: ∂(q_m/(q_m + q_{m+1}))/∂q_m and ∂(q_m/(q_m + q_{m+1}))/∂q_{m+1}.
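Carrying these partial derivatives through gives the Jacobian for the merge move. In the worked sketch below, s = q_m + q_{m+1} is shorthand introduced here and does not appear on the slide.

```latex
\[
M^{-1}(q_m, q_{m+1}) = \left( q_m + q_{m+1},\ \frac{q_m}{q_m + q_{m+1}} \right) = (p_m,\ u_{n+1}),
\qquad s = q_m + q_{m+1}
\]
\[
\left| \det \begin{pmatrix} 1 & 1 \\[4pt] \dfrac{q_{m+1}}{s^2} & -\dfrac{q_m}{s^2} \end{pmatrix} \right|
= \left| -\frac{q_m}{s^2} - \frac{q_{m+1}}{s^2} \right|
= \frac{1}{s} = \frac{1}{p_m}
\]
```

This matches the factor 1/p_m quoted for lumping on the transition probability slide, and it is the reciprocal of the split Jacobian p_m, as expected for a move and its inverse.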

Results: Simulated & Weeks et al. Datasets

• Simulated Datasets
  • All simulated dataset pairs aligned correctly with high posterior probability

• Weeks et al. Datasets
  • Having only 30 samples in each dataset hinders merging since the model is based on allele frequencies
  • 29/31 markers were aligned correctly
  • The 2 incorrect markers had genotyping inconsistencies that were difficult to resolve via manual merging (8/31 had genotyping inconsistencies)

Allele Differences Between Genotyping Centers

Difference in number of unique observed alleles between centers results from:

(1) Sampling differences: alleles actually missing

(2) Typing differences: all alleles present in both datasets, but binning differences or genotype error causes discrepancies between centers

Model for Sampling Differences

Consider the following alignment between Datasets 1 & 2, where the missing allele is represented as a 0 or “Null allele”.

ex.  Ds1   Ds2
     A     0
     B     B′

Dataset 1 observes allele A, which is not observed in Dataset 2. In the merged dataset, all A alleles will be coded as 1, and all B and B′ alleles coded as 2.

=> Always finds a 1-to-1 relationship between dataset alleles

Model for Typing Differences

Alleles in Dataset 1 can align with multiple alleles in Dataset 2 (and vice versa), indicating lumping/splitting typing differences.

ex.  Ds1   Ds2
     A     B′
     B     B′

Dataset 1 splits allele B′ into alleles A and B, or Dataset 2 lumps alleles A and B into B′.

=> Often does not result in a 1-to-1 relationship between dataset alleles

Results Analyzed by Mendel

• MicroMerge assumes typing differences commonly cause differences in the number of observed unique alleles between centers
  • Weeks et al. test datasets had genotype inconsistencies in 8/31 markers
  • Explains variations in observed number of unique alleles when datasets are large

• A 1-to-1 alignment would be simpler
  • It could produce a merged dataset with alleles recoded under a global binning system
  • Easier to consolidate data in databases
  • Merged data can be analyzed with any genetics analysis software

MicroMerge v2

• Aligns multiple data sets simultaneously

• Recodes lumped alleles (typing differences model) to a 1-1 alignment (sampling differences model) so that files can be analyzed by genetic analysis software other than Mendel.

• Enables user-supplied population allele frequency vectors to combine small datasets.

A Bayesian model

• The following model and methods consider a particular marker and ethnic group.

$$P(B, p, n \mid x) = \frac{P(x \mid B, p, n)\,P(B)\,P(p)\,P(n)}{\int P(x \mid B, p, n)\,P(B)\,P(p)\,P(n)\,d(B, p, n)}$$

B = dataset partition

p = theoretical allele frequencies

n = number of theoretical alleles

x = all genotypes for all datasets

• The posterior distribution is sampled using MCMC

• The alignment with greatest posterior probability is used to combine the datasets

Future Work

• Reject poorly aligned allele pairs instead of entire markers.

• Align datasets containing samples from more than one ethnic group.

• SNP data merging?