OEB 242: Population Genetics Exam Review, Spring 2015  

Statistics Review

• HYPOTHESIS TESTING

o Null hypothesis has two parts: Substantive (what are values we expect if nothing interesting is happening?) and formal (how much deviation from expected values do we allow?)

§ Some exemplars:

• H0: Alleles at locus A and locus B assort independently; thus any deviation from a 1:1:1:1 gametic ratio is no greater than could be explained by chance alone at the α=.05 level.

• H0: The population is in Hardy-Weinberg equilibrium; thus any deviation from the expected p²:2pq:q² genotypic proportions (1:2:1 when p = q = 0.5) is no greater than could be explained by chance alone at the α=.1 level.

o The p-value is the probability, given H0, of data at least as extreme as those observed: P(data at least this extreme | H0)

§ “statistics means never having to say you’re certain” -- Must specify significance threshold

§ “fail to reject H0” (why not “accept H0”?)

§ reject H0: what can you therefore conclude (if anything)?

o Degrees of freedom are critical for connecting the test statistic to a p-value in, e.g., a chi-squared test. Given a contingency table, the d.f. represents the minimum number of entries necessary to repopulate the entire table, holding constant what is known about the dataset (the total number of datapoints and the proportions of datapoints that fall into one class or another).

§ Find by taking (total number of classes of data) – 1 (for fixing Ntot) – 1 (for every independent parameter estimated when furnishing expected values).
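The degrees-of-freedom rule above can be sketched for a Hardy-Weinberg test. This is a minimal illustration, not a course-provided routine: the genotype counts are hypothetical, the function name is made up for this example, and the p-value shortcut via `math.erfc` holds only for 1 degree of freedom.

```python
import math

def hwe_chi_squared(n_AA, n_Aa, n_aa):
    """Chi-squared goodness-of-fit test of genotype counts against HWE.

    Assumes both alleles are present, so no expected count is zero.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # allele frequency estimated from the data
    q = 1 - p
    observed = [n_AA, n_Aa, n_aa]
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # 3 classes - 1 (Ntot is fixed) - 1 (p was estimated from the data)
    df = len(observed) - 1 - 1
    # For df = 1 only, the chi-squared upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, df, p_value

# Hypothetical genotype counts, for illustration only.
chi2, df, p_value = hwe_chi_squared(298, 489, 213)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
```

If the p-value is above the chosen α, we fail to reject H0; that is not the same as showing that the population is in equilibrium.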

• RANDOM VARIABLES

o A random variable is an unspecified value that takes on actual values according to a probability distribution

o Mean or expected value is a weighted average of the possible values an r.v. can take

o Variance is the expected value of the squared deviations from the mean: $\mathrm{Var}(X) = E[(X-\mu)^2]$

o We have used a few different kinds of random variables in this course:

§ Binomial random variables represent the number of ‘successes’ in n independent trials, each of which has probability of success = p. An example is the Wright-Fisher model of drift, where we imagine reproduction as sampling from an infinite pool of gametes.

• $P(X = k \mid X \sim \mathrm{Bin}(n, p)) = \binom{n}{k} p^{k} q^{n-k}$, where q = 1 − p

• Mean = np; var = npq

§ Poisson random variables are binomial random variables with large n and small p. They are computationally more tractable and are useful to describe scenarios where you have very many chances to do something rare. Mutations, for example, are modeled as a Poisson process.

• $P(X = k \mid X \sim \mathrm{Pois}(\lambda)) = \frac{\lambda^{k} e^{-\lambda}}{k!}$

• Mean = var = λ

§ Geometric random variables represent the number of ‘failures’ before getting one ‘success’ with probability p. The Kingman coalescent, for example, imagines non-coalescence as a ‘failure’ with probability q = 1-p and coalescence as a ‘success’ where p is equal to the frequency of the allele in question (1/(2N) for a single ancestral gene copy in a diploid population of size N).

• $P(X = k \mid X \sim \mathrm{Geom}(p)) = q^{k} p$

• Mean = q/p, var = $q/p^{2}$

§ Exponential random variables are the continuous analogues of geometric random variables. The Kingman coalescent often uses this approximation, which holds when the population is large.

• $f(x \mid X \sim \mathrm{Exp}(\lambda)) = \lambda e^{-\lambda x}$ for x ≥ 0 (a probability density, since X is continuous)

• Mean = $\lambda^{-1}$, var = $\lambda^{-2}$
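The relationships among these four distributions can be checked numerically. The sketch below is illustrative only: the population size, allele frequency, number of sites and mutation rate are hypothetical values, and the helper names are made up for this example.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Wright-Fisher drift: copies of an allele among 2N gametes drawn at frequency p.
two_n, freq = 100, 0.3
pmf = [binom_pmf(k, two_n, freq) for k in range(two_n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, two_n * freq)              # both close to np
print(var, two_n * freq * (1 - freq))  # both close to npq

# Poisson as the large-n, small-p limit of the binomial (lambda = n * p).
n_sites, mu = 10_000, 2e-4
lam = n_sites * mu
max_gap = max(abs(binom_pmf(k, n_sites, mu) - poisson_pmf(k, lam))
              for k in range(21))
print(max_gap)  # small when n is large and p is small

# Geometric waiting time versus its exponential approximation.
p_coal = 1 / (2 * 1000)         # per-generation coalescence probability, N = 1000
t = 2000                        # generations
geom_tail = (1 - p_coal) ** t   # P(no coalescence in the first t generations)
exp_tail = math.exp(-p_coal * t)
print(geom_tail, exp_tail)
```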

o There are a few different ways to talk about the dependencies of random variables:

§ Covariance is an analogue of variance for two random variables. It describes the extent to which two r.v.s track each other: if I change one, how does the other change? Covariance is the expected value of the products of the deviations from the mean: $\mathrm{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]$

§ Correlation coefficient is a scaled version of covariance that falls between -1 (perfectly anticorrelated) and 1 (perfectly correlated). Divide the covariance by the product of the standard deviations (i.e., square root of the variance) of the two random variables to normalize.

§ The slope of the regression line is slightly different: it measures the directness of the association between two r.v.s, whereas covariance and correlation measure the precision or tightness of that association. Recall, for example, that the slope of the regression line between midparent and offspring gives the narrow sense heritability.
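The three measures of dependence can be compared on a small dataset. This is a minimal sketch: the midparent and offspring values are hypothetical, and the functions use population (divide-by-n) moments to match the expectation-based definitions above.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def regression_slope(xs, ys):
    """Least-squares slope of y on x: Cov(x, y) / Var(x)."""
    return covariance(xs, ys) / covariance(xs, xs)

# Hypothetical midparent and offspring phenotypes.
midparent = [1.0, 2.0, 3.0, 4.0, 5.0]
offspring = [1.4, 2.1, 2.4, 3.1, 3.5]

slope = regression_slope(midparent, offspring)
r = correlation(midparent, offspring)
print(f"slope = {slope:.2f}, correlation = {r:.3f}")

# Rescaling the offspring values changes the slope but not the correlation.
doubled = [2 * y for y in offspring]
print(regression_slope(midparent, doubled), correlation(midparent, doubled))
```

On these made-up values the slope would be read as the narrow-sense heritability estimate, while the correlation says how tightly the points cluster around the line.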

Chapter 4: Mutation and Neutral Theory

§ Infinite alleles model

o Assume each mutation creates a new allele. Hence, homozygosity implies identity-by-descent (IBD).

o F = homozygosity = probability two randomly chosen chromosomes (alleles) are IBD

§ $F_t = (1-\mu)^2 \frac{1}{2N} + (1-\mu)^2 \left(1 - \frac{1}{2N}\right) F_{t-1}$

§ Equilibrium value of F (mutation-drift balance; Ft = Ft-1): _____

o H = heterozygosity = ______

§ Equilibrium value of H (mutation-drift balance; Ht = Ht-1): $\frac{4N\mu}{1 + 4N\mu}$
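One way to build intuition for mutation-drift balance is to iterate the recursion for F until it stops changing. The sketch below assumes hypothetical values of N and µ, takes heterozygosity as the complement of homozygosity, and compares the result with the approximate equilibrium 4Nµ/(1 + 4Nµ).

```python
def iterate_homozygosity(n_diploid, mu, generations, f0=0.0):
    """Apply F_t = (1 - mu)^2 * [1/(2N) + (1 - 1/(2N)) * F_{t-1}] repeatedly."""
    f = f0
    for _ in range(generations):
        f = (1 - mu) ** 2 * (1 / (2 * n_diploid) + (1 - 1 / (2 * n_diploid)) * f)
    return f

# Hypothetical parameter values, chosen only for illustration.
N, mu = 1000, 1e-4
f_eq = iterate_homozygosity(N, mu, generations=50_000)
h_eq = 1 - f_eq                # heterozygosity as the complement of F
theta = 4 * N * mu
print(h_eq, theta / (1 + theta))   # the two values should nearly agree
```

The small gap between the two printed numbers reflects the terms in µ² and µ/N that the usual approximation drops.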

§ Infinite sites model

o Assume each mutation affects one base and that there is no recombination.

o Used to derive Kingman coalescent: $T_k \sim \mathrm{Exp}\left(\frac{k(k-1)}{4N}\right)$; $E[T_k] = \frac{4N}{k(k-1)}$

§ Use to predict number of mutations separating two sequences (= pairwise diversity or per-site heterozygosity, Π) by multiplying mutation rate times length of two branches. → $E[\Pi] =$ ____

§ Use to predict number of segregating sites in a sample of k alleles by multiplying mutation rate times ____. → $E[S] = \Theta \sum_{i=1}^{k-1} \frac{1}{i}$
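The coalescent expectations above can be tied together numerically. This is a sketch under stated assumptions: N and µ are hypothetical, Θ is taken to be 4Nµ, and the function names are invented for the example. It checks that the mutation rate times the expected total branch length of the genealogy reproduces the formula for E[S].

```python
def expected_coalescence_time(k, n_diploid):
    """E[T_k] = 4N / (k(k-1)): expected generations while k lineages remain."""
    return 4 * n_diploid / (k * (k - 1))

def expected_segregating_sites(k, theta):
    """E[S] = theta * sum_{i=1}^{k-1} 1/i for a sample of k alleles."""
    return theta * sum(1 / i for i in range(1, k))

# Hypothetical parameter values.
N, mu, k = 1000, 1e-3, 10
theta = 4 * N * mu

# While j lineages remain there are j branches, each of expected length E[T_j].
total_branch_length = sum(j * expected_coalescence_time(j, N)
                          for j in range(2, k + 1))
print(mu * total_branch_length)              # mutations expected on the whole tree
print(expected_segregating_sites(k, theta))  # same quantity from the formula
```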

§ The neutral theory

o The infinite sites model lets us estimate Θ in several different ways, which forms the basis of Tajima’s D and other neutrality tests.

§ $D = \frac{\theta_{\Pi} - \theta_{S}}{\sqrt{\mathrm{Var}(\theta_{\Pi} - \theta_{S})}}$

§ The denominator is a normalizing factor that is difficult to solve analytically. The numerator tells us whether using pairwise diversity or the total number of segregating sites gives a greater estimate of theta.

§ If D is positive, this suggests a surplus of ____ alleles (which inflate pairwise diversity disproportionately). This is consistent with ____ coalescent times, and suggests balancing selection or admixture.

§ If D is negative, this suggests a surplus of ____ alleles (which inflate the number of segregating sites disproportionately). This is consistent with ____ coalescent times, and suggests directional selection or population growth.
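A short calculation can make the sign of D concrete. The sketch below is not taken from this handout: it uses the standard variance constants from Tajima (1989) for the denominator, and the sample size, number of segregating sites and pairwise diversity are hypothetical inputs.

```python
import math

def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size n, segregating sites S and pairwise diversity pi.

    The constants a1..e2 are the usual Tajima (1989) normalizing terms
    (an addition for this example, not derived in the text above).
    """
    if n < 3 or seg_sites < 1:
        raise ValueError("need n >= 3 sequences and at least one segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    theta_s = seg_sites / a1   # estimate of theta from segregating sites
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - theta_s) / math.sqrt(variance)

# Hypothetical sample: 10 sequences, 16 segregating sites.
print(tajimas_d(10, 16, 3.888889))  # pi below S/a1, so D is negative
print(tajimas_d(10, 16, 8.0))       # pi above S/a1, so D is positive
```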

o In the neutral theory, the probability of fixation of an allele is its frequency.

§ As a corollary, the fixation rate is therefore the neutral mutation rate, independent of population size. The probability that any one new allele fixes is (____), and the population-wide rate of new mutations is (____). The product of these two values is simply µ.

§ The average time between fixation events is therefore (1/µ).

o The expected time to fixation of an allele, given that it will eventually fix, is 4N generations. The expected time to loss of a new neutral allele, conditional on its eventual loss, is 2ln(2N) generations.
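The claim that a neutral allele fixes with probability equal to its frequency can be checked by simulation. This is a rough sketch: the number of gene copies, the starting count, the replicate number and the seed are all arbitrary choices, and each generation is modelled as binomial sampling of 2N gametes, as in the Wright-Fisher model.

```python
import random

def allele_fixes(two_n, copies, rng):
    """Run neutral Wright-Fisher drift until the allele is lost or fixed."""
    while 0 < copies < two_n:
        freq = copies / two_n
        # Binomial sampling: each of the 2N gametes carries the allele w.p. freq.
        copies = sum(rng.random() < freq for _ in range(two_n))
    return copies == two_n

def fixation_fraction(two_n, copies, replicates, seed=1):
    """Fraction of replicate populations in which the allele fixes."""
    rng = random.Random(seed)
    fixed = sum(allele_fixes(two_n, copies, rng) for _ in range(replicates))
    return fixed / replicates

# Hypothetical setup: 2N = 20 gene copies, allele starts at 6 copies (frequency 0.3).
estimate = fixation_fraction(two_n=20, copies=6, replicates=4000)
print(estimate)  # should land near the starting frequency, 0.3
```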

Chapter 7: Molecular Population Genetics

§ Here, we are looking at timescales that don’t allow us to invoke the infinite sites/alleles models. Generally, we need to account for the possibility of multiple mutations at the same site. Distinguish d (the number of differences observed) from k (the number of substitutions inferred).

o The procedure for inferring substitutions from differences depends on whether we are talking about amino acids or about nucleotides, and on what assumptions we make about the probabilities of various mutations.

o The Jukes-Cantor model, for example, assumes ___________________________. From this assumption, one can establish a recurrence equation for the probability of a site taking on a given identity, which then can be translated into a differential equation and solved to find an estimator for k based on d.
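For nucleotides, the end result of that derivation is the standard Jukes-Cantor distance. The sketch below assumes d is expressed as the proportion of sites that differ and k as substitutions per site; the function name and the input values are illustrative.

```python
import math

def jukes_cantor_distance(d):
    """Estimate substitutions per site, k, from the proportion of differing sites, d.

    Standard Jukes-Cantor correction for nucleotides: k = -(3/4) ln(1 - 4d/3).
    """
    if not 0 <= d < 0.75:
        raise ValueError("d must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * d / 3)

for d in (0.01, 0.10, 0.30, 0.60):
    print(d, round(jukes_cantor_distance(d), 4))
```

The correction is negligible for small d and grows quickly as d approaches 0.75, where the two sequences are no more similar than random ones.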

o We can test for selection in protein-coding regions by comparing either the rate of differences (dN/dS) or the rate of substitutions (Ka/Ks) for non-synonymous/amino-acid-changing mutations versus synonymous mutations, per site. Mutations that change protein structure are presumably ‘more visible’ to selection than those that do not.

§ In calculating these statistics, one must account for the number of sites that are potentially synonymous or non-synonymous. A twofold degenerate site is often counted as 2/3 non-synonymous, 1/3 synonymous.

§ Under neutrality, the ratio ≈ 1.

§ Under purifying (negative) selection, the ratio may be ____. (Changes to the protein are not tolerated by selection and are removed)

§ Under positive selection, the ratio may be _________. (Changes to the protein are favored and accumulate at an accelerated rate)

§ The molecular clock assumes that the number of substitutions is directly proportional to the amount of evolutionary time that separates two sequences. It is critical to realize, however, that TMRCA is half of this value (because the two lineages together sum to the total amount of evolutionary time).

o In general, substitution rates do vary across organisms, across genomic regions, etc., and so the ‘clock’ is not always constant. But the point remains that we can quantify these rates and then make assumptions about them in order to interpret their significance.

o The McDonald-Kreitman test assumes that, under neutrality, a molecular clock type assumption will preserve the dN/dS ratio across large evolutionary timescales. The test quantifies this by comparing the dN/dS for recent, microevolutionary events (which give rise to polymorphism within a population or species) with the dN/dS for ancient, macroevolutionary events (which give rise to differences between species).

§ The test is most straightforwardly implemented as a chi-squared test using a contingency table of non-synonymous/synonymous and polymorphic/divergent mutations. (In this case, there are ____ data classes, and ____ fixed parameters, namely __________________), meaning that df = ____.

§ If polymorphism is not proportional to divergence, we must interpret:

• If div > poly: suggests (eg) ___________________

• If div < poly: suggests (eg) ___________________
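The contingency-table version of the McDonald-Kreitman test can be sketched as follows. The counts are hypothetical, the function name is made up, no continuity correction is applied, and the `math.erfc` shortcut for the p-value assumes the single degree of freedom of a 2×2 table.

```python
import math

def mk_chi_squared(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn):
    """Chi-squared test of independence on the 2x2 McDonald-Kreitman table.

    Rows: divergent (fixed) vs polymorphic; columns: non-synonymous vs synonymous.
    Assumes every row and column total is non-zero.
    """
    a, b, c, d = fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts, for illustration only.
chi2, p_value = mk_chi_squared(fixed_nonsyn=20, fixed_syn=40,
                               poly_nonsyn=5, poly_syn=50)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
print("fixed N/S ratio:", 20 / 40, " polymorphic N/S ratio:", 5 / 50)
```

With small counts, an exact test is often preferred to the chi-squared approximation.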

Chapter 8: Evolutionary Quantitative Genetics

o The Mendelian paradigm is monogenic, but we want to be able to talk about polygenic (complex, quantitative) traits. We want to be able to ask questions like, how much do genetics influence the phenotype? Unfortunately, this is an ill-formed question, and there’s no way to talk about “how genetic a trait is” in the abstract. We have to ground our discussion in populations.

o This opens the door for the concept of heritability

o Technical definition: _______________________________________

o Interpretation: extent to which genetic differences among individuals explain phenotypic differences among individuals

o Doesn’t tell us how many genes are involved in a trait, e.g., but does help us understand relative contribution of genetics and environment for a given population

o There are two ways to get a hold on it: by measuring variance or by calculating dominance coefficients.

o Variance

• One approach involves quantifying heritability by looking at the relationships among the variances of various quantities (e.g. phenotype) and how they are related from one individual to its family member

• Variance decomposition: VP = VG + VE + VGE (Variance due to _______, _________, and _________________ together explain the total phenotypic variance)

o VG = VA + VD + VI (Variance due to genetics, in turn, is explained by the variance due to additive allele effects, dominance effects, and epistatic interactions)

• Broad sense: $H^2$ = VG / VP

• Narrow sense: $h^2$ = VA / VP

o $h^2$ = slope for regression of mean offspring vs mean parents

Visscher, Hill & Wray, 2008

• We can use the narrow-sense heritability for predicting response to selection

o “Breeder’s equation”: $R = h^2 S$, where

§ S = ____________ = (mean phenotypic value for breeding population) – (mean phenotypic value for entire population) and

§ R = response to selection = the change in the mean phenotypic value of the population after ____________
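A small worked example of the breeder's equation, using the definitions above. All numbers are hypothetical and the function names are invented for this sketch.

```python
def narrow_sense_heritability(v_a, v_p):
    """h^2 = VA / VP."""
    return v_a / v_p

def response_to_selection(h2, mean_breeders, mean_population):
    """R = h^2 * S, with S = (mean of breeding population) - (mean of entire population)."""
    s = mean_breeders - mean_population
    return h2 * s

# Hypothetical variance components and phenotypic means.
h2 = narrow_sense_heritability(v_a=2.0, v_p=5.0)   # 0.4
R = response_to_selection(h2, mean_breeders=12.5, mean_population=10.0)
print(h2, R, 10.0 + R)  # heritability, response, predicted mean next generation
```

Only the additive fraction of the parents' advantage is expected to show up in their offspring, which is why the predicted shift is smaller than S.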

• Dominance

• Another approach to quantitative genetics posits two values, a and d, which can be used to represent the strength of dominance and the relationship among the phenotypic values associated with each genotypic class.

o We assume the mean phenotypes (also called ‘genotypic value’) of AA, AA’ and A’A’ are __, __, and ___ respectively.

o Then, using HWE proportions, we calculate the population mean (which depends critically on genotype frequencies). The mean is ___ + ___ + ___ (which can be simplified).

o We can then describe our genotypic values as deviations from the population mean. We say: (genotypic value) – (something) = (pop mean); therefore (something) = (genotypic value) – (pop mean). Now ‘something’ is our genotypic value expressed as a deviation from the population mean (see left side of the diagram below).

o We can then calculate a new statistic, the breeding values for each genotype. We can make sense of this by thinking of the breeding value of an allele as representing the phenotypic contribution it would make if it were strictly additive, and hence the breeding value of a genotype equals the sum of the breeding values of the constituent alleles.

§ Thus, if there is no dominance and all effects are purely additive, genotypic values _________ breeding values.

§ We could calculate the per-allele effect on phenotype to get breeding values, or we could project our genotypic values onto their least-squares regression line as in the diagram.

§ On the right side of the diagram below, we express breeding values as deviations from the population mean. As with the genotypic values, this convention makes analysis easier.

§ Because of the definition of breeding values given above, we can get VA by looking at the variance of the breeding values.

o We can compare the breeding values to the genotypic values to get the dominance deviations. (As suggested above, when there is no dominance, breeding values = genotypic values and hence dominance deviation = ___).

§ In the below diagram, these values appear in blue, and represent the distance between the genotypic values (black circles) and the breeding values (white circles).

§ We can look at the variance of the dominance deviations to calculate VD.
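The chain from genotypic values to breeding values, dominance deviations, VA and VD can be followed numerically. The sketch below assumes one common textbook convention (Falconer's), which may differ from the one used in lecture: genotypic values +a, d and −a for AA, AA’ and A’A’, with p the frequency of A and q = 1 − p. The parameter values are hypothetical.

```python
def variance_components(p, a, d):
    """One-locus breeding values, dominance deviations, VA and VD under HWE.

    Assumed convention: genotypic values are +a (AA), d (AA'), -a (A'A').
    """
    q = 1 - p
    freqs = [p * p, 2 * p * q, q * q]           # HWE genotype frequencies
    genotypic = [a, d, -a]
    pop_mean = sum(f * g for f, g in zip(freqs, genotypic))
    alpha = a + d * (q - p)                     # average effect of substituting A for A'
    # Breeding values, already expressed as deviations from the population mean.
    breeding = [2 * q * alpha, (q - p) * alpha, -2 * p * alpha]
    # Dominance deviation = (genotypic value - pop mean) - breeding value.
    dominance = [g - pop_mean - b for g, b in zip(genotypic, breeding)]
    v_a = sum(f * b * b for f, b in zip(freqs, breeding))
    v_d = sum(f * x * x for f, x in zip(freqs, dominance))
    v_g = sum(f * (g - pop_mean) ** 2 for f, g in zip(freqs, genotypic))
    return pop_mean, breeding, dominance, v_a, v_d, v_g

# Hypothetical locus with partial dominance.
pop_mean, breeding, dominance, v_a, v_d, v_g = variance_components(p=0.5, a=1.0, d=0.5)
print(pop_mean, breeding, dominance)
print(v_a, v_d, v_g)   # v_a + v_d should equal v_g for a single locus
```

Setting d = 0 in the call is a quick way to see what happens to the dominance deviations when all effects are additive.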

       Good luck on the exam, try the review problems on the website, and don’t forget to send in your final papers via email by 6PM on Weds 5/6 (end of reading period). JV will hold extra OH on Saturday, 4/25 from 1pm – 3pm in Northwest Labs 471.
