Statistics: How to get data and model to fit together?


Experimental planning

Statistical methods can be used to plan one's experiments better:

- You might want to be able to detect (with 95% confidence) a difference larger than a given value between two sets of mean precipitations with a given probability (power). If so, plot the power as a function of the data size and see where the rejection probability (power) rises above the level you want.
- If you want to do enough stage-discharge measurements that the uncertainty of the rating curve exponent (b) is less than 0.1 with 90% probability, you need either an analytical expression for the uncertainty as a function of the data size and the parameter values, or a simulation.

When model + methodology clashes with reality

We wish to have the relationship between stage (h) and discharge (Q) in the following form:

$$Q = C\,(h - h_0)^b$$

where $h_0$ is the zero plane, $b$ gives the shape of the river profile and $C$ has to do with the width of the river.

This is fitted to a set of stage-discharge measurements.

With maximum likelihood (ML) estimation you get infinite estimates for some datasets! If you stop the ML optimization at any given time, the fit is good but the parameter values are unreasonable!

[Figure: river profile with the labels h, datum (h = 0), Q, h0, C and b.]

Frequentist estimation does not have a way to encode what constitutes reasonable and unreasonable parameter values. Bayesian statistics, on the other hand...

Schools of statistics: Bayesian statistics

Everything about our knowledge concerning unknown quantities (parameters and models) is handled using probability theory.

Central to this is Bayes' formula:

$$f(\theta \mid D) = \frac{f(D \mid \theta)\, f(\theta)}{f(D)}$$

Here $f(\theta)$ is the prior distribution, $f(D \mid \theta)$ is the data likelihood and $f(\theta \mid D)$ is the posterior distribution. When used for parametric inference, Bayes' formula allows you to switch between the distribution of data given the parameters (the likelihood) and the distribution of the parameters given the data (the posterior distribution).

Estimates (single values) are no longer the focus; the posterior distribution is! If you want estimates, you can take the mean, median or mode (peak) of the posterior distribution.

Bayesian statistics: a medical warm-up

Imagine a sickness with a medical test that always gives a positive indication if you have that sickness. It is quite accurate, giving false positives in only 1% of the cases where the patient doesn't have the sickness. The sickness is, however, rare: only one in a thousand has it. If you test positive, what is the probability that you have the sickness?

$$\Pr(\text{sick} \mid +) = \frac{\Pr(+ \mid \text{sick})\,\Pr(\text{sick})}{\Pr(+ \mid \text{sick})\,\Pr(\text{sick}) + \Pr(+ \mid \text{healthy})\,\Pr(\text{healthy})} = \frac{1 \cdot 0.001}{1 \cdot 0.001 + 0.01 \cdot 0.999} \approx 0.09$$

There is thus only a 9% chance that you have the sickness, given that you test positive! What is happening?

Bayesian statistics: a graphic medical warm-up

[Figure: one thousand people before the test, represented by small circles marked as sick or healthy.]

Bayesian statistics: a medical warm-up (3)

After the test, among the ones testing positive there remain one sick and about 10 healthy persons:

[Figure: the people who tested positive, marked as sick or healthy.]

The probability that you have the sickness has increased dramatically, but still ten out of eleven will be healthy even though they have tested positive. Only about 9% tested positive because they actually have the sickness.

A positive test is thus evidence (information increasing the probability) for the sickness, but not such strong evidence that we believe it is more likely than not that you have the sickness.

A naive frequentist doing model testing would say that the probability of testing positive when healthy (1%) is less than the usual significance level (5%), and that all who tested positive thus have the sickness with 95% confidence. An experienced frequentist will call the state of the patient a hidden variable rather than a parameter or model, and then proceed using Bayes' formula.
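The thousand-person picture can be reproduced as a small calculation. This is a minimal sketch; the function name `expected_counts` is hypothetical, and the numbers are those of the warm-up (prevalence 1 in 1000, no false negatives, 1% false positives).

```python
def expected_counts(population, prevalence, sensitivity, false_positive_rate):
    """Expected numbers of sick and healthy people among those who test positive."""
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity
    false_positives = healthy * false_positive_rate
    return true_positives, false_positives


true_pos, false_pos = expected_counts(1000, 0.001, 1.0, 0.01)
posterior = true_pos / (true_pos + false_pos)

print(f"sick and positive:    {true_pos:.2f}")
print(f"healthy and positive: {false_pos:.2f}")
print(f"Pr(sick | positive):  {posterior:.3f}")
```

Changing the prevalence argument shows how strongly the answer depends on the prior probability of being sick.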

Prior knowledge: prior distribution

A prior distribution should summarize the knowledge we have concerning the model before the data arrives.

Typically, one first chooses a parametric distribution family, for convenience and because it has the right characteristics with respect to the nature of the parameter. Since these distributions again have parameters, these are called hyperparameters. If one suspects this choice can influence the results, one tries several candidate distributions (robustness analysis).

One then adapts the hyperparameters to whatever specific information one has. For instance, one can state a 95% credibility interval (an interval encompassing 95% of the probability) and adjust the hyperparameters of the prior distribution until it reproduces that interval.
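As an illustration of adapting hyperparameters to a stated credibility interval, here is a minimal sketch that assumes a normal distribution family. The helper `normal_prior_from_interval` and the interval from 1 to 4 for the rating curve exponent b are hypothetical choices, not values from the slides.

```python
from statistics import NormalDist


def normal_prior_from_interval(lower, upper, coverage=0.95):
    """Normal prior whose central credibility interval is [lower, upper]."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)  # about 1.96 for 95%
    mean = (lower + upper) / 2.0
    sd = (upper - lower) / (2.0 * z)
    return NormalDist(mean, sd)


# Hypothetical pre-knowledge: b lies between 1 and 4 with 95% probability.
prior_b = normal_prior_from_interval(1.0, 4.0)
covered = prior_b.cdf(4.0) - prior_b.cdf(1.0)

print(f"mean = {prior_b.mean:.3f}, sd = {prior_b.stdev:.3f}, coverage = {covered:.3f}")
```

For a strictly positive parameter, the same recipe could be applied to the logarithm of the parameter, which gives a log-normal prior.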

Common mistake: Looking at the data to determine what a reasonable prior would be. This is prior-data feedback, an example of circular reasoning. It can easily give unreasonable indications of uncertainty and unreasonable model choices.

Prior knowledge: prior distribution (2)

Prior distributions are at first glance purely subjective, but can be made acceptable to others by:

- Incorporating common knowledge (including previous data) concerning the field of interest (intersubjectivity).
- Looking at the variations that are in nature itself. For instance, for hydrological stations, what is the typical range of stage-discharge rating curve parameters? Perhaps one can find nature's own prior distribution.
- Using so-called non-informative prior distributions. PS: Such distributions are often not proper distributions. For instance, there does not exist a probability distribution that gives equal probability to all numbers on the real line. Still, improper prior distributions can give proper posterior distributions. PPS: Do not use this trick when doing model comparison!

Bayesian statistics: distributions

One starts off the analysis with two things:

- A model that says how the data was produced and which parameters characterize this distribution. This is the likelihood, $f(D \mid \theta)$.
- A prior distribution, $f(\theta)$, which summarizes our pre-knowledge concerning the parameters.

From this, one can calculate the following:

- The posterior distribution, $f(\theta \mid D)$. This summarizes our state of knowledge after the data has been handled. If you want estimates, you get them from this (means, medians or modes).
- Distributions of derived quantities. For instance: discharge at a given stage, when $Q(h) = C\,(h - h_0)^b$.
- A prior prediction distribution, called the marginal likelihood or the model likelihood. $f(D)$ gives the probability of getting data outcomes unconditioned on the parameter values (only conditioned on our pre-knowledge). Used in model comparison.
- A posterior prediction distribution, $f(D_{new} \mid D)$, the probability of new data outcomes given the old data. (This is an example of a derived quantity.) It takes into account the parameter uncertainty after the data has been handled.

PS: An old posterior distribution will be the prior distribution when we want to handle new data. The old posterior prediction distribution will be the new prior predictive distribution.

Bayes' formula (only one single model here):

$$f(\theta \mid D) = \frac{f(D \mid \theta)\, f(\theta)}{f(D)}$$

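Bayesian statistics: comparison of probabilities

Bayes' formula:

$$f(\theta \mid D, M) = \frac{f(D \mid \theta, M)\, f(\theta \mid M)}{f(D \mid M)}$$

We can see whether a parameter value increases in probability relative to another parameter value:

$$\frac{f(\theta_1 \mid D, M)}{f(\theta_2 \mid D, M)} = \frac{f(D \mid \theta_1, M)}{f(D \mid \theta_2, M)} \cdot \frac{f(\theta_1 \mid M)}{f(\theta_2 \mid M)}$$

The parameter value $\theta_1$ increases in probability relative to $\theta_2$ if $f(D \mid \theta_1, M) > f(D \mid \theta_2, M)$, i.e. if the data is more probable for parameter value $\theta_1$ than for $\theta_2$.

The same goes for models:

$$\frac{\Pr(M_1 \mid D)}{\Pr(M_2 \mid D)} = \frac{\Pr(D \mid M_1)}{\Pr(D \mid M_2)} \cdot \frac{\Pr(M_1)}{\Pr(M_2)}$$

A model increases in probability relative to another if the data is more probable (irrespective of the parameter values) for that model than for the other, $\Pr(D \mid M_1) > \Pr(D \mid M_2)$.

Most importantly: One does not gain anything from absolute probabilities. It is only by comparing probabilities that you learn something!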

Bayesian statistics: model comparison

Technically, we do model comparison by using Bayes' formula:

$$\Pr(M \mid D) = \frac{f(D \mid M)\, \Pr(M)}{f(D)}, \qquad f(D \mid M) = \int f(D \mid \theta, M)\, f(\theta \mid M)\, d\theta$$

The engine in this inference is the marginal likelihood (prior predictive distribution), $f(D \mid M)$. When we compare these, we can get evidence for one model or the other.

Since prediction strength is the key, overcomplicated models (having larger parameter uncertainties) are naturally penalized without having to do any extra work!

Example: Extrasensory perception

A subject was asked whether the experimenter held his hand over the subject's right or left hand, and gave 18 correct answers out of 30 questions. Assuming independent answers, the number of correct answers is binomially distributed, either with success rate p = 0.5 (no extrasensory perception) or with an unknown, uniformly distributed success rate (extrasensory perception).

One can show that the prior predictive distribution for a uniformly distributed success rate is also uniform, giving equal probability to all outcomes.
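The two prior predictive distributions can be compared numerically. This is a minimal sketch: the uniform prior predictive probability 1/(n+1) follows from integrating the binomial probability over a uniform success rate, and the names `binomial_pmf` and `favours_guessing` are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


n = 30
# With a uniform prior on the success rate, every outcome 0..n has probability 1/(n+1).
uniform_predictive = 1.0 / (n + 1)

# Outcomes that are more probable under p = 0.5 than under the uniform-rate model.
favours_guessing = [k for k in range(n + 1)
                    if binomial_pmf(k, n, 0.5) > uniform_predictive]

# Ratio of marginal likelihoods for the observed 18 correct answers.
bayes_factor_18 = binomial_pmf(18, n, 0.5) / uniform_predictive

print("outcomes favouring p = 0.5:", favours_guessing[0], "to", favours_guessing[-1])
print(f"Pr(18 | p = 0.5) / Pr(18 | p unknown) = {bayes_factor_18:.2f}")
```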

Any outcome between 11 and 19 will be evidence for p = 0.5 (see plot); 18 correct answers are thus more likely with random guessing than with extrasensory perception.

[Figure: prior predictive distribution for p = 0.5 (red) and p unknown (blue).]

Bayesian model average

One can make distributions of any derived quantity, unconditioned on the parameters (prior and, in this case, posterior prediction distributions). Writing $\Delta$ for the derived quantity:

$$f(\Delta \mid D, M) = \int f(\Delta \mid \theta, M)\, f(\theta \mid D, M)\, d\theta$$

Example: Stage-discharge rating curve conditioned only on the data and the number of segments, not the rating curve parameters.

In the same fashion, one can find the distribution of a derived quantity even unconditioned on the model:

$$f(\Delta \mid D) = \sum_{M} f(\Delta \mid D, M)\, \Pr(M \mid D)$$

(From the law of total probability.)

Example: The stage-discharge rating curve given the data (but conditioned neither on the parameters nor on the number of segments).

Bayesian vs frequentist: the pragmatic aspect

When the model complexity is below a certain threshold, frequentist methods are typically easier. Above that threshold, Bayesian methods become easier.

[Figure: work as a function of complexity for Bayesian and frequentist methods.]

Simulation and the law of large numbers

Assume you are interested in the properties of a stochastic variable (probabilities, mean, quantiles, standard deviation etc.). Assume further that you cannot calculate these things analytically. What you can do, however, is sample from that variable.

With enough samples (an ensemble), you can estimate probabilities, means, quantiles and standard deviations. Examples:

- Calculate the probability of getting Yatzy from an algorithm for handling dice throws and the rules of Yatzy.
- Estimate the probability of an error situation in a production system, given the error rates of each component of that system.
- Calculate the expected discharge from an ensemble of equally probable weather forecasts.
- Find the number of data necessary to decrease the uncertainty of a parameter below a given value with a given probability.
- Find the properties of the posterior distribution given samples from it (via MCMC sampling).
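The first example in the list can be sketched as follows. The playing strategy is an assumption made for illustration (five dice, three throws, always keep the most frequent face and re-throw the rest), and `yatzy_in_three_throws` is a hypothetical name.

```python
import random
from collections import Counter


def yatzy_in_three_throws(rng):
    """Simulate one turn: keep the most frequent face, re-throw the other dice."""
    dice = [rng.randint(1, 6) for _ in range(5)]        # first throw
    for _ in range(2):                                  # up to two re-throws
        face, count = Counter(dice).most_common(1)[0]
        dice = [face] * count + [rng.randint(1, 6) for _ in range(5 - count)]
    return len(set(dice)) == 1                          # all five dice equal


rng = random.Random(2024)          # fixed seed so the run is reproducible
n_samples = 20000
hits = sum(yatzy_in_three_throws(rng) for _ in range(n_samples))
estimate = hits / n_samples

print(f"estimated probability of Yatzy within three throws: {estimate:.4f}")
```

By the law of large numbers the estimate settles down as `n_samples` grows; its standard error shrinks roughly as one over the square root of the number of samples.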

Bayesian statistics: numerical methods (MCMC)

Reminder, Bayes' formula (for only one model):

$$f(\theta \mid D) = \frac{f(D \mid \theta)\, f(\theta)}{f(D)}, \qquad f(D) = \int f(D \mid \theta)\, f(\theta)\, d\theta$$

The marginal distribution, $f(D)$, is the problematic rascal here: not all integrals can be calculated analytically.

A normalization constant is a factor in a distribution which does not depend on whatever you are taking the distribution over (in this case the parameter set, $\theta$). In this case, $f(D)$ is an unknown normalization constant.

A Markov chain (more about that later) is a time series where the value now depends only on the previous value. Some such time series stabilize to some distribution when run for long enough.

It is possible to make a Markov chain whose stationary distribution equals the distribution you're after, without knowing the normalization constant. This is called MCMC (Markov chain Monte Carlo).

WinBUGS is a system which automatically runs MCMC sampling given a model, a prior distribution and the data. (Alternative: make your own MCMC module in R.)

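Bayesian statistics: more MCMC

Generally, an MCMC routine goes like this:

1. Make a starting parameter set, $\theta_{old}$.
2. Find a way (a proposal distribution*) to sample a new parameter set given the old: $\theta_{new} \sim g(\theta_{new} \mid \theta_{old})$.
3. Accept the new parameter set with probability

$$\min\left(1,\; \frac{f(D \mid \theta_{new})\, f(\theta_{new})\, g(\theta_{old} \mid \theta_{new})}{f(D \mid \theta_{old})\, f(\theta_{old})\, g(\theta_{new} \mid \theta_{old})}\right)$$

   and use the old set if not.
4. Go back to 2 as many times as you want.

PS: The normalization constant disappears, since $f(D)$ cancels in the ratio.

Important concepts:

- Burn-in: Number of samples needed before the time series converges towards the stationary distribution.
- Spacing: Number of samples needed between the ones you keep, so that the kept samples are approximately independent.

[Figure: MCMC trace with the burn-in period and the spacing marked.]

* The proposal distribution determines how efficient the algorithm is.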
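The routine can be sketched for a one-parameter problem. The choices below are assumptions for illustration only: the data are the 18 correct answers out of 30 from the extrasensory perception example, the prior on the success rate is uniform, and the proposal is a symmetric Gaussian random walk (so the proposal densities cancel in the acceptance ratio). The names `log_unnormalized_posterior` and `metropolis` are hypothetical.

```python
import math
import random

SUCCESSES, TRIALS = 18, 30  # extrasensory perception data


def log_unnormalized_posterior(p):
    """log[f(D|p) f(p)] up to a constant: binomial likelihood, uniform prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return SUCCESSES * math.log(p) + (TRIALS - SUCCESSES) * math.log(1.0 - p)


def metropolis(n_iter=20000, burn_in=1000, spacing=5, step=0.1, seed=1):
    rng = random.Random(seed)
    theta_old = 0.5                                    # step 1: starting value
    log_post_old = log_unnormalized_posterior(theta_old)
    kept = []
    for i in range(n_iter):
        theta_new = theta_old + rng.gauss(0.0, step)   # step 2: symmetric proposal
        log_post_new = log_unnormalized_posterior(theta_new)
        # Step 3: accept with probability min(1, ratio); f(D) never appears.
        accept_prob = math.exp(min(0.0, log_post_new - log_post_old))
        if rng.random() < accept_prob:
            theta_old, log_post_old = theta_new, log_post_new
        # Discard the burn-in, then keep every `spacing`-th sample.
        if i >= burn_in and (i - burn_in) % spacing == 0:
            kept.append(theta_old)
    return kept


samples = metropolis()
posterior_mean = sum(samples) / len(samples)
print(f"kept {len(samples)} samples, posterior mean = {posterior_mean:.3f}")
```

For this particular model the posterior is known analytically (a Beta(19, 13) distribution with mean 19/32), which makes it a convenient check of the sampler.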

Regression

Regression is when one stochastic variable (the response) depends on other variables (covariates / explanatory variables). A part of the variation in the response variable is thus explained by the variation in the other variables. Example: Body weight (response) versus height (covariate).

[Figure: weight plotted against height.]

Linear regression

A linear regression examines the linear relationship between the response and one or more covariates:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

Note that the model is linear in the regression parameters, $\beta_0, \dots, \beta_p$, but not necessarily in the covariates. So the model $Y = \beta_0 + \beta_1 x + \beta_2 x^2$ is a linear model. The statistical model behind this is the following:

$$Y_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_p x_{p,i} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

where $\varepsilon_i$ is independent noise.

Linear regression: example with only one covariate

The regression parameters, a and b, can be fitted to the data using for instance ML estimation.

[Figure: weight against height with the fitted regression line.]

The model is weird though, since it allows for negative expected weights and weight measurements (because of the assumption of normality).

One can save the situation by doing a log-transform on both response and covariate. This means a power law on the original scale:

$$\log(\text{weight}) = a + b \log(\text{height}) + \varepsilon \quad\Longleftrightarrow\quad \text{weight} = e^{a}\, \text{height}^{b}\, e^{\varepsilon}$$

[Figure: weight against height with the fitted power law.]

Linear regression with only one covariate

A regression with only one covariate is easy to represent both mathematically, $Y = \alpha + \beta x$, and graphically. Some terminology:

- Fitted responses are what you get when you use the regression line for the actual data.
- A residual is the difference between actual and fitted response.

[Figure: weight against height, with an actual response, the fitted response and the residual marked at a covariate value x.]

For a single covariate, the correlation between actual and fitted response is equal to the correlation between (actual) response and covariate.

The regression coefficient is related to the correlation, $\rho$, in a simple manner:

$$\beta = \rho\, \frac{\sigma_Y}{\sigma_x}$$

This also goes for the data estimates (estimated regression parameter vs empirical correlation and empirical standard deviations).
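The terminology can be made concrete with a small least-squares fit. This is a minimal sketch with made-up height and weight values; `simple_regression` is a hypothetical helper, and for normal noise the least-squares fit coincides with the ML fit.

```python
from math import sqrt

# Hypothetical data: heights in metres, weights in kilograms.
heights = [1.55, 1.62, 1.70, 1.76, 1.83, 1.90]
weights = [52.0, 60.0, 66.0, 74.0, 79.0, 90.0]


def simple_regression(x, y):
    """Least-squares intercept and slope, plus empirical correlation and standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / sqrt(sxx * syy)
    return intercept, slope, correlation, sqrt(sxx / (n - 1)), sqrt(syy / (n - 1))


intercept, slope, corr, sd_x, sd_y = simple_regression(heights, weights)
fitted = [intercept + slope * h for h in heights]        # fitted responses
residuals = [w - f for w, f in zip(weights, fitted)]     # actual minus fitted

print(f"slope = {slope:.2f}, correlation * sd_y / sd_x = {corr * sd_y / sd_x:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```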

Multivariate linear regression: an example

Example: Tree volume as a function of tree height and diameter. If we log-transform everything and do a linear regression, $\log(V_i) = \beta_0 + \beta_1 \log(D_i) + \beta_2 \log(H_i) + \varepsilon_i$, that's the same as searching for the following expression on the original scale:

$$V_i = e^{\beta_0}\, D_i^{\beta_1}\, H_i^{\beta_2}\, e^{\varepsilon_i}$$

If you have more than one covariate, this is not a problem for linear regression (though presenting the results in a graph is then problematic).

In R we get the following output (ld = log diameter, lh = log height):

                Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept) -6.64580     0.81473   -8.157  9.23e-09 ***
    ld           1.98982     0.08026   24.793   < 2e-16 ***
    lh           1.11597     0.20791    5.368  1.14e-05 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.08275 on 27 degrees of freedom
    Multiple R-squared: 0.975, Adjusted R-squared: 0.9731
    F-statistic: 526.4 on 2 and 27 DF, p-value: < 2.2e-16

This tells us that the estimated relationship is $\log(V) = -6.65 + 1.99 \log(D) + 1.12 \log(H)$, or $V(D, H) \propto D^{1.99} H^{1.12}$.

Multivariate linear regression: more on the R output

                Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept) -6.64580     0.81473   -8.157  9.23e-09 ***
    logdiameter  1.98982     0.08026   24.793   < 2e-16 ***
    logheight    1.11597     0.20791    5.368  1.14e-05 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.08275 on 27 degrees of freedom
    Multiple R-squared: 0.975, Adjusted R-squared: 0.9731
    F-statistic: 526.4 on 2 and 27 DF, p-value: < 2.2e-16

Reading the output:

- Rows: the covariates (really the parameters $\beta_0$, $\beta_1$, $\beta_2$).
- Estimate: the ML estimates.
- Std. Error: standard errors (standard deviation of the estimator).
- t value: estimate / standard error = how many standard deviations away from 0 the estimate is.
- Pr(>|t|): p-value for the hypothesis $\beta = 0$.
- R-squared is often called goodness of fit. It is the squared correlation between fitted response and actual response. It is also the fraction of the variance in the response explained by the regression. The closer to 1 this is, the better the fit.
- F-statistic: test of variance (ANOVA) of whether there is anything significant in the relationship between response and covariates.

Conclusion: $\beta_1 = \beta_2 = 0$ can be ruled out here. But $\beta_1$ is very near 2, and $\beta_2$ is less than a standard error away from 1. Thus the geometrical relationship $V \propto D^2 H$ cannot be ruled out.

ANOVA

Analysis of variance is performed in order to check

- whether a continuous outcome is different for different categories (one-way ANOVA), when you have a single set of categories;
- whether it depends linearly or non-linearly (interaction) on two sets of categories (two-way ANOVA).

Technically it is a sub-method of linear regression, where all covariates are discrete. The tests are performed by comparing various ways to estimate the variance, from residuals and from category differences.
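The comparison of variance estimates in a one-way ANOVA can be sketched as follows. The three groups are made-up numbers and `one_way_anova_f` is a hypothetical helper; it returns the F statistic only, and a p-value would additionally need the F distribution with the two degrees of freedom.

```python
def one_way_anova_f(groups):
    """F statistic: variance estimated from category means over variance from residuals."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, group_means) for y in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical outcome measurements in three categories.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
print(f"F statistic on 2 and 6 degrees of freedom: {f_stat:.2f}")
```

A large F statistic means the category means differ more than the within-category noise can explain.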

Linear regression: what happens when one runs amok in covariates

As an example, let's put some higher order polynomial terms in the weight-to-height regression:

$$\text{weight} = \beta_0 + \beta_1\, \text{height} + \beta_2\, \text{height}^2 + \dots + \beta_k\, \text{height}^k + \varepsilon$$

The fit will improve, but the ability of the regression to predict new data can easily decline. The relationship becomes more and more chaotic, because the parameter uncertainties are increasing.

With the possibilities in linear regression, one can be tempted to just add more and more covariates. The fit will always improve.

[Figure: weight against height with higher order polynomial fits.]

How to avoid running amok?

There are strategies to avoid running amok in covariates:

1. Think about the nature of the data (are the quantities strictly positive?) and what you want to do with your regression.
2. Use hypothesis testing or other model choice techniques to limit the complexity. (PS: R reports p-values for all regression parameters.)

Point 2 can be done by

a. starting with a simple model and adding the most significant covariates until no significant covariates remain;
b. starting with a sufficiently complex model and removing the most insignificant covariates until only significant covariates are left;
c. running through all regression models and calculating information criteria (not recommended when the number of possible covariates is large);
d. using Bayesian methodology (similar to points a, b and c).

Uncertainty

The estimators in the regression come with a certain uncertainty (standard error in frequentist theory or posterior distribution in Bayesian). This is reported by R.

When the confidence interval of a parameter encompasses zero, one cannot reject the hypothesis that the corresponding covariate has no effect.

Uncertainty in the regression parameters affects the uncertainty of the expected response as a function of the covariate(s).

Predictions of new measurements have, in addition, the uncertainty of the measurement noise:

$$\mathrm{Var}\big(Y_{new} - \hat{Y}(x)\big) = \mathrm{Var}\big(\hat{Y}(x)\big) + \sigma^2$$

It is therefore important to distinguish between estimation uncertainty and prediction uncertainty in regression!

[Figure: simulated dataset of weight against height, showing the estimation uncertainty band and the wider prediction uncertainty band.]

Residuals

A residual is the difference between the actual response and the fitted response. It is an estimate of the noise term. The residuals can give a hint about whether the model assumptions are valid or not.

- A clear trend in the residuals against any covariate shows that the function itself is wrong.
- A clear trend of the residuals in time suggests that one is dealing with time series or that there are gradual changes in important unmeasured covariates.
- If the residuals do not appear to be normally distributed, a transformation might be in order or a completely different type of regression may be needed.
- If the variation in the residuals has a trend (heteroscedasticity), the noise terms are wrongly specified. Remodelling or data transformation might be necessary.

[Figures: three panels of data with regression line and the corresponding residuals, and a QQ plot.]

Non-normal regression: generalized linear models

Sometimes the nature of the response is such that the normal distribution just isn't appropriate. The prime example of this is counting data.

If your response is of the type k outcomes of a particular type out of n trials for covariate x, then a binomial model for the response is appropriate. If your response is of the type k outcomes for covariate x (no upper limit for k), then a Poisson model can be appropriate.

GLM models are made by first assigning a distribution (normal, binomial, Poisson). Then you transform the salient parameter of that distribution (expectancy, success rate, rate) to something that can take values on the whole real line (this transformation is called the link). That transformed parameter is then given a linear model, $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$.

Since this type of analysis is so common, it has a name (GLM) and ready-made methods in R (called glm).

GLM with a binomial model is often called logistic regression (due to the standard transformation type), while GLM with the Poisson model is called Poisson regression.
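The role of the link can be sketched with the two standard choices, the logit link for a binomial success rate and the log link for a Poisson rate. The regression parameters and the covariate value below are made up for illustration; no fitting is done here.

```python
import math


def logit(p):
    """Link for the binomial model: maps a success rate in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Maps the linear predictor back to a success rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical regression parameters and covariate value.
beta0, beta1 = -2.0, 0.5
x = 3.0
eta = beta0 + beta1 * x              # linear predictor, can be any real number

success_rate = inverse_logit(eta)    # logistic regression: binomial success rate
poisson_rate = math.exp(eta)         # Poisson regression: log link, rate is always positive

print(f"linear predictor = {eta:.2f}")
print(f"success rate     = {success_rate:.3f}")
print(f"Poisson rate     = {poisson_rate:.3f}")
```

Whatever value the linear predictor takes, the success rate stays between 0 and 1 and the Poisson rate stays positive, which is the point of the link.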

Non-linear regression

Sometimes it's simply not reasonable to have a linear relationship between response and covariates. The nature of the data might suggest a different form.

An example is stage-discharge rating curves with unknown zero plane:

$$Q = C\,(h - h_0)^b$$

If $h_0$ was known, a log-transform would make this into a linear relationship. But when you don't know $h_0$, the log-transformed equation, with $q = \log Q$ and $a = \log C$, is also non-linear:

$$q = a + b \log(h - h_0)$$

ML optimization is still possible, but only with numerical methods. In rating curve analysis, you can actually solve for $a$ and $b$ analytically, so that only $h_0$ is optimized numerically.
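The split between an analytical solution for a and b and a numerical search for h0 can be sketched as below. This assumes independent normal noise on log-discharge, so that maximizing the likelihood for a fixed h0 is an ordinary least-squares fit. The data are synthetic and noise-free, and the function names and the grid are hypothetical.

```python
import math


def fit_given_h0(stages, discharges, h0):
    """Least-squares a and b in q = a + b*log(h - h0), and the residual sum of squares."""
    x = [math.log(h - h0) for h in stages]
    q = [math.log(Q) for Q in discharges]
    n = len(x)
    mean_x, mean_q = sum(x) / n, sum(q) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (qi - mean_q) for xi, qi in zip(x, q)) / sxx
    a = mean_q - b * mean_x
    rss = sum((qi - a - b * xi) ** 2 for xi, qi in zip(x, q))
    return a, b, rss


def profile_over_h0(stages, discharges, h0_grid):
    """Brute-force search: the h0 on the grid whose analytical fit has the smallest RSS."""
    fits = [(fit_given_h0(stages, discharges, h0), h0) for h0 in h0_grid]
    (a, b, rss), h0 = min(fits, key=lambda item: item[0][2])
    return a, b, h0, rss


# Synthetic, noise-free measurements generated with C = 5, b = 2 and h0 = 0.30.
stages = [0.5, 0.7, 1.0, 1.4, 2.0, 2.8]
discharges = [5.0 * (h - 0.30) ** 2.0 for h in stages]

# Candidate zero planes in steps of 1 cm, all below the lowest measured stage.
h0_grid = [i / 100 for i in range(-50, 46)]
a_hat, b_hat, h0_hat, rss = profile_over_h0(stages, discharges, h0_grid)

print(f"C = {math.exp(a_hat):.3f}, b = {b_hat:.3f}, h0 = {h0_hat:.2f}")
```

With real, noisy measurements the residual sum of squares can be plotted against h0 to inspect the shape of the profile likelihood.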

For more complicated models, sophisticated optimization methods or MCMC may be necessary. One danger with non-linear regressions is that the likelihood can have multiple peaks (multimodality). This is the case for multi-segmented rating curves.

Rating curve estimation at Gryta

Let's look at the station Gryta, without assuming $h_0 = 0$.

We can use brute force, by looking at an interval of possible $h_0$ values going from $h_m$ to $h_m - 100$ m in steps of 1 cm.

Please note the previously mentioned phenomenon that some likelihoods get better the lower the value of $h_0$.

It looks like we can optimize the log-likelihood (and thus the likelihood) with a value for $h_0$ close to zero.

A closer look reveals that the optimal $h_0$ is +8 cm.

Bayesian regression

In VFKURVE3, one sets the prior distribution (or the hyperparameters) in a separate window.

Note that in Bayesian statistics, there are fewer problems concerning the handling of multimodality. Simulation from the posterior distribution becomes slightly more difficult, but there are efficient ways of dealing with the problem. Let's take another look at the station Gryta. Under Bayesian regression, pre-knowledge is assumed to exist. This can be retrieved from the collection of previously made rating curves (nature's prior). But for Gryta, we know that the datum is set so that $h_0 \approx 0$, and since it's a weir with a V notch, we know that $b \approx 2.5$ ought to be approximately true (from hydraulic theory).

Bayesian regression (2)

When one performs the analysis, the result is a lot of samples from the posterior distribution. In addition to estimates, you also get a notion of the parameter uncertainty.

For parameters where we have assigned sharp pre-knowledge, with most of the probability mass within a small interval, the posterior distribution will typically be inside that interval also. (If not, we have prior-data conflict.)

Since the parameters have a distribution, so does the rating curve.

With lots of data and/or good prior knowledge, the curve uncertainty can get quite small.

Generalized additive models

Generalized additive models are models where the response is explained by the added effect of functions of each covariate:

$$y = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_k(x_k) + \varepsilon$$

The functions are not known but can be arbitrarily complicated splines. A penalty term for spline complexity is added to the likelihood.

This makes it a borderline Bayesian inference, since a penalty term functions in all respects like a prior distribution.

In R this is implemented as gam in the mgcv library. The call gam(y ~ x1 + s(x2)) says that covariate x1 will be included linearly while covariate x2 will be given a generalized additive treatment.

When the number of covariates is high compared to the number of data

Sometimes we know that there ought to be a relationship between response $y$ and covariates $x_1, \dots, x_k$. But if the number of measurements is low, it can be difficult to get reliable estimates for the regression. When n