
Statistics Overview - Engineering School Class Web Sites

Statistics Overview

Biologists say, “If you need to use statistics, you don’t have enough data.”

Engineers say, “If you know enough about statistics, you don’t need much data.”

• Probability distribution

• Problems with statisticians’ notation

• Hypothesis testing

• Regression analysis

• Model fitting

• Outlier rejection

• Data presentation

• Experimental design

Sir Ronald Aylmer Fisher (1890-1962)


Probability Distribution Functions

If I make a measurement of a variable, how do I know how that sample relates to the mean?

[Figure: measured values y(x) plotted against x, alongside the probability distribution p[y(x)] of y(x) about its mean µ]


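Normal (Gaussian) Probability

Probability density p that a value selected at random from a Gaussian distribution with mean µ and variance σ² will have value x:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

µ is the mean of the distribution, given by

$$\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx$$

σ is the standard deviation of the distribution, given by

$$\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx$$

Probability P that random variable X will fall between a and b:

$$P(a \le X \le b) = \int_a^b p(x)\,dx$$

σ² is called the variance; µ and σ² are the first two moments of the PDF (σ² being the second central moment)

[Figure: Gaussian PDFs for µ = 0, σ = 1; µ = 0, σ = 2; µ = 0, σ = 3; and µ = 4, σ = 1]

Central Limit Theorem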
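The Gaussian density and the probability of falling between a and b can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names are made up for this example, and the interval probability uses the error function rather than numerical integration.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """Probability that a Gaussian random variable is less than or equal to x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def probability_between(a, b, mu=0.0, sigma=1.0):
    """Probability P that the random variable falls between a and b."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)

# Probability of landing within one standard deviation of the mean
within_one_sigma = probability_between(-1.0, 1.0, mu=0.0, sigma=1.0)
print(f"P(-1 <= X <= 1) = {within_one_sigma:.4f}")
```

Changing mu shifts the curve and changing sigma widens it, which is what the four legend entries on the slide illustrate.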


Other Distributions

Continuous

• Normal (Gaussian), Cauchy, Chi-square, exponential, F, gamma, Laplace, log-normal, Pareto, Student’s t, uniform, Weibull, Beta

• Von Mises distribution - the independent variable θ is an angle, with -π ≤ θ ≤ π

Discrete

• Bernoulli, binomial, discrete uniform, geometric, hypergeometric, negative binomial, Poisson

[Figure: oriented muscle cells]


Statistics Notation

What is written

Random variable x

Probability p(x)

Problem: x is really a dependent variable

How you should read it

Random variable y(x)

Probability p[y(x)]

Probability distributions are valid for one value of the independent variable only

Example

Measure reaction rate constant k at temperatures T1 and T2

Statisticians would note this as measuring random variable k, then compute the probability p(k)

In reality, you measured k(T1), so the implied probability p[k(T1)] applies only at T = T1; the measurements at T2 have their own distribution p[k(T2)]
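A small sketch of why this matters: statistics should be computed per value of the independent variable, not over the pooled measurements. The numbers below are invented purely for illustration (they are not data from the course), and the function name is hypothetical.

```python
from statistics import mean, stdev

# Hypothetical replicate measurements of a rate constant k,
# keyed by the temperature (independent variable) at which they were taken.
measurements = {
    "T1": [0.101, 0.098, 0.103, 0.099],
    "T2": [0.412, 0.398, 0.405, 0.409],
}

def summarize_by_condition(data):
    """Return (mean, standard deviation) of k separately for each temperature."""
    return {label: (mean(values), stdev(values)) for label, values in data.items()}

summary = summarize_by_condition(measurements)
for label, (k_mean, k_std) in summary.items():
    print(f"{label}: mean k = {k_mean:.4f}, std = {k_std:.4f}")

# Pooling both temperatures treats k as if it had a single distribution,
# which inflates the spread and describes neither condition.
pooled = [k for values in measurements.values() for k in values]
print(f"pooled: mean k = {mean(pooled):.4f}, std = {stdev(pooled):.4f}")
```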


Example

[Figure: two plots of births in 2003 versus birth weight (lb) for the United States and Germany, binned in 1.1 lb steps from 0-1.1 lb to >11 lb. First plot: number of births in 2003 (axis 0 to 1,600,000). Second plot: normalized births in 2003, p[w(country, year)] (axis 0 to 0.4).]

Data from data.un.org


Hypothesis Testing

Null Hypothesis H0

Assume that two dependent variables are drawn from distributions with the same mean μ

Test this hypothesis with t-test

• t-test gives a p value: the probability of seeing a difference between the sample means at least this large if H0 were true

• If the means are “different” (p is small), then H0 is rejected

H0 : μ1 = μ2

[Figure: a reference distribution with µ1 = 0, σ1 = 2 compared with three others: µ2 = 0, σ2 = 1; µ2 = 1, σ2 = 1; and µ2 = -5, σ2 = 1]


Testing the Hypothesis

(Student’s) T-test

Determine whether two sets of data come from “different” distributions

• “Different” = “there exists a statistically significant difference between the two”

• Statistical significance based on p value

– significantly different if p < α

– Usually, α = 0.05

• TTEST() in Excel (T.TEST() in Excel 2010 and later)

• ttest() or ttest2() in MATLAB
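For readers without Excel or MATLAB, the two-sample test can be sketched in plain Python. This is a minimal illustration, not a replacement for a statistics package: it assumes the classic equal-variance (pooled) form of Student's t-test, and it gets the two-sided p value by integrating the t distribution numerically with Simpson's rule. The function names are made up for this example.

```python
import math
from statistics import mean, variance

def t_pdf(t, dof):
    """Probability density of Student's t distribution with dof degrees of freedom."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1 + t * t / dof) ** (-(dof + 1) / 2)

def two_sample_t_test(a, b, steps=2000):
    """Return (t statistic, two-sided p value) for a pooled-variance t-test."""
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / dof
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Simpson's rule for the area under the t density between 0 and |t|
    # (steps must be even).
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_area = total * h / 3

    # The two tails hold whatever probability is not between -|t| and |t|.
    p_value = max(0.0, 1 - 2 * central_area)
    return t, p_value

alpha = 0.05
t_stat, p = two_sample_t_test([1.2, 1.3, 1.1, 1.2], [5.0, 5.1, 4.9, 5.2])
print(f"t = {t_stat:.2f}, p = {p:.3g}, significantly different: {p < alpha}")
```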

ANalysis Of VAriance (ANOVA)

t-test for the case of more than one independent variable (e.g. y(x,t))

Result is again a p value telling you whether the independent variable makes a statistically significant difference in the dependent variables

Available in the Analysis ToolPak add-in in Excel


T-Tests

http://www.socialresearchmethods.net/kb/stat_t.php


ANalysis Of VAriance (ANOVA)

Essentially a t-test for more than two samples

See course manual p. 10-11 for a full description

[Table: sum of squares and mean squares for each source of variation: Treatment, Error/Residual, and Total]
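The sums of squares and mean squares can be sketched as follows. This assumes the standard one-way ANOVA decomposition (total = treatment + error), which may differ in notation from the course manual; the function name and the sample groups are invented for illustration. Converting F to a p value needs the F distribution, which is left to a statistics package.

```python
from statistics import mean

def one_way_anova(groups):
    """Compute sums of squares, mean squares, and the F statistic for one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)

    # Treatment: spread of the group means about the grand mean
    ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error/residual: spread of each value about its own group mean
    ss_error = sum((v - mean(g)) ** 2 for g in groups for v in g)
    # Total: spread of every value about the grand mean
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    df_treatment = len(groups) - 1
    df_error = len(all_values) - len(groups)
    ms_treatment = ss_treatment / df_treatment
    ms_error = ss_error / df_error

    return {
        "ss_treatment": ss_treatment,
        "ss_error": ss_error,
        "ss_total": ss_total,
        "ms_treatment": ms_treatment,
        "ms_error": ms_error,
        "F": ms_treatment / ms_error,
    }

# Three hypothetical treatment groups with three replicates each
result = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```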


Regression Analysis

Linear Regression

Fit a line to data containing noise using the least squares method

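• Minimize the sum of squared residuals

$$S = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$

• Model with one independent variable

$$y = \beta_0 + \beta_1 x + \varepsilon$$

• Model with p-1 independent variables

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1} + \varepsilon$$

• Goodness of fit

– R²: fraction of variance in data which is explained by model

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$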
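For the one-variable case, the least squares line and its goodness of fit have a closed-form solution. The sketch below uses only the standard library; the function name and the sample data are invented for illustration.

```python
def linear_fit(x, y):
    """Least squares fit of y = intercept + slope * x; also returns R^2 and residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # These values minimize the sum of squared residuals
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)

    # Fraction of the variance in y explained by the line
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared, residuals

# Hypothetical noisy data scattered about a straight line
x_data = [0.0, 1.0, 2.0, 3.0, 4.0]
y_data = [1.1, 2.9, 5.2, 6.8, 9.1]
b0, b1, r2, res = linear_fit(x_data, y_data)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}, R^2 = {r2:.4f}")
```

The returned residuals are what you would plot to check that they scatter randomly about zero.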

Nonlinear Regression

Fit an arbitrary function to data containing noise, again using least squares method

R² isn’t necessarily a good measure of goodness of fit for a nonlinear model. Alternatives:

• L2 norm (Euclidean distance) of the residuals

• Relative error in the L2 norm
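Both error measures are easy to compute once you have the model predictions at the data points. In this sketch the relative error is taken as the L2 norm of the residuals divided by the L2 norm of the data, which is one common convention and an assumption here; the function names are made up.

```python
import math

def l2_norm(values):
    """Euclidean length of a vector of numbers."""
    return math.sqrt(sum(v * v for v in values))

def fit_errors(y_data, y_model):
    """Return (L2 norm of the residuals, relative error in the L2 norm)."""
    residuals = [yd - ym for yd, ym in zip(y_data, y_model)]
    absolute = l2_norm(residuals)
    relative = absolute / l2_norm(y_data)
    return absolute, relative

# Hypothetical data and model predictions at the same x values
absolute_error, relative_error = fit_errors([3.0, 4.0], [3.0, 1.0])
print(f"L2 norm = {absolute_error:.3f}, relative error = {relative_error:.3f}")
```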


Regression Analysis

Qualitative Verification

All of these methods assume that the error ε is normally distributed

• Check by looking at plot of residuals

• Residuals should be randomly distributed around axis r = 0


Model Fitting

In Excel

Add trendline – Excel does everything for you

• Only works if you want to use an available function

Goal seek

• Only works for unconstrained, one parameter models

Solver

• Can use for constrained, multiple parameter models

• Uses Quasi-Newton or conjugate gradient method

In MATLAB

Built-in functions

• Root finding by bisection, secant, and inverse quadratic interpolation (fzero)

• Nelder-Mead simplex (fminsearch)

Optimization toolboxes

• Levenberg-Marquardt (lsqnonlin, lsqcurvefit); Quasi-Newton (fminunc or fmincon)

• Simulated annealing

• Genetic algorithm (GA)

Curve fitting toolbox

Custom algorithm

All methods work by minimizing some error
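To make "minimizing some error" concrete, here is a sketch of a one-parameter fit. It is not what Excel or MATLAB do internally: it swaps in a golden-section search, chosen only because it is short, to minimize the sum of squared errors for a hypothetical model y = exp(-k t). The data are synthetic and noise-free, generated with k = 0.5, and all names are invented for this example.

```python
import math

def sum_squared_error(k, t_data, y_data):
    """Error to minimize: squared difference between the data and the model exp(-k t)."""
    return sum((y - math.exp(-k * t)) ** 2 for t, y in zip(t_data, y_data))

def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Find the minimizer of a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        # Keep the sub-interval that must contain the minimum
        if f(c) < f(d):
            hi = d
        else:
            lo = c
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2

# Synthetic, noise-free "measurements" generated with k = 0.5
t_data = [0.0, 1.0, 2.0, 3.0, 4.0]
y_data = [math.exp(-0.5 * t) for t in t_data]

k_fit = golden_section_minimize(lambda k: sum_squared_error(k, t_data, y_data), 0.0, 5.0)
print(f"fitted k = {k_fit:.4f}")
```

Models with several parameters need a multidimensional method such as the ones listed above, but the structure is the same: define an error function, then hand it to a minimizer.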


Model Fitting in Excel


Model Fitting in MATLAB


Outlier Rejection

What is an outlier?

An outlier is a data point which disagrees with the other data and cannot be reproduced

Caused by measurement error, incorrect value of independent variable (i.e. user error), noise, chance, or lack of control or understanding of the process

Example:

y(x) = [1.2, 1.3, 5.0, 1.1, 1.2]^T

μ = 1.96; σ = 1.70

When is a point an outlier?

Dixon’s Q Test

• Very simple – just look up a value in a table to see if it’s an outlier

Chauvenet’s Criterion

• Simple, less rigorous

• If p(xi)<1/(2n), throw it out

Grubbs’ Test; Peirce’s Criterion

• Both utilize more rigorous methods

• See paper

Without outlier: μ = 1.20; σ = 0.08
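Chauvenet's criterion can be applied to the example data in a few lines. This sketch assumes that p(xi) means the two-sided Gaussian tail probability of a deviation at least as large as that of xi, using the sample mean and sample standard deviation; the function name is made up.

```python
import math
from statistics import mean, stdev

def chauvenet_outliers(data):
    """Return the points whose two-sided tail probability is below 1/(2n)."""
    n = len(data)
    mu = mean(data)
    sigma = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    threshold = 1 / (2 * n)

    outliers = []
    for value in data:
        z = abs(value - mu) / sigma
        # Probability of a deviation at least this large under a Gaussian
        tail_probability = math.erfc(z / math.sqrt(2))
        if tail_probability < threshold:
            outliers.append(value)
    return outliers

data = [1.2, 1.3, 5.0, 1.1, 1.2]
flagged = chauvenet_outliers(data)
kept = [v for v in data if v not in flagged]

print(f"all data:        mean = {mean(data):.2f}, std = {stdev(data):.2f}")
print(f"flagged:         {flagged}")
print(f"without outlier: mean = {mean(kept):.2f}, std = {stdev(kept):.2f}")
```

The criterion is meant to be applied once; repeating it on the reduced data set can keep discarding valid points.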


What makes a good figure?

Clearly relates independent and dependent variables using axes and trend lines

• Units!!

• Proper scaling

– Use log scales if variable(s) vary over orders of magnitude

Symbols and text are large and easy to tell apart

Resolution is sufficiently high

Error bars (if applicable)

Efficient use of space

Utilizes significant figures appropriately

Compares data with applicable model predictions

Contains enough information to get the point(s) across, but not so much that the message is lost or confused

Captioned such that it is understood without reading the text


Reilly et al., Experimental Eye Research, 2008.

Presentation of Data

Which of these figures is better?

[Figure: example figures for comparison, one of them from a journal article which was rejected. Annotations on the figures: ambiguous, wasted space, significant figures, fuzzy text, error bars, goodness-of-fit, legend, units]


Presentation of Data

Reilly et al., Biomacromolecules, 2008.

Tiffany and Koretz, International Journal of Biological Molecules, 2002.


Statistical Experimental Design

Design an experiment using statistical methods to minimize the number of data points required to get the desired information.

Analyze an experiment using statistical methods to maximize the information yield from any set of experiments.