Statistical & Uncertainty Analysis
UC Summer REU Programs
Instructor: Lilit Yeghiazarian, PhD
Environmental Engineering Program
Instructor
Dr. Lilit Yeghiazarian
Environmental Engineering Program
Office: 746 Engineering Research Center (ERC)
Email: [email protected]
Phone: 513-556-3623
Textbooks
Applied Numerical Methods with MATLAB for Engineers and Scientists, 3rd edition, S.C. Chapra, McGraw-Hill Companies, Inc., 2012
An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, 2nd edition, J.R. Taylor, University Science Books, Sausalito, CA
Outline for today
Error: numerical error; data uncertainty in measurement error
Statistics & Curve Fitting: mean; standard deviation; linear regression; t-test; ANOVA
Types of Error
General error (cannot blame the computer):
Blunders: human error
Formulation or model error: incomplete mathematical model
Data uncertainty: limited significant figures in physical measurements
Numerical error:
Round-off error (due to computer approximations)
Truncation error (due to mathematical approximations)
Gare Montparnasse, Paris, 1895
Accuracy and Precision
Figure 4.1, Chapra:
(a) inaccurate and imprecise
(b) accurate and imprecise
(c) inaccurate and precise
(d) accurate and precise
Note: Inaccuracy = bias; Imprecision = uncertainty
• Accuracy: how closely a computed/measured value agrees with true value
• Precision: how closely individual computed/measured values agree with each other
• Inaccuracy: systematic deviation from truth
• Imprecision: magnitude of scatter
Error, Accuracy and Precision
In this class, we use Error as a collective term for both the inaccuracy and the imprecision of our predictions.
Round-off Errors
Occur because digital computers have a limited ability to represent numbers: there are size and precision limits on the numbers a computer can store.
Some numerical manipulations are highly sensitive to round-off errors, arising from mathematical considerations and/or the performance of arithmetic operations on computers.
Computer Representation of Numbers
Numerical round-off errors are directly related to the way numbers are stored in a computer.
The fundamental unit whereby information is represented is a word; a word consists of a string of binary digits, or bits (off = "0", on = "1").
Numbers are stored in one or more words. For example, the magnitude of -173 could look like this in binary on a 16-bit computer:
(10101101)_2 = 2^7 + 2^5 + 2^3 + 2^2 + 2^0 = (173)_10
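The bit expansion above can be checked directly. A quick sketch in Python (the course itself uses MATLAB; this is only an illustration):

```python
# Sketch: verify that the binary string 10101101 equals 173 in decimal.
bits = "10101101"

# Sum 2**k for every bit that is "on", counting positions from the right.
value = sum(2 ** k for k, b in enumerate(reversed(bits)) if b == "1")

print(value)           # 173
print(int(bits, 2))    # the built-in conversion agrees: 173
```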
For 64-bit, IEEE double-precision format systems (as good as it gets on our PCs):
Largest magnitude: ±1.797693134862316 x 10^308 (beyond this: overflow)
Smallest magnitude: ±2.225073858507201 x 10^-308 (below this: underflow, a "hole" on either side of zero)
About 15 significant figures of precision
Implications of Finite Number of Bits (1): Range
There is a finite range of numbers a computer can represent.
Overflow error: bigger than the computer can handle; for double precision (MATLAB and Excel): > 1.7977 x 10^308
Underflow error: smaller than the computer can handle; for double precision (MATLAB and Excel): < 2.2251 x 10^-308
In MATLAB, you can set format long and use realmax and realmin to test your computer's range.
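The same range check can be done outside MATLAB. A sketch in Python, where sys.float_info is the closest analogue of realmax and realmin for IEEE double precision:

```python
import sys

# The IEEE double-precision limits quoted on the previous slide.
print(sys.float_info.max)      # ~1.7977e308, like MATLAB's realmax
print(sys.float_info.min)      # ~2.2251e-308, like MATLAB's realmin
print(sys.float_info.dig)      # 15 significant decimal digits

# Exceeding the range overflows to infinity rather than raising an error.
print(sys.float_info.max * 2)  # inf
```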
Implications of Finite Number of Bits (2): Precision
Some numbers cannot be expressed with a finite number of significant figures, e.g., π, e, √7.
Round-Off Error and Common Arithmetic Operations
Addition: the mantissa of the number with the smaller exponent is modified so both exponents match and the decimal points are aligned; the result is chopped.
Example, on a hypothetical computer with a 4-digit mantissa and 1-digit exponent:
1.557 + 0.04341 = 0.1557 x 10^1 + 0.004341 x 10^1 (so they have the same exponent)
= 0.160041 x 10^1 = 0.1600 x 10^1 (chopped to the 4-digit mantissa)
Subtraction: similar to addition, but the sign of the subtrahend is reversed. Severe loss of significance occurs during subtraction of nearly equal numbers (subtractive cancellation), one of the biggest sources of round-off error in numerical methods.
Round-Off Error and Large Computations
Even though an individual round-off error may be small, the cumulative effect over the course of a large computation can be significant, particularly when there are large numbers of computations and the computations are interdependent (later calculations depend on the results of earlier ones).
Particular Problems Arising from Round-Off Error (1)
Adding a small and a large number
A common problem in summing infinite series (like the Taylor series), where the initial terms are large compared to the later terms.
Mitigate by summing in reverse order, so each new term is comparable in size to the accumulated sum (add the small numbers first).
Subtractive cancellation
Round-off error induced by subtracting two nearly equal floating-point numbers; e.g., finding the roots of a quadratic equation or parabola.
Mitigate by using an alternative formulation of the model to minimize the problem.
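The quadratic-root case can be demonstrated concretely. A sketch in Python with made-up coefficients chosen so the roots differ by many orders of magnitude:

```python
from math import sqrt

# Sketch: subtractive cancellation in the quadratic formula for
# x^2 - 1e8*x + 1 = 0, whose roots are near 1e8 and 1e-8.
a, b, c = 1.0, -1e8, 1.0
d = sqrt(b * b - 4 * a * c)

# Naive formula: the small root subtracts two nearly equal numbers.
x_small_naive = (-b - d) / (2 * a)

# Alternative formulation: compute the large root first, then use the
# product of the roots (x1 * x2 = c/a) to recover the small one.
x_large = (-b + d) / (2 * a)
x_small_stable = c / (a * x_large)

print(x_small_naive)    # noticeably off from the true root 1e-8
print(x_small_stable)   # 1e-08
```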
Particular Problems Arising from Round-Off Error (2)
Smearing
Occurs when individual terms in a summation are larger than the summation itself (positive and negative numbers in the summation).
Really a form of subtractive cancellation; mitigate by using an alternative formulation of the model to minimize the problem.
Inner products, such as
x_1*y_1 + x_2*y_2 + ... + x_n*y_n = Σ x_i*y_i   (i = 1 to n)
A common problem in the solution of simultaneous linear algebraic equations.
Use double precision to mitigate the problem (MATLAB does this automatically).
Truncation Errors
Occur when exact mathematical formulations are represented by approximations
Example: Taylor series
Taylor series expansions, where h = x_{i+1} - x_i:
0th order: f(x_{i+1}) ≈ f(x_i)
1st order: f(x_{i+1}) ≈ f(x_i) + f'(x_i)h
2nd order: f(x_{i+1}) ≈ f(x_i) + f'(x_i)h + (f''(x_i)/2!)h^2
3rd order: f(x_{i+1}) ≈ f(x_i) + f'(x_i)h + (f''(x_i)/2!)h^2 + (f'''(x_i)/3!)h^3
4th order: f(x_{i+1}) ≈ f(x_i) + f'(x_i)h + (f''(x_i)/2!)h^2 + (f'''(x_i)/3!)h^3 + (f''''(x_i)/4!)h^4
Taylor series widely used to express functions in an approximate fashion
Taylor’s Theorem: Any smooth function can be approximated as a polynomial
Each term adds more information. For example, for
f(x) = -0.1x^4 - 0.15x^3 - 0.5x^2 - 0.25x + 1.2
expanded about x_i = 0 with h = 1 (Figure 4.6, Chapra, p. 93):
0th order: f(1) ≈ 1.2
1st order: f(1) ≈ 1.2 - 0.25(1) = 0.95
2nd order: f(1) ≈ 1.2 - 0.25(1) - (1.0/2!)(1)^2 = 0.45
3rd order: f(1) ≈ 1.2 - 0.25(1) - (1.0/2!)(1)^2 - (0.9/3!)(1)^3 = 0.3
4th order: f(1) ≈ 1.2 - 0.25(1) - (1.0/2!)(1)^2 - (0.9/3!)(1)^3 - (2.4/4!)(1)^4 = 0.2, the exact value
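These partial sums can be reproduced programmatically. A sketch in Python (the course uses MATLAB), with the derivatives of f(x) = -0.1x^4 - 0.15x^3 - 0.5x^2 - 0.25x + 1.2 at x_i = 0 computed by hand:

```python
from math import factorial

# Derivatives of f at 0: f = 1.2, f' = -0.25, f'' = -1.0, f''' = -0.9,
# f'''' = -2.4. Expanding about x_i = 0 with step h = 1.
derivs = [1.2, -0.25, -1.0, -0.9, -2.4]
h = 1.0

estimates = []
total = 0.0
for order, d in enumerate(derivs):
    total += d * h ** order / factorial(order)   # add the next Taylor term
    estimates.append(total)
    print(f"order {order}: f(1) = {total:.4f}")
# Successive estimates: 1.2, 0.95, 0.45, 0.3, 0.2. The 4th-order expansion
# of a 4th-degree polynomial is exact.
```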
Total Numerical Error
The sum of round-off error and truncation error.
As the step size decreases, the number of computations increases, so round-off error increases (e.g., due to subtractive cancellation or the large number of computations), while truncation error decreases.
The point of diminishing returns is reached when round-off error begins to negate the benefits of step-size reduction: there is a trade-off here. (Figure 4.10, Chapra, p. 104)
Control of Numerical Errors
Relies on the experience and judgment of the engineer. Practical programming guidelines:
Avoid subtracting two nearly equal numbers.
Sort the numbers and work with the smallest numbers first.
Use theoretical formulations to predict total numerical errors when possible (small-scale tasks).
Check results by substituting them back into the original model to see if they actually make sense.
Perform numerical experiments to increase awareness: change the step size or method to cross-check, or have two independent groups perform the same calculations.
Measurements & Uncertainty
Errors as Uncertainties
In scientific measurement, error means the inevitable uncertainty that accompanies all measurements.
As such, errors are not mistakes; you cannot eliminate them by being very careful.
The best we can hope to do is to ensure that errors are as small as reasonably possible.
In this section, the words error and uncertainty are used interchangeably.
Inevitability of Uncertainty
A carpenter wants to measure the height of a doorway before installing a door.
First rough measurement: 210 cm. If pressed, the carpenter might admit that the height is anywhere between 205 and 215 cm.
For a more precise measurement, he uses a tape measure: 211.3 cm. But how can he be sure it's not 211.3001 cm? Use an even more precise tape?
Measuring Length with Ruler
Note: markings are 1 mm apart.
Best estimate of length = 82.5 mm
Probable range: 82 to 83 mm
We have measured the length to the nearest millimeter.
How To Report & Use Uncertainties
Best estimate ± uncertainty. In general, the result of any measurement of a quantity x is stated as
(measured value of x) = x_best ± Δx
Δx is called the uncertainty, or error, or margin of error; Δx is always positive.
Basic Rules About Uncertainty
Δx cannot be known or stated with too much precision; it cannot conceivably be known to 4 significant figures.
Rule for stating uncertainties: experimental uncertainties should almost always be rounded to one significant figure.
Example: if some calculation yields Δx = 0.02385, it should be rounded to Δx = 0.02.
Basic Rules About Uncertainty
Rule for stating answers: the last significant figure in any stated answer should usually be of the same order of magnitude (in the same decimal position) as the uncertainty.
Examples:
The answer 92.81 with uncertainty 0.3 should be rounded as 92.8 ± 0.3
If the uncertainty is 3, then the answer should be rounded as 93 ± 3
If the uncertainty is 30, then the answer should be rounded as 90 ± 30
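Both rules can be automated. A sketch in Python with a hypothetical helper (not from Taylor's book) that rounds the uncertainty to one significant figure and the value to the same decimal position:

```python
from math import floor, log10

def report(x, dx):
    """Round uncertainty dx to one significant figure and x to match."""
    digits = -floor(log10(dx))   # decimal position of dx's leading figure
    return round(x, digits), round(dx, digits)

print(report(92.81, 0.3))      # (92.8, 0.3)
print(report(92.81, 3))        # (93.0, 3)
print(report(92.81, 30))       # (90.0, 30)
print(report(92.81, 0.02385))  # (92.81, 0.02)
```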
Propagation Of Uncertainty
Statistics & Curve Fitting
Curve Fitting
Could plot the points and sketch a curve that visually conforms to the data. Three different approaches are shown in Figure PT4.1, Chapra:
a) Least-squares regression for data with scatter (covered)
b) Linear interpolation for precise data
c) Curvilinear interpolation for precise data
Curve Fitting and Engineering Practice
Estimation of intermediate numbers from tables in design handbooks → interpolation
Trend analysis: use the pattern of the data to make predictions. Imprecise or "noisy" data → regression (least-squares); precise data → interpolation (interpolating polynomials)
Hypothesis testing: compare an existing mathematical model with measured data; determine unknown model coefficient values, or compare predicted values with observed values to test model adequacy
You’ve Got a Problem …
Wind tunnel data relate the force of air resistance (F) to wind velocity (v) for our friend the bungee jumper (Figures 13.1 and 13.2, Chapra … especially if you are this guy). The data can be used to discover the relationship and find a drag coefficient (cd):
As v increases, F increases
The data are not smooth, especially at higher v’s
If F = 0 at v = 0, then the relationship may not be linear
How do we fit the “best” line or curve to these data?
Before We Can Discuss Regression Techniques … We Need To Review
basic terminology and descriptive statistics for talking about sets of data
Basic Terminology (data from Table 13.3)
Individual data points, y_i: y_1 = 6.395, y_2 = 6.435, …, y_24 = 6.775
Maximum? 6.775   Minimum? 6.395
Range? 6.775 - 6.395 = 0.380
Number of observations? n = 24
Degrees of freedom? n - 1 = 23
Residual? y_i - ȳ
Use Descriptive Statistics To Characterize Data Sets:
Location of the center of the distribution of the data: arithmetic mean; median (midpoint of data, or 50th percentile); mode (value that occurs most frequently)
Degree of spread of the data set: standard deviation; variance; coefficient of variation (c.v.)
Arithmetic Mean (data from Table 13.3)
ȳ = (Σ y_i) / n = 158.4 / 24 = 6.6
Standard Deviation (data from Table 13.3)
s_y = sqrt( S_t / (n - 1) ) = sqrt( Σ(y_i - ȳ)^2 / (n - 1) ) = sqrt( 0.217000 / (24 - 1) ) = 0.097133
S_t: total sum of squares of the residuals between the data points and the mean
Variance (data from Table 13.3)
s_y^2 = S_t / (n - 1) = Σ(y_i - ȳ)^2 / (n - 1) = [ Σy_i^2 - (Σy_i)^2 / n ] / (n - 1) = 0.217000 / (24 - 1) = 0.009435
Coefficient of Variation (c.v.) (data from Table 13.3)
c.v. = (s_y / ȳ) x 100% = (0.097133 / 6.6) x 100% = 1.47%
c.v. = standard deviation / mean: a normalized measure of spread
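These descriptive statistics are easy to compute by hand or in code. A sketch in Python (the course uses MATLAB), with made-up data rather than the Table 13.3 values:

```python
import statistics

# Hypothetical sample, chosen so the mean comes out to a round number.
y = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(y)
mean = statistics.mean(y)      # ȳ = Σy_i / n = 40 / 8 = 5
var = statistics.variance(y)   # sample variance, S_t / (n - 1)
std = statistics.stdev(y)      # square root of the sample variance
cv = std / mean * 100          # coefficient of variation, in percent

print(n, mean, round(var, 4), round(std, 4), round(cv, 2))
```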
Histogram of the data (Figure 12.4, Chapra; data from Table 13.3)
For a large set of data, the histogram can be approximated by a smooth, symmetric, bell-shaped curve → the normal distribution.
Confidence Intervals
If a data set is normally distributed, ~68% of the total measurements will fall within the range ȳ - s_y to ȳ + s_y.
Similarly, ~95% of the total measurements will be encompassed by the range ȳ - 2s_y to ȳ + 2s_y.
Descriptive Statistics in MATLAB
>> % s holds data from Table 13.2
>> s=[6.395;6.435;6.485;…;6.775]
>> [n,x]=hist(s)
n = 1 1 3 1 4 3 5 2 2 2
x = 6.414 6.452 6.49 6.528 6.566 6.604 6.642 6.68 6.718 6.756
>> min(s), max(s)
ans = 6.395
ans = 6.775
>> range=max(s)-min(s)
range = 0.38
>> mean(s), median(s), mode(s)
ans = 6.6
ans = 6.61
ans = 6.555
>> var(s), std(s)
ans = 0.0094348
ans = 0.097133
(n is the number of elements in each bin; x is a vector specifying the midpoint of each bin)
Back to the Bungee Jumper Wind Tunnel Data … is the mean a good fit to the data? (Figures 12.1, 13.2, and 13.8a, Chapra)
velocity, m/s   Force, N
10              25
20              70
30              380
40              550
50              610
60              1220
70              830
80              1450
mean force: 642
Not very! The distribution of the residuals is large.
Curve Fitting Techniques
Least-squares regression: linear; polynomial; general linear least-squares; nonlinear
Interpolation (not covered): polynomial; splines
We can reduce the distribution of the residuals by using curve-fitting techniques such as linear least-squares regression (Figure 13.8b, Chapra; Figure 12.8, Chapra, p. 209).
Linear Least-Squares Regression
Linear least-squares regression, the simplest example of a least-squares approximation, is fitting a straight line to a set of paired observations:
(x1, y1), (x2, y2), …, (xn, yn)
Mathematical expression for a straight line:
y = a0 + a1*x + e
where a0 is the intercept, a1 is the slope, and e is the error or residual.
Least-Squares Regression: Important Statistical Assumptions
Each x has a fixed value; it is not random and it is known without error. This means that the x values must be error-free, and the regression of y versus x is not the same as the regression of x versus y.
The y values are independent random variables and all have the same variance.
The y values for a given x must be normally distributed.
Residuals in Linear Regression
Regression line is a measure of central tendency for paired observations (i.e., data points)
Residuals (ei) in linear regression represent the vertical distance between a data point and the regression line
Figure 13.7, Chapra
How to Get the “Best” Fit:
Minimize the sum of the squares of the residuals between the measured y and the y calculated with the (linear) model:
S_r = Σ e_i^2 = Σ (y_i,measured - y_i,model)^2 = Σ (y_i - a0 - a1*x_i)^2   (sums from i = 1 to n)
This criterion yields a unique line for a given dataset.
How do we compute the best a0 and a1 in S_r = Σ (y_i - a0 - a1*x_i)^2 ?
One way is to use optimization techniques, since we are looking for a minimum (more common for the nonlinear case) … or …
Another way is to solve the normal equations for a0 and a1, according to the derivation in the next few slides.
Derivation of Normal Equations Used to Solve for a0 and a1
S_r = Σ (y_i - a0 - a1*x_i)^2
First, differentiate the sum of the squares of the residuals with respect to each unknown coefficient.
Derivation of Normal Equations Used to Solve for a0 and a1 - continued
Set the derivatives equal to zero (this will result in a minimum S_r):
∂S_r/∂a0 = -2 Σ (y_i - a0 - a1*x_i) = 0
∂S_r/∂a1 = -2 Σ [(y_i - a0 - a1*x_i) x_i] = 0
These can be expressed as:
0 = Σ y_i - Σ a0 - Σ a1*x_i
0 = Σ x_i*y_i - Σ a0*x_i - Σ a1*x_i^2
Derivation of Normal Equations Used to Solve for a0 and a1 - continued
Realizing that Σ a0 = n*a0, we can express these equations as a set of two simultaneous linear equations with two unknowns (a0 and a1), called the normal equations:
n*a0 + (Σ x_i)*a1 = Σ y_i
(Σ x_i)*a0 + (Σ x_i^2)*a1 = Σ x_i*y_i
Derivation of Normal Equations Used to Solve for a0 and a1 - continued
Normal equations, in matrix form:
[ n      Σ x_i   ] [ a0 ]   [ Σ y_i     ]
[ Σ x_i  Σ x_i^2 ] [ a1 ] = [ Σ x_i*y_i ]
Finally, we can solve these normal equations for a0 and a1:
a1 = ( n Σ x_i*y_i - Σ x_i Σ y_i ) / ( n Σ x_i^2 - (Σ x_i)^2 )
a0 = ȳ - a1*x̄
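The normal-equation solution can be exercised on the wind tunnel data. A sketch in Python (the course's own examples use MATLAB); the coefficients it should reproduce, 19.4702 and -234.2857, are the polyfit values quoted later in these slides:

```python
# Wind tunnel data for the bungee jumper.
x = [10, 20, 30, 40, 50, 60, 70, 80]
y = [25, 70, 380, 550, 610, 1220, 830, 1450]

n = len(x)
Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

# a1 = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²),  a0 = ȳ - a1*x̄
a1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
a0 = Sy / n - a1 * Sx / n

print(round(a1, 4), round(a0, 4))   # 19.4702 -234.2857
print(a0 + a1 * 45)                 # ≈ 641.875, the fitted force at v = 45 m/s
```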
Improvement Due to Linear Regression (Figure 13.8, Chapra; Figure 12.8, Chapra, p. 209)
(a) Spread of the data around the mean of the dependent variable (s_y)
(b) Spread of the data around the best-fit line (s_y/x)
The reduction in spread in going from (a) to (b), indicated by the bell-shaped curves, represents the improvement due to linear regression.
Improvement Due to Linear Regression
Sum of squares of the residuals around the best-fit regression line:
S_r = Σ (y_i - a0 - a1*x_i)^2
Total sum of squares around the mean of the dependent variable:
S_t = Σ (y_i - ȳ)^2
S_t - S_r quantifies the error reduction due to using the line instead of the mean. Because this difference is scale-sensitive, normalize by S_t to get the coefficient of determination:
r^2 = (S_t - S_r) / S_t
The coefficient of determination quantifies the improvement, or error reduction, due to describing the data in terms of a straight line rather than an average value.
Used for comparison of several regressions
A value of 0 represents no improvement
A value of 1 is a perfect fit; the line explains 100% of the data variability
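A sketch in Python of r^2 for the wind tunnel fit, using the polyfit coefficients quoted in these slides:

```python
# Wind tunnel data and the fitted line a0 + a1*v from the slides.
x = [10, 20, 30, 40, 50, 60, 70, 80]
y = [25, 70, 380, 550, 610, 1220, 830, 1450]
a0, a1 = -234.2857, 19.4702

ybar = sum(y) / len(y)
St = sum((yi - ybar) ** 2 for yi in y)                        # around the mean
Sr = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))  # around the line
r2 = (St - Sr) / St

print(round(r2, 4))   # ≈ 0.8805: the line explains roughly 88% of the variability
```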
“Goodness” of Fit (Figure 12.9, Chapra)
r^2 = (S_t - S_r) / S_t
Small residual errors: r^2 → 1
Large residual errors: r^2 << 1
Linearization of Nonlinear Relationships
What to do when the relationship is nonlinear? (Figure 14.1, Chapra)
One option is polynomial regression.
Another option is to linearize the data using transformation techniques: exponential, power, and saturation-growth-rate models.
Linearization Transformation Examples
Exponential model (Figure 13.11, Chapra):
y = α1 * e^(β1*x)
Used to characterize quantities that increase (+β1) or decrease (-β1) at a rate directly proportional to their own magnitude, e.g., population growth or radioactive decay.
Take the ln of both sides to linearize the data:
ln y = ln α1 + β1*x
Linearization Transformation Examples
Power model (Figure 13.11, Chapra):
y = α2 * x^β2
Widely applicable in all fields of engineering.
Take the log of both sides to linearize the data:
log y = log α2 + β2*log x
Linearization Transformation Examples
Saturation-growth-rate model (Figure 13.11, Chapra):
y = α3 * x / (β3 + x)
Used to characterize population growth under limiting conditions, or enzyme kinetics.
To linearize, invert the equation to give:
1/y = 1/α3 + (β3/α3)*(1/x)
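A sketch in Python of the power-model transformation, with made-up data that follow y = 2*x^1.5 exactly, so the log-log fit should recover α2 = 2 and β2 = 1.5:

```python
from math import log10

x = [1, 2, 4, 8, 16]
y = [2 * xi ** 1.5 for xi in x]   # hypothetical data: y = 2 * x^1.5

# Transform: log y = log α2 + β2*log x is a straight line in (log x, log y).
u = [log10(xi) for xi in x]
v = [log10(yi) for yi in y]

# Fit the line with the normal-equation formulas from the regression slides.
n = len(u)
num = n * sum(ui * vi for ui, vi in zip(u, v)) - sum(u) * sum(v)
den = n * sum(ui ** 2 for ui in u) - sum(u) ** 2
slope = num / den
intercept = sum(v) / n - slope * sum(u) / n

beta2 = slope           # the slope of the log-log line is the exponent
alpha2 = 10 ** intercept  # undo the log to recover the coefficient
print(round(beta2, 6), round(alpha2, 6))   # ≈ 1.5 and ≈ 2.0
```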
Linear Regression with MATLAB
polyfit can be used to determine the slope and y-intercept as follows (use 1 for linear, i.e., 1st order):
>> x=[10 20 30 40 50 60 70 80];
>> y=[25 70 380 550 610 1220 830 1450];
>> a=polyfit(x,y,1)
a = 19.4702 -234.2857
polyval can be used to compute a value using the coefficients as follows:
>> y = polyval(a,45)
y = 641.8750
Polynomial Regression
Extend the linear least-squares procedure to fit data to a higher-order polynomial (for data not suited to linear least-squares regression; Figure 14.1, Chapra).
For a quadratic (2nd-order polynomial), there is a system of 3 normal equations to solve instead of 2 as for linear.
For a general mth-order polynomial, there is a system of m+1 normal equations to solve.
Nonlinear Regression (not covered)
There are cases in engineering where nonlinear models (models that have a nonlinear dependence on their parameters) must be fit to data. For example:
f(x) = a0*(1 - e^(-a1*x)) + e
Like linear models, we still minimize the sum of the squares of the residuals; it is most convenient to do this with optimization.
More Statistics: Comparing 2 Means
Whether two means differ depends on both the means and the amount of variability: you can tell there is a difference when variability is low.
Use a t-test to do this mathematically.
Source: http://www.socialresearchmethods.net/kb/stat_t.php
The t-test Is A Ratio Of “Signal To Noise”
t = (mean1 - mean2) / SE, where SE = sqrt(var1/n1 + var2/n2) is the standard error of the difference between the means.
Remember, variance (var) is just the square of the standard deviation.
Source: http://www.socialresearchmethods.net/kb/stat_t.php
How It Works ... Once You Have A t-value
Set a risk level (called the alpha level); a typical value is 0.05, which means that five times out of a hundred you would find a statistically significant difference between the means even if there was none (i.e., by "chance").
Degrees of freedom (df) for the test = the total number of data points in both groups (N) minus 2.
Given the alpha level, the df, and the t-value, use a t-test table or a computer program to determine whether the t-value is large enough to be significant.
If the calculated t is larger than the tabulated t(alpha, N-2), you can conclude that the means of the two groups are different (even given the variability).
Source: http://www.socialresearchmethods.net/kb/stat_t.php
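The "signal over noise" ratio can be written out directly. A sketch in Python with two made-up groups, using the unpooled standard error sqrt(var1/n1 + var2/n2):

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(g1, g2):
    """t = (difference of means) / (standard error of that difference)."""
    se = sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))  # the "noise"
    return (mean(g1) - mean(g2)) / se                           # signal / noise

# Hypothetical groups chosen so the arithmetic comes out clean.
group1 = [1, 2, 3, 4, 5]
group2 = [2, 3, 4, 5, 6]
t = t_statistic(group1, group2)
df = len(group1) + len(group2) - 2

print(t, df)   # -1.0 8
```

With |t| = 1.0 on 8 degrees of freedom, the difference would not be significant at alpha = 0.05.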
What About More Than 2 Sets Of Data?
ANOVA (Analysis of Variance) is commonly used for more than 2, if:
the k samples are random and independently selected
treatment responses are normally distributed
treatment responses have equal variances
ANOVA compares the variation between groups of data to the variation within groups.
Source: http://www.uwlax.edu/faculty/baggett/Math_145/HANDOUTS/anova.pdf
Steps for ANOVA
Define the k population or treatment means being compared in the context of the problem.
Set up the hypotheses to be tested: H0: all means are equal; Ha: not all means are equal (no claim as to which ones differ).
Choose a risk level, alpha (0.05 is a typical level).
Check whether the assumptions are reasonable (previous slide).
Calculate the test statistic ... pretty involved ... see the next page!
Note: usually you use a computer program, but it is helpful to know what the computer is doing ... you can do simple problems by hand.
Source: http://www.uwlax.edu/faculty/baggett/Math_145/HANDOUTS/anova.pdf
ANOVA calculations
Collect from the data set the group sizes, group means, and group variances, then fill out an ANOVA table to compute the F ratio statistic:
F = [between-group variation / (k - 1)] / [within-group variation / (N - k)]
Source: http://www.uwlax.edu/faculty/baggett/Math_145/HANDOUTS/anova.pdf
Now what do we do with the F statistic? Compare it to an F distribution, like we did with the t-test. This time we need:
alpha
df of the numerator (k - 1)
df of the denominator (N - k)
to look up F(1-alpha)(k-1, N-k).
If F ≥ F(1-alpha)(k-1, N-k), that is evidence against H0: reject H0.
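The between/within decomposition can be done by hand for a small problem. A sketch in Python with three made-up groups, chosen so the sums of squares come out to whole numbers:

```python
from statistics import mean

# Hypothetical treatment groups.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]

k = len(groups)                       # number of groups
N = sum(len(g) for g in groups)       # total number of observations
grand = mean(x for g in groups for x in g)

# Between-group sum of squares: how far each group mean is from the grand mean.
ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
# Within-group sum of squares: scatter of points around their own group mean.
ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)

F = (ssb / (k - 1)) / (ssw / (N - k))
print(F, (k - 1, N - k))   # 3.0 with (2, 6) degrees of freedom
```

F would then be compared against F(1-alpha)(2, 6) from a table or a computer program.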
The End