This webinar provides an overview of the tools that SciPy and NumPy provide for regression analysis, including linear and non-linear least squares, and takes a brief look at handling other error metrics. We also demonstrate simple GUI tools that can make some problems easier, and give a quick overview of the new scikits package statsmodels, whose API is maturing in a separate package but which should be incorporated into SciPy in the future.
Curve-fitting (regression) with Python
September 18, 2009
Enthought Consulting
Enthought Training Courses
Python Basics, NumPy, SciPy, Matplotlib, Traits, TraitsUI, Chaco…
Enthought Python Distribution (EPD)
http://www.enthought.com/products/epd.php
Data Model

    y = mx + b
    m = 4.316, b = 2.763

    y = a / (b + c e^(-dx))
    a = 7.06, b = 2.52, c = 26.14, d = -5.57
Curve Fitting or Regression?
Carl Gauss, Adrien-Marie Legendre, Francis Galton, R.A. Fisher
or (my preferred) ... Bayesian Inference
Laplace, Bayes, Richard T. Cox, Harold Jeffreys, Edwin T. Jaynes
    p(X|Y) = p(Y|X) p(X) / p(Y)
           = p(Y|X) p(X) / ∫ p(Y|X) p(X) dX

where Y is the data, X the unknowns, p(Y|X) the model, p(X) the prior, and p(X|Y) the inference (posterior).
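The rule above can be exercised numerically. The following sketch is not from the slides: the coin-flip data and the parameter grid are invented for illustration, and the integral in the denominator is approximated by a sum over the grid.

```python
import numpy as np

# Grid approximation of Bayes' rule for the bias theta of a coin
# (data and grid invented for illustration).
theta = np.linspace(0.0, 1.0, 1001)          # the unknowns X
prior = np.ones_like(theta)                  # flat prior p(X)
heads, tosses = 7, 10                        # the data Y
likelihood = theta**heads * (1 - theta)**(tosses - heads)   # p(Y|X)

# p(X|Y) = p(Y|X) p(X) / normalization
# (the grid sum stands in for the integral over X)
posterior = likelihood * prior
posterior /= posterior.sum()

map_estimate = theta[np.argmax(posterior)]   # mode of the posterior
```

With a flat prior the posterior mode coincides with the maximum-likelihood estimate, heads/tosses = 0.7.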
More pedagogy

Curve Fitting:
- understated statistical model
- just want the "best" fit to the data

Regression / Parameter Estimation / Bayesian Inference:
- the statistical model is more important
- post-estimation analysis of error and fit

Machine Learning
Pragmatic look at the methods
• Because the concept is really at the heart of science, many practical methods have been developed.
• SciPy contains the building blocks to implement basically any method.
• SciPy should get high-level interfaces to all the methods in common use.
Methods vary in...
• The model used:
  – parametric (specific model)
  – non-parametric (many unknowns)
• The way error is modeled:
  – few assumptions (e.g. zero-mean, homoscedastic)
  – full probabilistic model
• What "best fit" means (i.e. the distance between the predicted and the measured):
  – traditional least-squares
  – robust methods (e.g. absolute difference)
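The last distinction is easy to demonstrate. This is a minimal sketch, not from the slides: the line, noise, and outlier are made up, and `scipy.optimize.minimize` is used to fit under both the least-squares and absolute-difference criteria.

```python
import numpy as np
from scipy.optimize import minimize

# Line fit under two "distances": sum of squared errors vs. sum of
# absolute differences. Data and the outlier are invented.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)
y[-1] += 20.0                                  # one gross outlier

def sse(p):   # traditional least squares
    return np.sum((y - (p[0] * x + p[1]))**2)

def sad(p):   # robust: absolute difference
    return np.sum(np.abs(y - (p[0] * x + p[1])))

slope_l2 = minimize(sse, [1.0, 0.0]).x[0]
slope_l1 = minimize(sad, [1.0, 0.0], method="Nelder-Mead").x[0]
# slope_l2 is dragged upward by the outlier; slope_l1 stays near 2.
```

Nelder-Mead is used for the absolute-difference objective because it does not require the (non-smooth) objective to be differentiable.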
    y = f(x; θ)          (general parametric model)

    y = Σᵢ θᵢ φᵢ(x)      (model linear in the parameters θ)

    y = ŷ + ε            (measurement = model prediction + noise)
Parametric Least Squares
    y = [y₀, y₁, ..., y_{N−1}]
    x = [x₀, x₁, ..., x_{N−1}]
    θ = [θ₀, θ₁, ..., θ_{K−1}],   K < N

    y = f(x; θ) + ε

    θ̂ = argmin_θ J(y, x, θ)

    θ̂ = argmin_θ (y − f(x; θ))ᵀ W (y − f(x; θ))
Linear Least Squares
    y = H(x) θ + ε

    θ̂ = (H(x)ᵀ W H(x))⁻¹ H(x)ᵀ W y
Quadratic Example:

    yᵢ = a xᵢ² + b xᵢ + c

In matrix form, y = H(x) θ + ε with θ = [a, b, c]ᵀ and

    H(x) = [ x₀²        x₀        1
             x₁²        x₁        1
             ⋮          ⋮         ⋮
             x_{N−1}²   x_{N−1}   1 ]
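In current NumPy this closed form is usually reached through a least-squares solver rather than an explicit matrix inverse. A sketch of the quadratic example (the data are made up; `numpy.linalg.lstsq` solves min ‖y − Hθ‖², i.e. the slide's formula with W = I):

```python
import numpy as np

# Quadratic example as *linear* least squares: build H(x), then solve.
rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 50)
y = 3.0 * x**2 - 1.0 * x + 0.5 + 0.1 * rng.standard_normal(x.size)

# Each row of H(x) is [x_i**2, x_i, 1], exactly as in the slide.
H = np.column_stack([x**2, x, np.ones_like(x)])

# Solves min ||y - H theta||^2 (equivalent to (H^T H)^-1 H^T y, W = I).
theta, *_ = np.linalg.lstsq(H, y, rcond=None)
a, b, c = theta
```

The recovered (a, b, c) should be close to the generating values (3, −1, 0.5) up to the injected noise.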
Non-linear least squares
    θ̂ = argmin_θ J(y, x, θ)

    θ̂ = argmin_θ (y − f(x; θ))ᵀ W (y − f(x; θ))
Logistic Example — an optimization problem!

    yᵢ = a / (b + c e^(−d xᵢ))
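For a model like this there is no closed form, so the fit is done iteratively. A sketch using `scipy.optimize.curve_fit` (which the deck introduces below) with invented parameter values; note that a, b, c in this model are only identifiable up to a common scale factor, so the check compares fitted and true curves rather than individual parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

# The slide's logistic-style model; parameter values here are invented.
def model(x, a, b, c, d):
    return a / (b + c * np.exp(-d * x))

rng = np.random.default_rng(2)
x = np.linspace(0.0, 4.0, 60)
true = (7.0, 2.5, 26.0, 5.5)
y = model(x, *true) + 0.02 * rng.standard_normal(x.size)

# Non-linear least squares needs a starting guess.
p0 = [5.0, 1.0, 10.0, 3.0]
params, pcov = curve_fit(model, x, y, p0=p0, maxfev=10000)

fitted = model(x, *params)   # should track the true curve closely
```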
Tools in NumPy / SciPy
• polyfit (linear least squares)
• curve_fit (non-linear least squares)
• poly1d (polynomial object)
• numpy.random (random number generators)
• scipy.stats (distribution objects)
• scipy.optimize (unconstrained and constrained optimization)
Polynomials
• p = poly1d(<coefficient array>)
• p.roots (p.r) are the roots
• p.coefficients (p.c) are the coefficients
• p.order is the order
• p[n] is the coefficient of x**n
• p(val) evaluates the polynomial at val
• p.integ() integrates the polynomial
• p.deriv() differentiates the polynomial
• Basic numeric operations (+,-,/,*) work
• Acts like p.c when used as an array
• Fancy printing
>>> p = poly1d([1,-2,4])
>>> print p
   2
x - 2 x + 4

>>> g = p**3 + p*(3-2*p)
>>> print g
 6     5      4      3      2
x - 6 x + 22 x - 48 x + 75 x - 70 x + 44

>>> print g.deriv(m=2)
    4       3       2
30 x - 120 x + 264 x - 288 x + 150

>>> print p.integ(m=2,k=[2,1])
         4          3     2
0.08333 x - 0.3333 x + 2 x + 2 x + 1

>>> print p.roots
[ 1.+1.7321j  1.-1.7321j]

>>> print p.coeffs
[ 1 -2  4]
Statistics: scipy.stats — CONTINUOUS DISTRIBUTIONS

over 80 continuous distributions!

METHODS: rvs, cdf, ppf, sf, isf, stats, fit, entropy, nnlf, moment, freeze
Using stats objects
>>> from scipy.stats import norm
# Sample normal dist. 100 times.
>>> samp = norm.rvs(size=100)

>>> x = linspace(-5, 5, 100)
# Calculate probability density.
>>> pdf = norm.pdf(x)
# Calculate cumulative dist.
>>> cdf = norm.cdf(x)
# Calculate percent point function (inverse cdf; takes probabilities in [0, 1]).
>>> ppf = norm.ppf(linspace(0.01, 0.99, 100))
Setting location and scale

>>> from scipy.stats import norm
# Normal dist with mean=10 and std=2
>>> dist = norm(loc=10, scale=2)

>>> x = linspace(-5, 15, 100)
# Calculate probability density.
>>> pdf = dist.pdf(x)
# Calculate cumulative dist.
>>> cdf = dist.cdf(x)

# Get 100 random samples from dist.
>>> samp = dist.rvs(size=100)

# Estimate parameters from data
>>> mu, sigma = norm.fit(samp)
>>> print "%4.2f, %4.2f" % (mu, sigma)
10.07, 1.95

NORMAL DISTRIBUTION

.fit returns the best shape parameters plus (loc, scale) that explain the data
Fitting Polynomials (NumPy)
>>> from numpy import polyfit, poly1d
>>> from scipy.stats import norm
# Create clean data.
>>> x = linspace(0, 4.0, 100)
>>> y = 1.5 * exp(-0.2 * x) + 0.3
# Add a bit of noise.
>>> noise = 0.1 * norm.rvs(size=100)
>>> noisy_y = y + noise

# Fit noisy data with a linear model.
>>> linear_coef = polyfit(x, noisy_y, 1)
>>> linear_poly = poly1d(linear_coef)
>>> linear_y = linear_poly(x)

# Fit noisy data with a quadratic model.
>>> quad_coef = polyfit(x, noisy_y, 2)
>>> quad_poly = poly1d(quad_coef)
>>> quad_y = quad_poly(x)

POLYFIT(X, Y, DEGREE)
Optimization
scipy.optimize — Unconstrained Minimization and Root Finding

Unconstrained Optimization
• fmin (Nelder-Mead simplex)
• fmin_powell (Powell's method)
• fmin_bfgs (BFGS quasi-Newton method)
• fmin_ncg (Newton conjugate gradient)
• leastsq (Levenberg-Marquardt)
• anneal (simulated annealing global minimizer)
• brute (brute force global minimizer)
• brent (excellent 1-D minimizer)
• golden
• bracket

Constrained Optimization
• fmin_l_bfgs_b
• fmin_tnc (truncated Newton code)
• fmin_cobyla (constrained optimization by linear approximation)
• fminbound (interval-constrained 1-D minimizer)

Root Finding
• fsolve (using MINPACK)
• brentq
• brenth
• ridder
• newton
• bisect
• fixed_point (fixed point equation solver)
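A quick taste of two routines from the lists above, `fmin` and `brentq` (both still available in current SciPy). The Rosenbrock function and the cosine root are standard textbook illustrations, not examples from the slides.

```python
import numpy as np
from scipy.optimize import fmin, brentq

# Nelder-Mead simplex minimization of the Rosenbrock function,
# whose global minimum is at (1, 1).
def rosen(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

xmin = fmin(rosen, [0.0, 0.0], disp=False)   # approximately [1, 1]

# Bracketing root finder: cos(x) changes sign on [1, 2].
root = brentq(np.cos, 1.0, 2.0)              # approximately pi/2
```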
Optimization: Data Fitting
>>> from scipy.optimize import curve_fit
# Define the function to fit.
>>> def function(x, a, b, f, phi):
...     result = a * exp(-b * sin(f * x + phi))
...     return result

# Create a noisy data set.
>>> actual_params = [3, 2, 1, pi/4]
>>> x = linspace(0, 2*pi, 25)
>>> exact = function(x, *actual_params)
>>> noisy = exact + 0.3 * randn(len(x))

# Use curve_fit to estimate the function parameters from the noisy data.
>>> initial_guess = [1, 1, 1, 1]
>>> estimated_params, err_est = curve_fit(function, x, noisy, p0=initial_guess)
>>> estimated_params
array([3.1705, 1.9501, 1.0206, 0.7034])

# err_est is an estimate of the covariance matrix of the estimates
# (i.e. how good a fit it is)
NONLINEAR LEAST SQUARES CURVE FITTING
StatsModels
• Josef Perktold (Canada)
• Skipper Seabold (PhD student, American University, Washington, D.C.)
Economists
GUI example: astropysics (with TraitsUI)
Erik J. Tollerud (PhD student, UC Irvine, Center for Cosmology, Irvine, CA)
http://www.physics.uci.edu/~etolleru/
Scientific Python Classes
Sept 21-25: Austin
Oct 19-22: Silicon Valley
Nov 9-12: Chicago
Dec 7-11: Austin
http://www.enthought.com/training