Introduction Bayesian Stats About Stan Examples Tips and Tricks
Introduction to Stan
Cameron Bracken
University of Colorado Boulder
February 2015
What is Stan?
“A probabilistic programming language implementing full Bayesian statistical inference with MCMC sampling (NUTS, HMC) and penalized maximum likelihood estimation with optimization (L-BFGS)”
!"#$%&'()*+,
!"#$%&'( )$"( "%#*+"')( ,-./0"1)",( /'"( -2( #%1,-0( '%0&*+13( )-( '-*4"( %0%)$"0%)+.%*(&#-5*"0(6%'()$%)(-2(7-0&)"(,"(8/22-1( +1(9::;<(=1()$"(2-**-6+13(1">)()6-."1)/#+"'?()$+'()".$1+@/"($%,(%(1/05"#(-2(-)$"#(/'"'<((=1()$"(9ABC'?(D1#+.-(E"#0+(/'",(+))-('-*4"(&#-5*"0'( +1(1"/)#-1(&$F'+.'?(%*)$-/3$($"(1"4"#(&/5*+'$",($+'( #"'/*)'<( ( =1(G-'H*%0-'(,/#+13(I-#*,(I%#(==?(E"#0+(%*-13(6+)$(J)%1(K*%0?(L-$1(4-1(M"/0%11?(M+.$-*%'N")#-&-*+'?(%1,(-)$"#'(,+'./''",()$"(%&&*+.%)+-1(-2()$+'(')%)+')+.%*('%0&*+13()".$1+@/"()-)$"( &#-5*"0'( )$"F( 6"#"( 6-#O+13( -1<( ( K*%0( &-+1)",( -/)( )$"( /'"( -2( "*".)#-0".$%1+.%*.-0&/)"#'( )-( -4"#.-0"( )$"( *-13( %1,)",+-/'( 1%)/#"( -2( )$"( .%*./*%)+-1'?( %1,N")#-&-*+'(1%0",( )$+'(&#"4+-/'*F(/11%0",)".$1+@/"(PN-1)"(7%#*-Q(%2)"#(K*%0R'(/1.*"6$-( 5-##-6",( 0-1"F( 2#-0( #"*%)+4"'5".%/'"($"(ST/')($%,()-(3-()-(N-1)"(7%#*-QU)$"(3%05*+13(.%'+1-V<
W1( N%#.$( 99?( 9AX:?( L-$1( 4-1M"/0%11('"1)(%(*"))"#(UY+.$)0F"#?(9AX:V()-)$"( Z$"-#")+.%*( [+4+'+-1( *"%,"#( &#-&-'+13)$"(/'"(-2()$+'()".$1+@/"(-1(DM=H7()-('-*4"1"/)#-1( ,+22/'+-1( %1,( 0/*)+&*+.%)+-1&#-5*"0'<( ( Z$+'( 6%'( )$"( 2+#')( &#-&-'%*( )-/'"( )$"( N-1)"( 7%#*-( )".$1+@/"( -1( %1"*".)#-1+.( ,+3+)%*( .-0&/)"#<( ( H*'-( +1( 9AX:?D1#+.-( E"#0+( $%,( EDYN=H7( UE+3/#"( 9V?( %0".$%1+.%*( %1%*-3( .-0&/)"#?( &#-3#%00",)-(#/1(N-1)"(7%#*-(&#-5*"0'<(( =1(9AX\?()$"2+#')( #/1'( -1( %,+3+)%*( .-0&/)"#)--O( &*%."( -1DM=H7( UE+3/#"( ;V<=1( )$"( *%)"( 9AXC'%1,( "%#*F( 9A]C'?0%1F(&%&"#'(6"#"6#+))"1( ,"'.#+5+13)$"( N-1)"( 7%#*-0")$-,( %1,( +)'/'"( +1( '-*4+13&#-5*"0'( +1#%,+%)+-1( %1,&%#)+.*"( )#%1'&-#)%1,( -)$"#( %#"%'<Z$"( 2+#')( -&"1N-1)"( 7%#*-.-12"#"1."( 6%'$"*,( %)( K7GH( +1)$"( '/00"#( -29AXA<( ( N%1F( -2( )$-'"( 0")$-,'( %#"( ')+**( +1( /'"( )-,%F( +1.*/,+13( )$"( #%1,-0( 1/05"#3"1"#%)+-1(0")$-,(/'",(+1(N7M!<
-'./+0%12%%3#45"
-'./+0%62%%7)89%:;8<%&*;='9.%-3>!45"
“Stanislaw Ulam, namesake of Stan and co-inventor of Monte Carlo methods, shown here holding the Fermiac, Enrico Fermi's physical Monte Carlo simulator for neutron diffusion.” (image from the Stan manual)
Bayesian Statistics
By Bayesian data analysis, we mean practical methods for making inferences from data using probability models for quantities we observe and about which we wish to learn.

The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis.
[Gelman et al., Bayesian Data Analysis, 3rd edition, 2013]
Background on Bayesian Statistics
From Bayes’ rule, supposing the data is fixed (observed):
p(θ|y) = p(y, θ) / p(y)
       = p(y|θ) p(θ) / p(y)
       = p(y|θ) p(θ) / ∫ p(y, θ) dθ
       = p(y|θ) p(θ) / ∫ p(y|θ) p(θ) dθ
p(θ|y) ∝ p(y|θ)p(θ) = p(y, θ)
- Everyone: model data as random
- Bayesians: model parameters as random
Background on Bayesian Statistics

From Bayes' rule:
p(θ|y, x) ∝ p(y|θ, x) × p(θ|x)
Posterior ∝ Likelihood × Prior

θ  Parameters
y  Dependent data (response)
x  Independent data (covariates/predictors/constants)

Posterior: the answer, probability distributions of the parameters
Likelihood: a computable function of the parameters, model specific
Prior: an "initial guess" that incorporates existing knowledge of the system
The key to building Bayesian models is specifying the likelihood function, just as in frequentist statistics.
Markov chain Monte Carlo (MCMC) in a nutshell
- We want to generate random draws from a target distribution (the posterior). We then identify a way to construct a 'nice' Markov chain whose equilibrium distribution is our target distribution.
- If we can construct such a chain, we start from an arbitrary point and iterate the Markov chain many times (like how we forecasted the weather n times). Eventually, the draws we generate will appear as if they are coming from our target distribution.
- There are several ways to construct 'nice' Markov chains (e.g., the Gibbs sampler, the Metropolis-Hastings algorithm).

(explanation from Cross Validated)

MCMC is really a way to solve integrals that are impossible to solve analytically.
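The 'equilibrium' requirement above can be written down precisely: the target density π (for us, the posterior) must be the stationary distribution of the chain's transition kernel K, i.e.

```latex
% Stationarity: if the current draw already follows the target
% density, one more transition step leaves the distribution unchanged.
\pi(\theta') = \int \pi(\theta)\, K(\theta' \mid \theta)\, d\theta
```

Samplers like Metropolis-Hastings guarantee this by construction (via detailed balance), which is why the draws eventually look like draws from the posterior.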
What does Stan do?
- Samples from the posterior distribution (if your model is specified correctly)
- "Fits" Bayesian models
- Empowers you to write your own Bayesian models; it's much easier than you think!

NUTS: the No-U-Turn Sampler, with automatic step-size and number-of-steps adaptation.
Why Stan?
There are tons of other “black-box” MCMC samplers out there (BUGS, JAGS, Church, PyMC, and many more: http://probabilistic-programming.org/wiki/Home)

- Stan is open source
- Built to be fast (about 10 times faster than BUGS, according to Gelman)
- “Stan can handle problems that choke BUGS and JAGS” – Andrew Gelman
Using Stan
Stan is a library with a number of interfaces; we will use the R interface, called RStan.
Stan model code (.stan) + setup script (.R)
  → RStan → Stan C++ code → compiler
  → model run (with data) → results
Example 1 – Fit Normal Distribution – Model code

Download from: http://bechtel.colorado.edu/~bracken/stan/example_models.zip
Example files in 1-normal: normal.stan/normal.R
data {
  int<lower=0> N; // error checking for N
  vector[N] y;
}
parameters {
  real<lower=0> sigma;
  real mu;
}
model {
  y ~ normal(mu, sigma); // vectorized
}
Example 1 – Fit Normal Distribution – RStan code
library(rstan)
model_file = 'normal.stan'
iterations = 500
N = 1000
mu = 100
sigma = 10
y = rnorm(N, mu, sigma)  # simulate data
stan_data = list(N=N, y=y)  # data passed to stan

# set up (compile) the model
stan_model = stan(model_file, data = stan_data, chains = 0)
# run the model
stanfit = stan(fit = stan_model, data = stan_data, iter = iterations)
print(stanfit, digits = 2)
Diagnostics - Text output
Inference for Stan model: normal.
4 chains, each with iter=500; warmup=250; thin=1;
post-warmup draws per chain=250, total post-warmup draws=1000.

          mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
sigma     9.84    0.01 0.19     9.47     9.72     9.84     9.98    10.23   366 1.01
mu       99.68    0.01 0.31    99.08    99.48    99.69    99.88   100.30   617 1.00
lp__  -2785.83    0.04 0.87 -2788.07 -2786.19 -2785.58 -2785.19 -2784.97   395 1.00

Samples were drawn using NUTS(diag_e) at Thu Feb 19 21:04:56 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
- thin, warmup
- n_eff
- R̂ (Rhat)
- lp__
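For reference, R̂ (the potential scale reduction factor) compares the within-chain variance W to the between-chain variance B over m chains of length n (Gelman et al., Bayesian Data Analysis):

```latex
% Pooled variance estimate and potential scale reduction factor
\widehat{\operatorname{var}}^{+}(\theta \mid y) = \frac{n-1}{n}\,W + \frac{1}{n}\,B,
\qquad
\hat{R} = \sqrt{\frac{\widehat{\operatorname{var}}^{+}(\theta \mid y)}{W}}
```

If the chains have mixed, B and W agree and R̂ → 1; values well above 1 signal non-convergence.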
Diagnostics - Traceplots
[Figure: traceplots of sigma, mu, and lp__ over 500 iterations]
Diagnostics - Posterior plots
[Figure: posterior histograms of extract(stanfit)$mu, extract(stanfit)$sigma, and extract(stanfit)$lp__]
Diagnostics - shinyStan
Old package (don't use it): https://github.com/jgabry/SHINYstan
Changing to: https://github.com/shinyStan/shinyStan

library("shinyStan")
launch_shinystan(stanfit)
Convergence - Traceplots
Converged:
[Figure: traceplots of beta_space_loc[1] through beta_space_loc[5] over 2000 iterations]
Not Converged:
[Figure: traceplots of gp_knot_value[1] through gp_knot_value[8] over 1000 iterations]
Convergence - Posterior plots
Converged:
[Figure: posterior histograms of beta_space_loc_01 through beta_space_loc_05]
Not Converged:
[Figure: posterior histograms of beta_space_loc_01 through beta_space_loc_05]
Example 2 – Fit normal distribution (fancier)
yn ∼ Normal(µ, σ)
The likelihood of observing a normally distributed data value is the normal density of that point given the parameter values.
Example files in 2-normal: normal2.stan/normal2.R
- priors
- initial values
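Stan lets you state priors explicitly in the model block. Here is a sketch of what a "fancier" normal model could look like (the actual normal2.stan may differ; the specific priors chosen below are illustrative assumptions, not the author's):

```stan
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // explicit, weakly informative priors (illustrative choices)
  mu ~ normal(0, 100);
  sigma ~ cauchy(0, 25);  // half-Cauchy via the lower=0 constraint
  y ~ normal(mu, sigma);
}
```

Initial values are supplied from the R side (e.g., via the init argument to stan()), not in the model file.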
Example 3 – fit GEV distribution
Generalized extreme value distribution
y ∼ GEV(µ, σ, ξ)
µ: location, σ: scale, ξ: shape
Example files in 3-gev: gev.stan/gev.R
- random seed
- constraints
- variable constraints
- initial values
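Stan has no built-in GEV distribution, so a GEV fit needs a user-defined log density. The sketch below is an assumption about how such a model might look (the actual gev.stan may differ); it shows the ξ ≠ 0 branch only, and illustrates why the "variable constraints" bullet matters: the support of the GEV itself depends on the parameters.

```stan
functions {
  // GEV log density for xi != 0 (hypothetical helper, not built in):
  // z_n = 1 + xi*(y_n - mu)/sigma must be positive for every y_n
  real gev_lpdf(vector y, real mu, real sigma, real xi) {
    int N = num_elements(y);
    vector[N] z = 1 + xi * (y - mu) / sigma;
    if (min(z) <= 0)
      reject("y outside GEV support for these parameters");
    return -N * log(sigma)
           - (1 + 1 / xi) * sum(log(z))
           - sum(exp(-log(z) / xi));  // elementwise z^(-1/xi)
  }
}
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;              // location
  real<lower=0> sigma;  // scale
  real xi;              // shape
}
model {
  y ~ gev(mu, sigma, xi);  // uses gev_lpdf above
}
```

In practice the support condition is better handled with parameter-dependent bounds (or careful initial values) than with reject(), which only makes the sampler back away from invalid proposals.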
Example 4 - multiple linear regression
yn = α + βxn + εn
where εn ∼ Normal(0, σ), which can be written:
yn − (α + βxn) ∼ Normal(0, σ)
yn ∼ Normal(α + βxn, σ)
The likelihood of observing a given yn is just the normal density with mean α + βxn and standard deviation σ.

Example files in 4-multiple-linear-regression: mlr.stan/mlr.R
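A regression model of this form might be sketched in Stan as follows (an illustrative sketch, not necessarily the contents of mlr.stan; the names N, K, X, y are assumptions):

```stan
data {
  int<lower=0> N;   // number of observations
  int<lower=0> K;   // number of predictors
  matrix[N, K] X;   // predictor matrix
  vector[N] y;      // response
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  // vectorized likelihood: mean is alpha + X * beta for every row
  y ~ normal(alpha + X * beta, sigma);
}
```

With flat (default) priors, the posterior mode here coincides with the ordinary least-squares fit.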
Example 5 - Binomial regression (glm)
g(E[yn]) = α + βxn

yn ∼ Bernoulli(g−1(α + βxn)), with g the logit link
Example files in 5-binomial-logit: binomial-logit.stan/binomial-logit.R

- Hierarchical
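A minimal logistic-regression sketch using Stan's built-in bernoulli_logit, which applies the inverse-logit link internally (illustrative; not necessarily the contents of binomial-logit.stan):

```stan
data {
  int<lower=0> N;
  vector[N] x;
  int<lower=0, upper=1> y[N];  // binary outcomes
}
parameters {
  real alpha;
  real beta;
}
model {
  // equivalent to y[n] ~ bernoulli(inv_logit(alpha + beta * x[n])),
  // but vectorized and numerically more stable
  y ~ bernoulli_logit(alpha + beta * x);
}
```

The hierarchical version would add group-level parameters (e.g., a varying intercept per group) with their own priors.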
Stan tips and tricks

#1 tip: Read the Manual! It is excellent.

Other things we didn't really talk about:

- Local variables in the model block, can be used to store intermediate results
- Matrices vs arrays, column vectors vs row vectors
- Constrained data types
- Transformed parameters
- Functions
- Logical operations / other types of looping
- Elementwise operators
- Built-in functions
- Print statements
- Missing data
- Prediction
- Discrete variables
Stan tips and tricks

No need to truncate priors; do that in the parameter bounds.

- BAD: setting constraints on parameters but using a prior with other constraints

parameters {
  real alpha; // implies no constraints
}
model {
  alpha ~ uniform(0,1);
}

- GOOD:

parameters {
  real<lower=0,upper=1> alpha;
}
model {
  // alpha ~ uniform(0,1); // default uniform priors
}
Stan tips and tricks
- No need to use conjugate priors
- Unlike BUGS (or other Gibbs-based samplers), avoid super vague priors if you can, e.g. inv_gamma(0.1,0.1)
- When in doubt, use a normal prior, or google it
- The Stan mailing list is very active
Speeding up Stan models

- Avoid repeated operations

// 1/alpha is repeated
for (n in 1:N)
  y[n] ~ exponential(1/alpha * x[n]);

- Vectorization is always faster

// not vectorized
for (n in 1:N)
  y[n] ~ normal(beta0 + beta1 * x[n], sigma);

// vectorized
y ~ normal(beta0 + beta1 * x, sigma);
- Priors: the more informative the better (think better initial conditions); use MLE to get initial estimates
- Parallelization: you can run multiple chains if you have multiple cores, but each chain is still serial
- More advanced: access increment_log_prob directly
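The repeated-operation snippet above can be repaired with a local variable plus vectorization; a minimal sketch, assuming exponentially distributed y with rate proportional to x (names are illustrative):

```stan
data {
  int<lower=0> N;
  vector<lower=0>[N] x;
  vector<lower=0>[N] y;
}
parameters {
  real<lower=0> alpha;
}
model {
  // compute the repeated term once, outside the likelihood
  real inv_alpha = 1 / alpha;
  y ~ exponential(inv_alpha * x);  // vectorized rate
}
```

The division now happens once per leapfrog step instead of N times, and the vectorized sampling statement lets Stan share work across observations.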