Bayesian Methods, Lectures I+II
U. von Toussaint
Max-Planck-Institute for Plasma Physics
Garching
Recommended Literature:
Introduction: Data Analysis, 2nd edition (D.S. Sivia); Bayesian Logical Data Analysis (Phil Gregory)
General: Bayesian Probability Theory (W. von der Linden); Pattern Recognition and Machine Learning (C.M. Bishop); Information Theory, Inference and Learning Algorithms (D.J.C. MacKay); Bayesian Reasoning (D. Barber)
Review article: Bayesian Inference in Physics, Reviews of Modern Physics 83, p. 943-999 (2011) :-)
Introduction: the Scientific Inference Cycle

I. Experiment → II. Evaluation (Parameter Estimation) → III. Hypothesis building (Model Selection) → IV. Hypothesis verification (Experimental Design) → back to I. Experiment

Prerequisite: consistent reasoning...
Outline

I. Scientific Inference: Inference in Science; Processing of Information
II. Parameter Estimation: Non-Gaussian Likelihoods; Change Points; Integral Eq. / Deconvolution
III. Model Comparison: Basic Concept; Mass Spectroscopy; Background Subtraction
IV. Numerical Interlude: Integration Methods
V. Experimental Design: Basic Concept; Optimized Observation Times; Nuclear Reaction Analysis
VI. 'Non-parametric' Models: Gaussian Processes
Inference in Science

Science: prior information + new data → new knowledge

Prior information (a continuous learning process): old data; calibration measurements, validation data; theoretical considerations; parameters; model.
New data: measurements; calibration; theories.

But: prior information and data are uncertain; new knowledge and derived hypotheses are uncertain; inductive (instead of deductive) reasoning is required → a calculus for "uncertainties" is needed (quantification).
Inference in Science
Science: prior information + new data → new knowledge
a) Deductive reasoning: cause → effects or outcomes
Example: fair coin: p(7 heads | 10 tosses) = ?

b) Inductive reasoning: effects or observations → possible causes
Example: 7 heads out of 10 tosses: p(fair coin) = ?
Inference in Science
Calculus (Cox, Knuth, Skilling, ...): basic requirements (e.g. symmetry, consistency) single out the usual rules of probability theory to handle uncertainty.
Please note: probability here is not restricted to the frequency interpretation: degree of belief about a proposition.

Notation: p(x|I) describes how probability (plausibility) is distributed among the possible choices for x for the case at hand (information I).

Data uncertainties: statistical (counting statistics); measurement uncertainty (ruler); systematic (e.g. misalignment); outliers.
Hypotheses: parameters of interest; nuisance parameters; physical models; future data.
Inference in Science
Bayesian interpretation: p(x|I) describes how probability (plausibility) is distributed among the possible choices for x for the case at hand (information I): p is distributed; x has a single, uncertain value.

Frequentist interpretation: p(x|I) describes how x is distributed throughout an infinite (hypothetical) ensemble: probability = frequency; x is distributed.

Calculus (Cox, Knuth, Skilling, ...): basic requirements (e.g. symmetry, consistency) single out the usual rules of probability theory to handle uncertainty.
Processing of information
Processing of information: Combination of (cond.) probability distributions
Conditional probabilities:
p(A|B): probability of proposition A given the truth of proposition B: quantification of the uncertainty of A.

Probability theory axioms:
- sum rule (OR): p(A|B) + p(Ā|B) = 1
- product rule (AND): p(A,B|I) = p(A|B,I) p(B|I)
→ Bayes' theorem
Processing of information
Bayes' theorem:
p(H|d,I) = p(d|H,I) p(H|I) / p(d|I)   (Posterior = Likelihood × Prior / Evidence)

Prior: knowledge before the experiment (logically)
Likelihood: probability for the data if the hypothesis were true
Posterior: probability that the hypothesis is true, given the data
Evidence: normalization; important for model comparison

The Maximum Likelihood approach (the parameters which maximize the probability for the data) does not yield the most likely parameters*!
*) in general, except e.g. for flat, unbounded priors

Examples: p(wet street | rain, I) ≠ p(rain | wet street, I);  p(female | pregnant, I) ≠ p(pregnant | female, I)
Processing of information
Toy Example: mass of an elementary particle

- Prior knowledge: 0 ≤ m ≤ m_upper = 0.2
- Measurement uncertainty: σ = 0.15
- Measured data: d₁ = 0.09, d₂ = −0.2, d₃ = 0.05

Maximum likelihood: m = −0.02 (< 0!)

Bayes: 1) Assign the prior: flat on 0 ≤ m ≤ m_upper, zero outside
2) Assign the likelihood: p(d|m,σ,I) = ∏_i N(d_i; m, σ²)
3) Compute the posterior using Bayes' theorem
Processing of information
Posterior: contains the complete information.

Summarizing quantities:
- mode: m = 0
- mean: ⟨m⟩ = 0.06
- 95%-interval: [0; 0.145]
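A minimal grid-based numerical sketch of this toy example (not from the original lecture; it assumes the flat prior on [0, 0.2] and the Gaussian likelihood above):

```python
import numpy as np

# Toy example: posterior for the particle mass m with flat prior on
# [0, m_upper] and Gaussian likelihood (sigma = 0.15), data as on the slide.
d = np.array([0.09, -0.2, 0.05])
sigma = 0.15
m = np.linspace(0.0, 0.2, 2001)          # prior support 0 <= m <= 0.2

loglike = (-0.5 * ((d[None, :] - m[:, None]) / sigma) ** 2).sum(axis=1)
post = np.exp(loglike - loglike.max())
post /= np.trapz(post, m)                # normalized posterior p(m|d,I)

mean = np.trapz(m * post, m)
cdf = np.cumsum(post) * (m[1] - m[0])
print(m[np.argmax(post)], mean, m[np.searchsorted(cdf, 0.95)])
# -> mode m = 0, mean ~ 0.06, 95% upper bound ~ 0.145 (cf. the slide)
```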
Non-Gaussian likelihoods: e.g. for d₃ with β = 0.05.
Processing of information
Marginalization: generalized error propagation

Reasoning about a parameter a: (uncertain) prior information p(a|I); (uncertain) measured data d = D ± ε; physical model D = f(a).

Bayes' theorem:
p(a|d) = p(a) · p(d|a) / p(d)
with prior distribution p(a), likelihood distribution p(d|a), and posterior distribution p(a|d).

Adding an additional (nuisance) parameter b:
p(a|d) = ∫db p(a,b|d) = (1/p(d)) ∫db p(a,b) p(d|a,b)
Clear recipe how to tackle a problem – possibly demanding mathematics/numerics
1) Quantify the information at hand in probability distributions
2) Multiply the probability distributions
3) Marginalize the nuisance parameters
4) Analyze the posterior distribution
Outline

II. Parameter Estimation: Non-Gaussian Likelihoods; Change Points; Deconvolution
Beyond Gauss...
Least square approach:
• Computationally very convenient
• Supported by the law of large numbers
• De-facto standard in the scientific community
Model equation: d_i = μ + ξ_i with ξ_i ~ N(0, σ_i²)
Likelihood:
p(d|μ,σ,I) = ∏_i 1/(√(2π) σ_i) · exp[ −(d_i − μ)² / (2σ_i²) ]

which can be rewritten as a Gaussian in μ with mean value and variance
μ = σ² Σ_i d_i/σ_i²,   σ² = 1 / Σ_i (1/σ_i²)
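A small numerical illustration of these combination formulas (the data values here are made up):

```python
import numpy as np

d = np.array([1.2, 0.9, 1.05])      # example data d_i
s = np.array([0.1, 0.2, 0.15])      # per-datum uncertainties sigma_i

w = 1.0 / s**2                      # weights 1/sigma_i^2
var = 1.0 / w.sum()                 # sigma^2 = 1 / sum_i(1/sigma_i^2)
mu = var * (w * d).sum()            # mu = sigma^2 * sum_i(d_i/sigma_i^2)
print(f"mu = {mu:.4f} +- {np.sqrt(var):.4f}")
```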
Least-squares approach:
• The result should be disturbing: the actual data scatter is not taken into account!
Beyond Gauss...
Least-squares approach:
• The result should be disturbing: the data scatter is not taken into account. Transformations of the data with d_i' = d_i + γ(d_i − μ) do not change the estimated mean and variance! Both data sets yield identical results for mean and variance.
• A particular property of least-squares estimation: the uncertainty estimate has to be correct.
Beyond Gauss...
CODATA-Group:
Panel of experts, evaluating every 5-10 years the best estimates (and uncertainties) of the physical constants, e.g. the elementary charge e, Planck's constant h, the gyromagnetic ratio of the proton γ_p, ...
Non-trivial because of the coupling of the constants and the quality of the input data.
Beyond Gauss...
Applied Method:
Least-Squares
• Deviations from the assumed probability distribution are to be expected: outliers → mixture models (Sivia 1996, Press 1998, ...); wrong uncertainty estimates; small experimental inaccuracies; ...
→ Use generalized distributions: A) generalized Gauss

p(d|G,σ,α) = 1/(2√2 σ Γ(1 + 1/α)) · exp[ −|(d − G)/(√2 σ)|^α ]

Includes the standard Gaussian for α = 2.
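A quick sketch checking the normalization and the α = 2 limit of this density (function name and test values are illustrative):

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import norm

def gen_gauss(d, G, sigma, alpha):
    """Generalized Gaussian as written above."""
    z = np.abs((d - G) / (np.sqrt(2.0) * sigma))
    return np.exp(-z**alpha) / (2.0 * np.sqrt(2.0) * sigma * gamma(1.0 + 1.0 / alpha))

d = np.linspace(-3.0, 3.0, 7)
print(np.allclose(gen_gauss(d, 0.0, 1.0, 2.0), norm.pdf(d)))   # -> True
```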
Beyond Gauss...
α needs to be marginalized: p(d|G,σ) = ∫dα p(α) p(d|G,σ,α)

Which prior for α?
• Definitely bounded
• Jeffreys' prior inapplicable (singular Fisher information matrix)
• Flat prior clearly ruled out
Beyond Gauss...
α needs to be marginalized: p(d|G,σ) = ∫dα p(α) p(d|G,σ,α). Which prior for α?

Approach suggested by e.g. Zellner [1]: maximize the information distance between prior and posterior,
p(α) ~ exp(I(α)), with
I(α) = ∫dd p(d|G,σ,α) ln p(d|G,σ,α)
resulting in, for 0 < α < α₀:
p(α) = (1/Z_α₀) exp(−1/α) / Γ(1 + 1/α)
Beyond Gauss...
• The generalized Gauss still assumes perfect knowledge of the uncertainty σ. In reality only a point estimate s of σ is known (see e.g. [3]): write σ_i = s_i/ω_i, i.e.
p(σ_i|s_i,ω_i) = δ(σ_i − s_i/ω_i)

The MaxEnt principle specifies p(ω) (expectation value = 1):
p(ω) = exp(−Σ_i ω_i)

Then:
p(d|G,s) = ∫dω p(ω) ∫dσ ∏_i δ(σ_i − s_i/ω_i) p(d_i|G,σ_i)

resulting in
p(d|G,s) = ∏_i [1/(s_i √(2π))] · [1/(2x_i²)] · [1 − (√π/(2x_i)) exp(1/(4x_i²)) erfc(1/(2x_i))],   with x_i = (d_i − G)/(s_i √2)
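A numerical sanity check of this marginalized likelihood (a sketch; it uses the overflow-safe scaled function erfcx(z) = exp(z²) erfc(z) instead of the literal exp·erfc product):

```python
import numpy as np
from scipy.special import erfcx          # erfcx(z) = exp(z**2) * erfc(z)

def p_marg(d, G, s):
    x = np.abs(d - G) / (s * np.sqrt(2.0))   # likelihood is symmetric in d - G
    bracket = 1.0 - np.sqrt(np.pi) / (2.0 * x) * erfcx(1.0 / (2.0 * x))
    return bracket / (2.0 * x**2 * s * np.sqrt(2.0 * np.pi))

for d in [1e-3, 0.1, 1.0, 10.0, 100.0]:
    print(d, p_marg(d, 0.0, 1.0))
# -> finite value ~ 1/sqrt(2 pi) for d -> 0; ~ 1/d^2 falloff (heavy tails)
```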
Beyond Gauss: Marginalized Uncertainty
Properties of p(d|G,s):
• Heavy tails: ~ 1/x_i²
• Singularities cancel: for x_i → 0 the exp·erfc term behaves as ~ 1 − 2x_i², so the likelihood remains finite.
Beyond Gauss: Marginalized Uncertainty
• Data for Newton's constant of gravitation (offset from the median):
G [10⁻¹¹ m³ kg⁻¹ s⁻²]
LS: 6.674 274 (+67; −67);  GG: 6.674 200 (+71; −103);  GM: 6.674 233 (+49; −88)
Beyond Gauss: Marginalized Uncertainty
• Does the data support the use of LS?
- Birge ratio R_B = √(χ²/(N−1)) = 2.35 (instead of 1)
  Approach by CODATA: multiply the LS uncertainty by 10.
- Posterior estimate of p(α|d,σ): strong deviation from α = 2

Validation: generation of mock data with the same mean and variance but different R_B:
is the large spread just a small-sample effect (N_data = 8)?
Beyond Gauss: Marginalized Uncertainty
• p(α|d,σ) should center around α = 2 for increasing sample size.
Validation: generation of mock data with the same mean and variance and R_B = 1:

N_D   ⟨α⟩    σ
 8    4.26   2.09
16    3.58   1.49
32    2.30   0.46
64    1.96   0.25
Beyond Gauss: Marginalized Uncertainty
•Robustness against outliers of the three approaches:
- use of additional fictitious data point
- response of estimate
Conclusion: LS unacceptable; GM very good; GG very similar.
Beyond Gauss: Marginalized Uncertainty
Extensions to multivariate problems: elementary charge and Planck's constant from the Josephson constant K_J and the von Klitzing constant R_K (K_J = 2e/h; R_K = h/e²) [2]
Beyond Gauss: Marginalized Uncertainty
Comparison of LS, GG, GM
• Both GG and GM take into account deviations from (assumed) Gaussian distribution
• GM exhibits outlier robustness with sensible prior assumptions
→ The GM approach is recommended whenever doubts about the (Gaussian) data distribution exist.
Open issue: - Appropriate error model?
→ Mixture modelling
Beyond Gauss: Summary I
Combining incompatible measurements
• Fraction β of data may suffer from mis-specification: Mixture models
• Assignment is a-priori unknown
Mixture Models: Outlier Detection
• Some data points exhibit a wrong scaling of their uncertainty, e.g. σ → ωσ.
Combining incompatible measurements
Mixture Models: Outlier Detection

Likelihood: each datum is either regular or has a wrongly scaled uncertainty, giving a two-component mixture,
p(d_i|μ,σ,ω,β,I) = (1 − β) p_A(d_i|μ,σ) + β p_A(d_i|μ,ωσ)
with outlier fraction β and scaling factor ω. The form of the likelihood p_A depends on the assumed uncertainty properties.
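A minimal sketch of this mixture classification for the slide's example values (it assumes μ, β and ω are known, purely for illustration; in the full treatment they are inferred):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, beta, omega, sigma, mu = 100, 0.1, 10.0, 0.1, 1.0     # slide's example values
is_out = rng.random(N) < beta
d = mu + rng.normal(0.0, sigma * np.where(is_out, omega, 1.0))

like_ok = norm.pdf(d, mu, sigma)                 # regular component
like_out = norm.pdf(d, mu, omega * sigma)        # inflated-uncertainty component
mix = (1.0 - beta) * like_ok + beta * like_out   # mixture likelihood per datum

p_out = beta * like_out / mix                    # Bayes theorem per datum
print("flagged:", int(np.sum(p_out > 0.5)), "true outliers:", int(is_out.sum()))
```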
Mixture Models: Outlier Detection
Example: N=100, β=0.1, ω=10, σ=0.1
Mixture Models: Outlier Detection
Posterior distribution for the mean μ and the outlier probability β:
mean μ and outlier probability β are uncorrelated and independent of s and ω → use of flat priors.
Typically brute-force numerics is needed; here a β-marginalization is possible:
Mixture Models: Outlier Detection
Interlude: the β-distribution and β-function
• Often relevant for probabilities on probabilities (Bernoulli problems)
• Definition of the beta function: B(a,b) = ∫₀¹ dx x^(a−1) (1−x)^(b−1) = Γ(a)Γ(b)/Γ(a+b)
Mixture Models: Outlier Detection
The β-distribution, defined on the domain x ∈ [0,1]:
p(x|a,b) = x^(a−1) (1−x)^(b−1) / B(a,b)
Mixture Models: Outlier Detection
The β-distribution:
• Very flexible distribution on [0,1]
Mixture Models: Outlier Detection
Rewrite in terms of β-distributions:
the coefficients C can be derived from the equality of the two representations and the resulting recurrence relation (υ = 1, ..., n).
Mixture Models: Outlier Detection
Marginal distribution for β:
Mixture Models: Outlier Detection
• Example: N=100, β=0.1, ω=10, σ=0.1, μ=1
Detection of changes in a data series: Are there changes? How many? Where?

Example: time series of the onset of cherry blossom*

Plausible hypotheses:
- constant onset
- linear trend
- change point(s): piecewise linear
Change point analysis
Detection of changes in a data series: Are there changes? How many? Where?
Example: change in mean and variance at an unknown location
Change point analysis
Example: Change in mean and variance at unknown location m
Change point analysis
Completing the square in the exponent (standard trick for Gaussians):
Mean and variance are required but unknown: marginalization→ need prior distributions
Example: Change in mean and variance at unknown location m
Change point analysis
Marginalization over mean μ and variance σ :
Plugging in the prior distributions:
Integration over the mean (limit W → ∞): standard Gaussian integral
Change point analysis
Marginalization over variances σ : again of standard type
Which yields
(factor ½ missing? (UvT))
Prior for change point: flat
Change point analysis
Now all terms are specified: compute posterior prob. for all values of m
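A minimal numerical sketch of this change-point posterior (assuming flat priors on the segment means and Jeffreys priors 1/σ on the segment standard deviations; the per-segment marginalization is the standard textbook result):

```python
import numpy as np
from scipy.special import gammaln

def log_seg(d):
    """Log marginal likelihood of one segment, mean and sigma integrated out."""
    n = len(d)
    Q = np.sum((d - d.mean())**2)                    # residual sum of squares
    return (-0.5 * (n - 1) * np.log(2.0 * np.pi) - 0.5 * np.log(n)
            + gammaln(0.5 * (n - 1)) - 0.5 * (n - 1) * np.log(0.5 * Q))

rng = np.random.default_rng(0)
d = np.concatenate([rng.normal(0.0, 1.0, 80),        # change point at m = 80
                    rng.normal(1.5, 2.0, 40)])

ms = np.arange(2, len(d) - 2)                        # >= 2 points per segment
logp = np.array([log_seg(d[:m]) + log_seg(d[m:]) for m in ms])
post = np.exp(logp - logp.max()); post /= post.sum() # flat prior for m
print("MAP change point:", ms[np.argmax(post)])      # close to 80
```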
Change point analysis
The data have been generated with change point m = 80.
Please note: so far no pdf for the mean and the variance is given. Exercise: derive the equations for the mean and variance distributions in both parts.
Detection of changes in a data series: Are there changes? How many? Where?

Example: time series of the onset of cherry blossom*

Plausible hypotheses: constant onset; linear trend; change point(s): piecewise linear

- Model: piecewise linear, with change point E = x_k
- Likelihood: ~

*) V. Dose, A. Menzel, GCB 10, p. 259 (2004)
Change point analysis
- Posterior for the change point position E:
- How about f, σ in the likelihood? They are nuisance parameters → marginalize them.

Maximum: mid-eighties (~1985); the same result is found for several other species.

Broad probability distribution of the change points: prediction requires weighted averaging, with rapidly increasing confidence limits.
Change point analysis
Linear deconvolution of apparatus function: - Ubiquitous problem in physics: Most measurements suffer from limited resolution
Example: Rutherford Backscattering Analysis
- 2.6 MeV 4He on Co/Si Isotopically pure
apparatus function A
- 2.6 MeV 4He on Cu/Si spectra to be deconvolved
Convolution: d_i = D_i + ε_i = Σ_{j=1..N} A_ij f_j + ε_i
Deconvolution: f_uc = A⁻¹ d
Noise and prior knowledge have to be taken into account...

Deconvolution I
Likelihood function: Poisson (counting experiment):
p(d|D) = ∏_{i=1..N} (D_i^{d_i}/d_i!) exp(−D_i),   with D_i = Σ_j A_ij f_j

Prior knowledge (in this case):
- neighboring channels of f are correlated (unknown correlation length): adaptive kernels
- positive (additive) distribution h: entropic prior (Skilling 1991),
p(h|α,I) ~ exp(α S(h)),   with S = −∫dx h(x) ln[h(x)/m(x)]

Deconvolution I
Combining everything:
Likelihood x prior:
Posterior for f:
Deconvolution I
R. Fischer et al, PRE55,6667 (1997)
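A MAP-level sketch of this combination (Poisson likelihood, entropic prior with the full Skilling form S = Σ(f − m − f ln(f/m)); the apparatus function, test spectrum, and hand-picked α are made up for illustration, unlike the paper's full treatment):

```python
import numpy as np
from scipy.optimize import minimize

n = 50
x = np.arange(n)
A = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 2.0)**2)    # Gaussian apparatus fct.
A /= A.sum(axis=1, keepdims=True)

f_true = 5.0 + 80.0 * np.exp(-0.5 * ((x - 25) / 1.5)**2)   # sharp true spectrum
rng = np.random.default_rng(2)
d = rng.poisson(A @ f_true)                                # counted data

m, alpha = np.full(n, float(d.mean())), 1.0                # default model, weight
def neg_log_post(u):
    f = np.exp(u)                                          # f = exp(u) > 0
    D = A @ f
    loglike = np.sum(d * np.log(D) - D)                    # Poisson (no d_i! term)
    S = np.sum(f - m - f * np.log(f / m))                  # entropic prior S(f)
    return -(loglike + alpha * S)

res = minimize(neg_log_post, np.log(m), method="L-BFGS-B")
f_map = np.exp(res.x)
print("recovered peak:", f_map.max(), "true peak:", f_true.max())
```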
Nonlinear deconvolution: RBS-Depth profile estimation
RBS - Rutherford Backscattering Spectrometry
- Widely used for near-surface layer analysis of solids (2-20 µm)
- Elemental composition and depth profiling of individual elements
- Spectra depend nonlinearly on the elemental composition, due to stopping power and energy-dependent cross sections: D_i = g(c(x))

Deconvolution II

[Figure: scattering geometry (energy-sensitive detector; angles α, β, θ) and RBS spectra (counts vs. channel: initial, 550°C, 650°C, simulation) for Ag (1.2×10¹⁸ at/cm²) and TiN (1.0×10¹⁷ at/cm²) layers on a Si substrate.]
Nonlinear deconvolution: RBS-Depth profile estimation
Erosion / Deposition in ASDEX Upgrade Divertor: Isotope marker experiment: 13C/12C-sample
Deconvolution II
[Sketch: ¹³C marker layer on ¹²C, before and after plasma exposure.]
Evaluation hampered by overlapping signal contributions: quantitative evaluation challenging:
Nonlinear forward modeling coupled with adaptive kernels
Nonlinear deconvolution: RBS-Depth profile estimation *)
Erosion / Deposition in ASDEX Upgrade Divertor: Isotope marker experiment
Deconvolution II
[Sketch: ¹³C marker layer on ¹²C, before and after exposure, with the reconstructed depth profiles.]
Simultaneous erosionand deposition:
Long standing tension with spectroscopic measurements mitigated
U. von Toussaint et al, NJP, 1999
Outline

III. Model Comparison: Basic Concepts; Mass Spectroscopy; Background Subtraction
Model Comparison
How many boxes are in the picture*)?
Desired: Occam's Razor (Prefer simpler models (that fit the data))
*) MacKay, Information Theory
Model Comparison
Posterior for θ
Bayesian Approach:
• Bayesian model comparison always requires alternative models
Model Comparison
*) T. Loredo, Garching, 2003
Mass Spectrometry

Quadrupole mass spectrometry with electron impact ionization: A + e⁻ ⇒ A⁺ + 2e⁻

Versatile tool for neutral gas analysis: fast, high dynamic range, flexible, ...

Electron impact ionization can lead to complex, instrument-dependent fragmentation patterns:
CH₄ + e⁻ ⇒ CH₃⁺ + H + 2e⁻;  CH₂⁺ + 2H + 2e⁻;  CH⁺ + 3H + 2e⁻;  C⁺ + 4H + 2e⁻;  H⁺ + CH₃ + 2e⁻;  ... (+ ¹³C, D) ...

[Figure: ion source sketch (ion current I_ions, voltage ΔU) and QMS signal (counts/sec, log scale) vs. mass/charge (amu/e) for CH₄ at E_e⁻ = 70 eV.]
Mass Spectrometry
"a mass spectrometrist is someone,who figures out what something is,
by smashing it with a hammerand looking at the pieces"Quote from Th. Schwarz-Selinger
Mass Spectrometry
Model: d = C · x, with cracking matrix C (C and x are uncertain/unknown)
Likelihood: with measurement uncertainties S; E = number of species

Prior terms:
- concentrations x: depending on the experiment (e.g. gas mixture)
- cracking matrix C: exponential prior based on the point estimates of Cornu & Massot (1979)

Probability for a set of species E:
p(E|d,S,I) ∝ p(E) ∫dC ∫dx p(C,x|E,I) p(d|C,x,S,E,I)

→ High-dimensional integrals! MCMC integration
Mass Spectrometry
Measured data: [QMS spectra with 4 orders of magnitude dynamic range]
[Figure: relative intensity (a.u., log scale) vs. mass/charge (amu/q) for a CH₄ plasma; data compared with model 1 and model 6.]

Mass Spectrometry

Input data: signal + error of 34 mass channels for 27 different plasma conditions; calibration measurements + error for 11 species.
Prior: cracking estimates for 14 species.
Output: cracking patterns and concentrations (+ errors!) for 14 species (e.g. C₄H₂, C₃H₆); the 'shifted' cracking pattern for the radical CH₃ is very bad.

*) Th. Schwarz-Selinger, H. Kang, U. von Toussaint et al., various publications (2001-2007)
Mass Spectrometry
model comparison with „Occams Razor“CH3, C2H5, H
fitting noise!0 1 2 3 4 5 6
150
1000
0.0
0.5
1.0
1.5
mis
fit:
1/S
2 (
d -
C ⋅
x)2
975
Ba
yes:
ln
P(E
|D,S
,I)
number of radicals
125
How many radicals are in the plasma?
(<χ2 >
/N)-
1
Neural Networks
• Neural Networks: non-linear parameterized mapping y=y(x,w)
• Workhorse: feedforward network
[Diagram: input layer → hidden layer → output layer]
h_i = f₁( Σ_{k=1..K} w_ik⁽¹⁾ x_k + b_i⁽¹⁾ ),   z = f₂( Σ_{i=1..I} w_i⁽²⁾ h_i + b⁽²⁾ )
• Neural Networks are trained by minimizing an error function ED
• Large non-linear optimization problem: Backpropagation algorithm for weight updates
• Sigmoidal activation functions: f(t) = 1/(1 + e⁻ᵗ)
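A minimal sketch of this mapping (sigmoidal f₁, linear f₂; shapes and random weights are purely illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

K, I = 3, 5                                            # inputs, hidden neurons
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(I, K)), rng.normal(size=I)   # w^(1), b^(1)
w2, b2 = rng.normal(size=I), rng.normal()              # w^(2), b^(2)

def network(x):
    h = sigmoid(W1 @ x + b1)        # h_i = f1(sum_k w_ik^(1) x_k + b_i^(1))
    return w2 @ h + b2              # z = f2(sum_i w_i^(2) h_i + b^(2)), f2 linear

print(network(np.array([0.1, -0.4, 0.7])))
```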
Soft-X-Ray Diagnostic at W7-X
•Currently favoured setup
(A. Weller):
• Present algorithms for
the tomography problem:
- matrix inversion
- regularized inversion (curvature)
- Maximum Entropy approaches
- Bayesian approaches
•Problem: Algorithms either unreliable or too slow for monitoring (~10ms)
• Solution: Use of fast Neural Networks
Neural Networks
•Problem:
How many cells are necessary?
Trade-off between bias and variance
Approaches to control the complexity:
•Early stopping (requires validation data set)
•Regularization
E = E_D + α Ω(w)
Neural Networks
Regularizers: A few examples
• Simple weight decay: Ω(w) = (1/2) Σ_i w_i²
• Modified weight decay: Ω(w) = Σ_i w_i² / (w_i² + γ²)
• Curvature-based regularizers: Ω(w) = (1/2) Σ_{n,k} (∂²y_n/∂x_k²)²
•Regularizers can be interpreted as prior probability distributions over parameters (MacKay):
Bayesian Probability Theory can be applied
•But: „Arbitrary“ choice of prior...
Hyperplane Priors
Interpretation of hidden neuron output:
g_i = Σ_k w_ik⁽¹⁾ x_k + b_i⁽¹⁾ = b_i⁽¹⁾ (Σ_k a_ik x_k + 1)

g_i = 0 defines a (K−1)-dimensional hyperplane in the K-dimensional x-space.
•Orientation of hyperplane depends on network weights
• If there is a prior for hyperplanes → prior for the first layer of hidden neurons
Hyperplane Priors
• Prior distribution for the weights a:
a_i1 x_1 + a_i2 x_2 + ... + a_iK x_K + 1 = 0

• The prior has to be independent of coordinate transforms (translation, rotation):
p(a_i1, ..., a_iK) da_i1 ... da_iK = p(a'_i1, ..., a'_iK) da'_i1 ... da'_iK

• The invariances determine the prior [1]:
p(a_i1, ..., a_iK) = (1/Z) · 1 / (1 + a_i1² + ... + a_iK²)^((K+1)/2)

• Well-known special case, the straight-line prior for y = mx + t:
p(m,t) = (1/Z) · 1 / (1 + m²)^(3/2)
[1] V. Dose, Bayesian Inference and ME Methods, 22nd International Workshop, AIP, 2003
Hyperplane Priors
• b determines the slope of the decision boundary: testable information
• Slope: s_i² = b_i² Σ_k a_ik²
• Prior according to the Maximum Entropy principle:
p(b|a,λ,I) = (1/Z) exp(−(λ/2) Σ_i b_i² Σ_k a_ik²)
• The prior is marginalized over λ employing Jeffreys' prior.
• The prior is finally given by
p(a,b|I) = (1/Z) · [Σ_i b_i² Σ_k a_ik²]^(−N/2) · ∏_{i=1..N} 1/(Σ_k a_ik²)

Uff...
Moving Plasma
• Simulated data set of a horizontal shift of the plasma
• Same noise level as before
• Note the absence of vertical movements
Reliability of reconstruction
•How trustworthy is the reconstruction?
a) Use NN result to compare measured and reconstructed sensor data (necessary but
not sufficient for inverse problems)
b) Additional check:
Different networks differ outside the "known" regime → indication of unexpected events
Complementary verification methods
An extreme example of model comparison: Neural Networks without training data
U. von Toussaint et al, JAO, 2006
Model Comparison
Outlier detection – Background estimation
Form-free background estimation: the analytical form is unknown.
Signature of the background: smooth variation. Find regions without signal contribution and fit a smooth function to this background data.
Suited models: e.g. cubic splines, thin-plate splines, ...

b(x) = Σ_{v=1..E} Φ(x, ξ_v) c_v

with Φ... spline basis; E... number of spline knots (support points); ξ... spline knot positions; c_v... spline coefficients.

Signal-background separation by a minimal knot spacing Δx. Quantification: mixture model

B_i: datum d_i is purely background: d_i = b_i(c, ξ, E) + ε_i
B̄_i: d_i contains a signal contribution: d_i = b_i(c, ξ, E) + ε_i + s_i

(Example: PIXE spectrum)
Outlier detection – Background estimation
Assigning the respective likelihoods: however, the signal contribution s_i is unknown → marginalize it (∫ds_i ...) with an average signal strength s₀, yielding the likelihood for the mixture model.

W. von der Linden, R. Fischer, V. Dose, various publications (2000)
Outlier detection – Background estimation
Result: a datum d_i is classified as pure background for p(B_i|d) > 0.5 and as signal-containing for p(B_i|d) < 0.5.
Outline
IV. Numerical Interlude: Evidence Integrals
Evidence Integrals
Bayesian inference depends on (high-dimensional) integration...
*J. Skilling 2004; MacKay 2006
Relation to statistical physics: partition function
Why is this integration difficult?
Evidence Integrals
Typical situation for expectation values <f>:
Typical situation for evidence calculation:
Evidence Integrals
Thermodynamic integration: slowly introduce the likelihood structure into the integrand (similar: parallel tempering, perfect tempering):

Z(λ) = ∫dx Γ^λ(x) π(x)

ln Z(λ=1) = ∫₀¹ dλ ∂ln Z(λ)/∂λ = ∫₀¹ dλ ∫dx ρ_λ(x) ln Γ(x) = ∫₀¹ dλ ⟨ln Γ(x)⟩_λ

with Z(0) = 1 and Z(1) = Evidence.
Evidence Integrals
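A sketch of thermodynamic integration on a toy problem with a known answer (prior π = N(0,1), likelihood Γ(x) = N(d=1; x, 1); the power posterior is Gaussian here, so direct sampling stands in for the MCMC one would use in general):

```python
import numpy as np

rng = np.random.default_rng(3)
def log_like(x):                                  # ln Gamma(x)
    return -0.5 * (x - 1.0)**2 - 0.5 * np.log(2.0 * np.pi)

lams = np.linspace(0.0, 1.0, 21)
means = []
for lam in lams:
    prec = 1.0 + lam                              # power posterior ~ pi * Gamma^lam
    x = rng.normal(lam / prec, np.sqrt(1.0 / prec), size=20000)
    means.append(log_like(x).mean())              # <ln Gamma(x)>_lambda

lnZ = np.trapz(means, lams)                       # int_0^1 <ln Gamma>_lambda dlambda
print(lnZ, "exact:", -0.25 - 0.5 * np.log(4.0 * np.pi))
```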
Evidence Integrals
Different approach: Nested sampling...see John's talks :-)
Outline
V. Experimental Design: Basic Concept; Nuclear Reaction Analysis
VI. 'Non-parametric' Models: Gaussian Processes
Experimental Design
Design of experiments:
- maximize the information gain of planned experiments
- best performance for various physical scenarios, hardware constraints, parameter constraints, financial budget

How to quantify the strength (weakness) of an experiment? How to quantitatively choose the experimental design parameter(s) (spectral bands, number and position of lines-of-sight, measurement times)?
Experimental Design
Bayesian Decision Theory
Decisions depend on consequences: one might bet on an improbable outcome if the payoff is large.

Utility functions: compare consequences via a utility quantifying the benefits of a decision.
Choice of action: a (e.g. the next accelerator energy E₀); outcome: o (e.g. the next yield d(E₀)); utility U(a,o).

Deciding amidst uncertainty: we are uncertain of what the outcome will be → average over the possible results.

Expected utility:
EU(a) = Σ_outcomes P(o|a) U(a,o)

The best choice maximizes EU: a* = arg max_a EU(a)
Information as utility. Goal: minimize the estimation uncertainty for the parameter(s) θ, i.e. maximize the information gain.

Utility function: Kullback-Leibler divergence K (generalizes the covariance matrix); a = action, d = expected data, θ = parameters:
Utility(d,a) = K(d,a) = ∫dθ p(θ|d,a) ln[ p(θ|d,a) / p(θ|I) ]

Putting it all together:
EU(a) = ∫dd p(d|a) K(d,a) = ∫dθ p(θ) ∫dd p(d|θ,a) K(d,a)

The best choice maximizes EU: a* = arg max_a EU(a).
High-dimensional integration necessary: Markov chain Monte Carlo methods.

Experimental Design
Putting it all together: Without initial data:
Experimental Design
Putting it all together: With previous data d (which has been measured before action a) :
Experimental Design
Keep in mind (marginalization): p(d_new|a,d) = ∫dθ p(θ|d) p(d_new|θ,a)
Putting it all together: D=D(a,d) With previous data d (which has been measured before action a ) :
Experimental Design
Keep in mind (marginalization): p(d_new|a,d) = ∫dθ p(θ|d) p(d_new|θ,a)
And now? Have a closer look...
Putting it all together: D=D(a) Only two terms are relevant: old posterior and new likelihood
Approximate by posterior samples (e.g. via nested sampling or rejection sampling)
Experimental Design
Approximate by posterior samples
Experimental Design
Just 1-dim integrals remain (no high-dim parameter integration!)
Approximate by posterior samples
Experimental Design
Tutorial example:
Model: d_i = a·cos(b·t_i) + ε_i with ε_i ~ N(0, 0.5)
Prior: p(a|I) flat in [0,10]; p(b|I) = c/b for b in [0.04, 25]
Measurement: d₁(t=1.5) = 1.6
Compute the best time t₂ for the next measurement...
Experimental Design
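A sketch of this tutorial example (assuming the noise specification means σ = 0.5; for fixed Gaussian noise the expected information gain reduces to the entropy of the one-dimensional predictive distribution, which is what the loop below maximizes):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, t1, d1 = 0.5, 1.5, 1.6

# prior samples: a flat in [0,10]; b ~ c/b on [0.04,25], i.e. log b uniform
M = 200000
a = rng.uniform(0.0, 10.0, M)
b = np.exp(rng.uniform(np.log(0.04), np.log(25.0), M))

# posterior samples after d1 via weighted resampling with the d1 likelihood
w = np.exp(-0.5 * ((d1 - a * np.cos(b * t1)) / sigma)**2)
idx = rng.choice(M, size=2000, p=w / w.sum())
a_p, b_p = a[idx], b[idx]

def pred_entropy(t):
    """Entropy of the predictive mixture (1/N) sum_j N(a_j cos(b_j t), sigma^2)."""
    y = a_p * np.cos(b_p * t)
    g = np.linspace(y.min() - 4 * sigma, y.max() + 4 * sigma, 400)
    p = np.exp(-0.5 * ((g[:, None] - y[None, :]) / sigma)**2).mean(axis=1)
    p /= sigma * np.sqrt(2.0 * np.pi)
    p /= np.trapz(p, g)                      # guard against discretization error
    return -np.trapz(p * np.log(p + 1e-300), g)

ts = np.linspace(0.0, 6.0, 121)
EU = [pred_entropy(t) for t in ts]           # only 1-dim integrals!
print("best next measurement time t2 ~", ts[int(np.argmax(EU))])
```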
Experimental Design
- Hydrogen isotope depth profiling using NRA (not too many other ways)
- Nuclear reaction d(³He,p)⁴He: background-free; the probed depth range depends on the beam energy
V. Alimov, J. Roth, NIMB, 2006
Experimental Design
Sequential design: which energy E₀ next?
- The depth profile has to be parametrized: piecewise constant, c(x) = {a₀, a₁, a₂, a₃, ...}, or analytically:
c(x) = a₀ exp[−(x/a₁)^(a₂)]
Experimental Design
Compute EU(E) and select the best energy. Integration by posterior sampling; CPU: 1-4 min.
Experimental Design
Optimize EU using posterior sampling (CPU: 1-4 min). Cycle: prediction - verification.
Experimental Design
Quantification of information:
- the value of measurements / diagnostics / experimental setups becomes accessible
- optimized measurement protocols (on-line sequential design)
• Linear and nonlinear problems can be tackled
• The necessary integrations are computationally demanding: high-dimensional integrals in data and parameter space; anything beyond greedy algorithms is (almost) unexplored (n-step-ahead designs)
• A considerable gain in efficiency and/or accuracy has been demonstrated
• Result-driven automated measurement strategies (→ robotics) are conceivable
Outline
VI. 'Non-parametric' Models: Gaussian Processes
Large Scale Models: Methods in Uncertainty Quantification
Methods applied in Uncertainty Quantification:
- Sampling
- Spectral expansion for simulators: Galerkin approach; stochastic collocation; discrete projection
- Surrogate models (emulators)
Emulators
A simulator is a model of a real process, typically implemented as a computer code. Think of it as a function taking inputs x and giving outputs y: x → Code → y(x).

An emulator is a statistical representation of this function, expressing knowledge/beliefs about what the output will be at any given input(s). It is built using prior information and a training set of model runs.

Use of emulators as surrogates for simulators: focus on Gaussian Processes.
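A minimal Gaussian-process emulator sketch (squared-exponential kernel with hand-set hyperparameters; a cheap analytic test function stands in for the simulator):

```python
import numpy as np

def kernel(X1, X2, amp=1.0, ell=0.4):        # squared-exponential covariance
    return amp**2 * np.exp(-0.5 * (X1[:, None] - X2[None, :])**2 / ell**2)

simulator = lambda x: np.sin(3.0 * x) + 0.5 * x   # stand-in for the code
X = np.linspace(0.0, 2.0, 8)                      # training runs (design points)
y = simulator(X)

K = kernel(X, X) + 1e-8 * np.eye(len(X))          # jitter: deterministic code
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

Xs = np.linspace(0.0, 2.0, 200)                   # prediction inputs
Ks = kernel(Xs, X)
mean = Ks @ alpha                                 # GP posterior mean
v = np.linalg.solve(L, Ks.T)
var = kernel(Xs, Xs).diagonal() - np.sum(v**2, axis=0)   # GP posterior variance
print(np.abs(mean - simulator(Xs)).max(), var.max())
```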
Literature: Gaussian Processes, Rasmussen (available online). [GP introduction slides adapted from [4].]
SOLPS-Application
SOLPS data base: 1500 parameters, collected by D. Coster.
Idea: exploit it for initialization and scans.
SOLPS-Data
• Result of the GP interpolation; color code: normalized variance
SOLPS-Application
• The present data base is "insufficient": physics-based line scans (power, density, temperature, ...) do not cover the input space (e.g. compared to an approximate Latin hypercube)
Closed-loop design
• Present data base insufficient + limited human resources
→ closed-loop data base generation (e.g. based on Bayesian experimental design)

• Design cycle:
- determine the best location(s) for the next simulation(s): utility
- recompute the uncertainty estimates
- check the design criteria: exit?
SOLPS-Application
Idea: Select input-space locations with largest information gain
SOLPS-Application
Input-space locations with the largest information gain: at infinity :-(
→ Need to define a region of interest (ROI) and 'integral' measures
Closed-loop design
Utility function: maximize the KL divergence
KL(N₀; N₁) = ½ log det(Σ₁ Σ₀⁻¹) + ½ tr[ Σ₁⁻¹ ( (μ₀−μ₁)(μ₀−μ₁)ᵀ + Σ₀ − Σ₁ ) ]
or (easier): minimize the variance over the ROI
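A direct implementation of this KL expression (illustrative inputs):

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL(N0; N1) for multivariate normals, as in the formula above."""
    S1inv = np.linalg.inv(S1)
    dmu = (mu0 - mu1)[:, None]
    term = np.trace(S1inv @ (dmu @ dmu.T + S0 - S1))
    return 0.5 * (np.log(np.linalg.det(S1 @ np.linalg.inv(S0))) + term)

mu0, S0 = np.zeros(2), np.eye(2)
mu1, S1 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(kl_gauss(mu0, S0, mu1, S1))
```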
Closed-loop design
Iterated closed-loop design: [Figure: function + estimated variance; deviation from the GP model.]
Closed-loop design
Reduction of the integrated deviation: [Figure: convergence curves for varying noise and shape.]
Closed-loop design
KL-ROI-measure → Variance measure
Gaussian Processes: Challenges
• Scaling: O(N³): not yet prohibitive
• Limitation: 'global' scale of the covariance matrix → work in progress (hyperparameter optimization, adaptive segmentation)
• Correlated output:
- standard approach: independent scalar response variables; drawback: prediction not satisfactory (co-variance is ignored)
- difficulty: co-kriging is extremely expensive
Conclusion
• Bayesian probability theory provides a consistent and transparent approach to the cycle of scientific inference
• Incorporation of available prior information is straightforward
• The drawback of numerical complexity is mitigated by new algorithms and increasing computing power
• State of the art: parameter estimation, model comparison
• Still (largely) unexplored:
- the potential of experimental design, e.g. robotics, self-adapting 'measurement' strategies on computer simulations (e.g. automated MD potential generation)
- novelty detection in large-scale simulations / experiments
Thank you for your attention
Integrated data analysis (IDA): Soft X-ray ⊗ Thomson scattering = integrated result. Exploitation of nonlinear correlations.
Outlier detection without censoring: mixture approach. [Figure: AUG Thomson scattering profile with short-term fluctuations.]