Stochastic modelling and quasi-random numbers


Stochastic models + quasi-random
(Teytaud, Tao (Inria), Lri (Paris-Sud), UMR-Cnrs 8623, France;
OASE Lab, NUTN, Taiwan)

First part: randomness.

What is a stochastic / randomized model

Terminology, tools

Second part: quasi-random points

Random points can be very disappointing

Sometimes quasi-random points are better

Useful maths
we will need these tools...

Prime numbers: 2, 3, 5, 7, 11, 13, 17, ...

P(A|B): conditioning in probability.

P(dice=1 | dice in {1,2,3} ) ?

P(dice=3 | dice in {1,2} ) ?

Frequency in data x(1), x(2), ..., x(n):

1,2,6,3,7: frequency(odd) ?

frequency ( x(i+1) > x(i) ) ?

frequency ( x(i+1) > 3 | x(i) < 4 ) ?

Modulo: 4%3, 5%3, 7%10, (-1)%4 ? 0%1 ?

Let's take time to understand random simulations

I guess you all know how to simulate a random variable uniform in [0,1]

e.g. double u=drand48();

But do you know how to simulate one year of weather in Tainan ?

Not so simple.

Let's see this in more detail.

Random sequence
in dimension 1

What is a climate model ?

Define:

w1 = weather at time step 1

w2 = weather at time step 2

w3 = weather at time step 3

w4 = weather at time step 4

==> let's keep it simple: let's define the weather by one single number in [0,1] (think of temperature, or anything you want...)

I want a generative model

Just as I can repeat u=drand48() and generate a sample u1, u2, u3, I want to be able to generate

W1=(w11,w12,w13,...,w1T)

W2=(w21,w22,w23,...,w2T)

W3=...

==> think of a generator of curves

Random sequence
in dimension 1

What is a climate model ?

Define:

w1 = weather at time step 1. The model tells you what w1 can be. For example, it gives the density function g: P(w1 in I) = integral of g on I


Take-home message number 1:

a random variable w on R is entirely defined by P(w in I) for each interval I


Random sequence
in dimension 1

P(w1 in I) = integral of g on I. But if each w(i) is drawn independently from g: w4 very high and w5 very low is unrealistic; but in this model it happens very often!

Generating w(i): also easy with the inverse cumulative distribution ?

Realistic: large-scale variations!

Unrealistic: average value almost constant

So what can we do ?

A good model should not give the (independent) distribution of w2, but the distribution of w2 conditionally on w1 !

w1 = invG1( drand48() );
w2 = invG2( w1, drand48() );
w3 = invG3( w2, drand48() );

==> does it make sense ? This is a Markov chain. w1, w2, w3 should NOT be generated independently!
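A minimal Octave/Matlab sketch of this order-1 construction (the conditional inverse CDF invG2 below is an assumed toy model: a Gaussian step around yesterday's value, clipped to [0,1]; it is not a real climate model):

T = 365;
w = zeros(1, T);
invnorm = @(u) sqrt(2) * erfinv(2*u - 1);          % inverse Gaussian CDF
w(1) = rand();                                     % w1 = invG1( drand48() )
for t = 2:T
  u = rand();
  v = 0.9 * w(t-1) + 0.05 + 0.1 * invnorm(u);      % invG2( w(t-1), u )
  w(t) = min(max(v, 0), 1);                        % clipping = a simplification
end
plot(w);   % shows large-scale variations, unlike independent draws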

Variant

A good model should not give the (independent) distribution of w2, but the distribution of w2 conditionally on w1 !

w1 = invG1( drand48() );
w2 = invG2( w1, drand48() );
w3 = invG3( w2, w1, drand48() );
w4 = invG4( w3, w2, drand48() );
w5 = invG5( w4, w3, drand48() );

==> order-2 Markov chain
==> let's stay at order 1 for today

Let's see an example

Assume that we have a plant.

This plant is a function:

(Production,State,Benefit) = f( Demand , State , Weather )

Demand = g(weather,economy,noise)

(where Economy is the part of the economy which does not depend too much on the weather)

Benefit per year

= expectation of the sum of f3 (= benefit, the third output of f) over one year

Graphically

Weather:

w1, w2, w3, w4, w5, ...
==> random sequence
==> we assume a distribution of w(i) | w(i-1)
==> this is a Markov model ( forget w(i-2) )

Economy

e1, e2, e3, e4, e5, ...
==> random sequence
==> we assume a distribution of e(i) | e(i-1)

Noise = given distribution

==> n1, n2, n3, ....

Graphically

(figure: dependency graph with three chains of nodes, w1 --> w2 --> w3 --> w4 --> w5, e1 --> e2 --> e3 --> e4 --> e5, d1 --> d2 --> d3 --> d4 --> d5; an arrow means: dependency; each d(i) also depends on e(i) and w(i))

The model should tell you how to generate d2, given d1, e2, w2. (ei, di, wi) is a Markov chain. (di) is a hidden Markov chain: part of the state is hidden.

How to build a
stochastic model ?

It's about uncertainties

Even without hidden models, it's complicated

We have not discussed how to design a stochastic model (typically from a historical archive):

Typically, discretization: w(k) in I1 or I2 or I3, with I1 = ]-inf, a], I2 = ]a, b], I3 = ]b, +inf[

G(w, w') = frequency of transitions w(k) = w --> w(k+1) = w' in the archive; typically, extreme values are more rare in models than in reality
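A minimal Octave/Matlab sketch of this estimation (the "archive" w below is fabricated synthetic data, only to make the snippet self-contained; m = 10 intervals is an arbitrary choice):

w = cumsum(randn(1, 10000));                % fake archive (synthetic data!)
w = (w - min(w)) / (max(w) - min(w));       % rescale to [0,1]
m = 10;                                     % number of intervals I1..Im
bin = min(floor(w * m) + 1, m);             % interval index of each w(k)
G = zeros(m, m);
for k = 1:(length(w) - 1)
  G(bin(k), bin(k+1)) = G(bin(k), bin(k+1)) + 1;   % count w(k) -> w(k+1)
end
G = G ./ max(sum(G, 2), 1);                 % row-normalize: G(i,j) ~ P(Ij | Ii)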

Check the extreme events

Usually, it's good to have more extreme values than the data (because all models tend to make them too rare...).

Example: French climate

France has a quiet climate

No big wind

No heavy rains

No heat wave

But:

2003: huge heat wave. 15 000 died in France.

1999: hurricane-like winds (96 died in Europe; gusts at 169 km/h in Paris)

1987: huge rain falls (96 mm in 24 hours)

6.2 times more deaths than the 921 earthquake!

Example: 2003 heat wave

Paris:

9 days with max temp. > 35C

1 night with no less than 25.5C. European countries were not ready for this.

Example: 2003 heat wave
==> plenty of take-home messages

Bad model: air conditioning sometimes automatically stopped, because such high temperatures were considered as measurement bugs ==> extreme values neglected

Heat wave + no wind ==> increased pollution

==> old people die (babies carefully protected...)
==> pollution and temperature are not independent

Risk was underestimated:

Maybe (probably ?), climate change had an impact: the archive is not trustable

The heat might stop electricity production:
- close to electricity breakdown, due to correlations demand/production
- how many people would die in such a case ?

Example: 2003 heat wave
==> plenty of take-home messages

Be careful with neglected extreme values

==> extreme values are not always measurement bugs
==> removing air conditioning because it's too hot... (some systems were not ready for such extreme temperatures)


Example: 2003 heat wave
==> plenty of take-home messages

Be careful with neglected extreme values

==> extreme values are not always measurement bugs

Independence is a very strong assumption

P(A) = 0.01 and P(B) = 0.01; what is P(A and B) ?

do NOT answer 0.0001 ! If A and B are correlated (e.g. a heat wave and a demand peak), P(A and B) can be anywhere between 0 and 0.01.

Archive = not always trustable (in particular, weather)

Quasi-random points
(Teytaud, Tao (Inria), Lri (Paris-Sud), UMR-Cnrs 8623;
collabs with S. Gelly, J. Mary, S. Lallich, E. Prudhomme,...)

Quasi-random points ?

Dimension 1

Dimension n

Better in dimension n

Strange spaces

Quasi-random points ?

Why do we need random / quasi-random points ?

Numerical integration [thousands of papers; Niederreiter 92]:

integral(f) nearly equal to (1/n) * sum of the f(xi) (see the sketch after this list)

Learning [Cervellera et al, IEEE TNN 2004; Mary, PhD thesis, 2005]

Optimization [Teytaud et al, EA'2005]

Modelization of random processes [Growe-Kruska et al, IEEE BPTP'03]

Path planning [Tuffin]
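For instance, a minimal Octave/Matlab sketch of such an estimate (the test function f and its closed-form integral are our own illustrative choice):

f = @(x) exp(-x(:,1)) .* sin(pi * x(:,2));   % an arbitrary test integrand
n = 1e5;
x = rand(n, 2);              % n random points; quasi-random also works here
estimate = mean(f(x));       % (1/n) * sum f(xi) ~ integral of f on [0,1]^2
% exact value for comparison: (1 - exp(-1)) * 2/pi = 0.4024...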

Where do we need numerical integration ?

Just everywhere. Expected pollution (= average pollution...) = integral of possible pollutions as a function of many random variables (weather, defects in parts, gasoline, use of the car...)

Take-home message

When optimizing the design of something which is built in a factory, take into account the variance in the production system ==> all cars are different.

==> very important effect

==> real part != specifications

Why do we need numerical integration ?

Expected benefit (= average benefit...) = integral of possible benefits as a function of many random variables (weather, prices of raw materials...)

==> economic benefit (company)

==> overall welfare (state)

Why do we need numerical integration ?

Risk (=probability of failure...)= integral of possible

failures as a function of many randomvariables(quakes, flood, heat waves, electricity breakdowns, human error...)

Take-home message

Human error must be taken into account:

- difficult to model
- e.g. a minimum probability that action X is not performed (for all actions) (or that unexpected action Y is performed) (what about an adversarial human ?)

==> protection by independent validations

Why do we need numerical integration ?

Expected benefit as a function of many prices/random variables

Expected efficiency depending on machining vibrations

Evaluating schedules in industry (with random events like faults, delays...) (e.g. processors)

How to know if some points
are well distributed ?

I propose N points x=(x1,...,xN)

How to know if these points are well distributed ?

A naive solution:

f(x) = max over y of ( min over i of ||y - xi|| ) (to be minimized: every y should be close to some xi)

(naive, but not always so bad)

How to know if some points
are well distributed ?

I propose N points x=(x1,...,xN)

How to know if these points are well distributed ?

A naive solution:

g(x) = min over i of ( min over j != i of ||xj - xi||_2 ) (to be maximized)

= dispersion (naive, but not always so bad)

Low Discrepancy ?

Discrepancy = sup over rectangles R of | Area(R) - Frequency(R) |

Low Discrepancy ?

Discrepancy_2 = mean over rectangles R of | Area(R) - Frequency(R) |^2

Is there better than random points for low discrepancy ?

Random --> Discrepancy ~ sqrt( 1/n )
Quasi-random --> Discrepancy ~ log(n)^d / n
Quasi-random with N known --> Discrepancy ~ log(n)^(d-1) / n

Koksma & Hlawka : error in Monte-Carlo integration <= Discrepancy x V

V = total variation (Hardy & Krause) (many generalizations in Hickernell, A Generalized Discrepancy and Quadrature Error Bound, 1998)

==> sometimes V or log(n)^d is huge
==> don't always trust QR

Dimension 1

What would you do ?

--> Van Der Corput

n=1, n=2, n=3...

n=1, n=10, n=11, n=100, n=101, n=110... (p=2)

x=.1, x=.01, x=.11, x=.001, x=.101, x=.011... (binary!)

Dimension 1

What would you do ?

--> Van Der Corput

n=1, n=2, n=3...

n=1, n=2, n=10, n=11, n=12, n=20... (p=3)

x=.1, x=.2, x=.01, x=.11, x=.21, x=.02... (ternary!)

Dimension 1 more general

p=2, but also p=3, 4, ...

but p=13 is not very nice: the first 12 points are just 1/13, 2/13, ..., 12/13, in increasing order
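A minimal Octave/Matlab sketch of the sequence (vdc is our helper name, not notation from the slides):

function x = vdc(n, p)
  % reverse the base-p digits of n behind the point: n=6=110 (p=2) -> .011 = 3/8
  x = 0; b = 1/p;
  while n > 0
    x = x + mod(n, p) * b;    % take the last digit of n, mirror it
    n = floor(n / p);
    b = b / p;
  end
end
% arrayfun(@(n) vdc(n,2), 1:6)  gives  1/2 1/4 3/4 1/8 5/8 3/8
% with p=13, the first 12 points are 1/13, 2/13, ..., 12/13: monotone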

Dimension 2: maybe just
use two Van Der Corput sequences with same p ?

x --> (x,x) ? ==> no: all the points fall on the diagonal!

Dimension 2

x --> (x,x') ? with two different bases.

Dimension 2 or n : Halton

x --> (x,x') with diff. prime numbers is ok

(needs maths...) (as small numbers are better, use the n smallest primes for dimension n...)
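A sketch of the resulting Halton points in dimension 2, reusing the vdc helper above:

N = 200;
p = [2 3];                    % the two smallest primes for dimension 2
pts = zeros(N, 2);
for n = 1:N
  pts(n, :) = [vdc(n, p(1)), vdc(n, p(2))];
end
plot(pts(:,1), pts(:,2), '+');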

Dimension n+1 : Hammersley

(n/(N+1), xn, x'n) --> closed sequence

(i.e., the number N of points is known in advance)

Dimension n : the trouble

There are not so many small prime numbers

Dimension n : scrambling
(here, random comes back)

Pi(p) : [1,p-1] --> [1,p-1]

Pi(p) applied to coordinate with prime number p

Dimension n : scrambling

Pi(p) : [1,p-1] --> [1,p-1] (randomly chosen)

Pi(p) applied to the coordinate built with prime p (there exist much more complicated scramblings)
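A sketch of this scrambling in its simplest form, one random permutation per base (an assumption: as on the slide, Pi(p) permutes the nonzero digits [1,p-1], and digit 0 stays fixed):

function x = vdc_scrambled(n, p, perm)
  % perm: a permutation of 1..p-1 (e.g. perm = randperm(p-1)); digit 0 is fixed
  x = 0; b = 1/p;
  while n > 0
    d = mod(n, p);
    if d > 0, d = perm(d); end     % scramble the digit through Pi(p)
    x = x + d * b;
    n = floor(n / p);
    b = b / p;
  end
end
% one permutation per prime: perm13 = randperm(12);
% pts = arrayfun(@(n) vdc_scrambled(n, 13, perm13), 1:100);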

Beyond low discrepancy ?

Other discrepancies : why rectangles ?

Other solutions : lattices x(n) = (x0 + n*a) modulo 1

(very fast and simple)
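A minimal sketch of such a lattice (the generator a, built from irrational square roots, and the random shift x0 are classical but assumed choices):

N = 200;
a = [sqrt(2), sqrt(3)];                      % one irrational per coordinate
x0 = rand(1, 2);                             % random shift
x = mod(repmat(x0, N, 1) + (1:N)' * a, 1);   % x(n) = (x0 + n*a) mod 1
plot(x(:,1), x(:,2), '+');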

Let's see very different approaches:

Low discrepancy for other spaces than [0,1]^n

Stratification

Symmetries

Why in the square ?

Other spaces/distributions: Gaussians, sphere

Some animals are quite good for low-discrepancy

Why in the square ?

Uniformity in the square is ok

But what about Gaussian distributions ?

x in ]0,1[^d

y(i) such that P( N > y(i) ) = x(i), with N standard Gaussian

then y is quasi-random and gaussian

==> so you can have

quasi-random Gaussian numbers
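In Octave/Matlab this is a one-liner, since the Gaussian inverse CDF can be written with the built-in erfinv (we use the equivalent form P( N <= y(i) ) = x(i); by symmetry the output is Gaussian either way):

x = rand(1000, 2);                % replace rand by Halton / lattice points
y = sqrt(2) * erfinv(2*x - 1);    % y(i) Gaussian: P( N <= y(i) ) = x(i)
plot(y(:,1), y(:,2), '+');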

Why in the square ?

Other n-dimensional random variables by the conditioning trick. Consider a QR point (x1, ..., xn) in [0,1]^n. You want to simulate z with distribution Z:

z1 = inf { z ; P(Z1 <= z) >= x1 } = invG1(x1)
z2 = inf { z ; P(Z2 <= z | Z1 = z1) >= x2 } = invG2(z1, x2)
z3 = inf { z ; P(Z3 <= z | Z1 = z1, Z2 = z2) >= x3 } = invG3(z1, z2, x3)
...

==> ok for strange spaces or variables!
==> QR: choose the best ordering of variables (most important variables first)

Why in the square ?

Theorem: If x is random([0,1]^n),

then z is distributed as Z !

==> convert the uniform square into strange spaces or variables

Why not for random walks ?

500 steps of random walks ==> huge dimension

Quasi-random basically does not work in huge dimension

But the first coordinates of QR are ok; just use them for the most important coordinates! ==> change the order of variables and use conditioning !

coord1 for the 250th point, coord2 for the first point, coord3 for the 500th point.

Why not for random walks ?

Quasi-random number x in R^500

(e.g. Gaussian)

Change order: y(250) first ( y(250) <--- x(1) )

y(1) | y(250) <--- x(2)

y(500) | y(1) and y(250) <--- x(3)

Why not for random walks ?

500 steps of random walks ==> huge dimension

But strong derandomization is possible: start with y(250), then y(1), then y(500), then y(125), then y(375)...
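This reordering is, in effect, the classical Brownian-bridge construction; here is a sketch for a Gaussian random walk (T = 512 steps, a power of 2, is an illustrative choice; the first Gaussian fixes the endpoint, the next one the midpoint, and so on):

T = 512;
x = rand(1, T);                      % one point in [0,1]^T (QR or random)
g = sqrt(2) * erfinv(2*x - 1);       % uniform -> standard Gaussian
B = zeros(1, T+1);                   % B(i) = walk at time i-1; B(1) = 0
B(T+1) = sqrt(T) * g(1);             % endpoint first: most important value
k = 2;                               % next Gaussian coordinate to consume
step = T;
while step > 1
  half = step / 2;
  for left = 1:step:T                % fill the midpoint of each interval
    right = left + step;
    mid = left + half;
    mu = 0.5 * (B(left) + B(right)); % conditional mean of the bridge
    sd = sqrt(half / 2);             % conditional standard deviation
    B(mid) = mu + sd * g(k);
    k = k + 1;
  end
  step = half;
end
plot(B);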

Why not for random walks ?

500 steps of random walks ==> huge dimension

But strong derandomization is possible:

see e.g. Hickernell 1998 for nice generalizations; drawback = rewriting the model

Very different approaches for derandomization ?

Symmetries : instead of x1 and x2 in [0,1], try x and 1-x

Or more generally, just draw n/2 points and use their symmetries
==> in dimension d, n/2^d points and their 2^d symmetries

2^d symmetries of n/2^d points better than n points!

==> antithetic variables
==> roughly, it is almost always better, whereas quasi-random might be disappointing

Symmetries / antithetic variables in Octave/Matlab

x=rand(800,2);                 % plain random points
subplot(2,2,1); plot(x(:,1),x(:,2),'+');

x=rand(400,2);                 % 400 points + their central symmetries
x=[x; 1-x];
subplot(2,2,2); plot(x(:,1),x(:,2),'+');

x=rand(200,2);                 % 200 points + 3 mirrored copies
x=[x; 1-x; x(:,1),1-x(:,2); 1-x(:,1),x(:,2)];
subplot(2,2,3); plot(x(:,1),x(:,2),'+');

x=rand(100,2);                 % 100 points, mirrored, then axes swapped
x=[x; 1-x; x(:,1),1-x(:,2); 1-x(:,1),x(:,2)];
x=[x; x(:,2),x(:,1)];
subplot(2,2,4); plot(x(:,1),x(:,2),'+');

Free !


Very different approaches for derandomization ?

Control : instead of estimating E f(x)

Choose g looking like f and estimate E (f-g)(x)

Then E f = E g + E(f-g) : much better

Troubles:

You need a good g

You must be able to evaluate E g
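A minimal sketch with an assumed toy pair: f(x) = exp(x) on [0,1], controlled by g(x) = 1 + x, whose expectation E g = 3/2 is known exactly:

f = @(x) exp(x);
g = @(x) 1 + x;                      % E g = 3/2 in closed form
n = 1e4;
x = rand(n, 1);
plain   = mean(f(x));                % direct Monte-Carlo
control = 3/2 + mean(f(x) - g(x));   % E f = E g + E(f-g)
% var(f(x) - g(x)) << var(f(x)), so `control` is much more accurate
% (exact value: e - 1 = 1.7183...)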

Very different approaches for derandomization ?

Pi-estimation (importance sampling) : instead of estimating E f(x), with x of density d

Look for y with density pi proportional to |f| d

Then E f(x) = E [ f(y) d(y) / pi(y) ] ==> variance is much better

Troubles:

You have to generate y

You have to know pi (i.e. |f| d, up to normalization)
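A minimal sketch on an assumed toy problem: the rare-event probability P(X > 3) for X standard Gaussian (true value about 1.35e-3), where the density ratio d(y)/pi(y) has a closed form:

n  = 1e5;
mu = 3;                        % shift the sampling density toward the event
y  = mu + randn(n, 1);         % y ~ N(mu,1): the importance density pi
w  = exp(mu^2/2 - mu*y);       % density ratio d(y) / pi(y)
est = mean( (y > 3) .* w );    % E f(x) = E [ f(y) d(y)/pi(y) ]
% plain Monte-Carlo, mean(randn(n,1) > 3), has far higher relative variance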

Very different approaches for derandomization ?

Stratification (jittering) :

Instead of generating n points i.i.d

Generate

k points in stratum 1

k points in stratum 2

...

k points in stratum m, with m*k = n
==> more stable
==> depends on the choice of strata

Simple stratification

x=rand(40,2);
subplot(1,2,1); plot(x(:,1),x(:,2),'+');

x=[ 0.5*rand(10,2);                         % 10 points in [0,.5]^2
    0.5+0.5*rand(10,2);                     % 10 points in [.5,1]^2
    0.5*rand(10,1), 0.5+0.5*rand(10,1);     % 10 points in [0,.5]x[.5,1]
    0.5+0.5*rand(10,1), 0.5*rand(10,1) ];   % 10 points in [.5,1]x[0,.5]
subplot(1,2,2); plot(x(:,1),x(:,2),'+');


Summary on MC improvements ?

In many books you will read that quasi-random points are great.

Remember that people who spend their life studying quasi-random numbers will rarely conclude that all this was a bit useless.

Sometimes it's really good.

Sometimes it's similar to random.

Modern Quasi-Monte-Carlo methods (randomized) are usually at least as good as random methods ==> no risk.

Jittering / strata / symmetry usually very good.

Summary on MC improvements ?

Carefully designing the model (from data) is often more important than the randomization.

In particular, neglecting dependencies is often a disaster.

Yet, there are cases in which improved MC are the key.

Remarks on random search: dispersion much better than discrepancy...

Biblio (almost all on google)

Monte-Carlo textbooks cover pi-estimation, stratification, symmetries, ...

Owen, A.B. "Quasi-Monte Carlo Sampling", A Chapter on QMC for a SIGGRAPH 2003 course.

Fred J. Hickernell, A generalized discrepancy and quadrature error bound, 1998

B. Tuffin, On the Use of low-Discrepancy sequences in Monte-Carlo methods, 1996

Matousek, Geometric Discrepancy (book, 1999)

these slides : http://www.lri.fr/~teytaud/btr2.pdf or http://www.lri.fr/~teytaud/btr2.ppt