Shinto Eguchi
The Institute of Statistical Mathematics
SOKENDAI Summer School, September 19-20, 2012, ISM
Maximum likelihood theory: one hundred years of statistics since its birth,
and a thought on future directions
Shinto Eguchi
Institute of Statistical Mathematics
Summer School at SOKENDAI (GUAS), 19-20 September 2012, ISM
Maximum likelihood theory
from Ronald Fisher in 1912
with discussion on future directions
Outline
MLE 100 years from Fisher 1912
Two cultures in statistics and AdaBoost
Information geometry
Poincaré conjecture and Optimal transport theory
Power entropy and divergence
Likelihood for equation model
R. A. Fisher
Ronald A. Fisher(1890-1962 )
Statistics: analysis of variance, maximum likelihood estimation, design of experiments, randomization test, linear discriminant analysis
Population genetics: Genetical Theory of Natural Selection, Wright-Fisher model, Fisher's equation, fundamental theorem of natural selection
R. A. Fisher digital archive, University of Adelaide
Maximum likelihood
MLE 1912-2012, Aldrich (1997)
Efficiency, sufficiency, Fisher information, ...
Boltzmann-Shannon entropy, Kullback-Leibler divergence
Max entropy distribution (exponential model)
Fisher (1912)
David Scott, Technometrics (2001)
Asymptotic efficiency
Statistical model
Maximum likelihood estimator
Asymptotic normality
Confidence region: $\Pr\{\theta \in C(r) \mid f(x,\theta)\} = 1 - \alpha$, where
$C(r) = \{\theta : (\theta - \hat\theta)^{\mathrm T}\, n I(\hat\theta)\, (\theta - \hat\theta) \le r^2\}$
Estimative distribution
Asymptotically efficient
Most cited statisticians
[Bar chart: citation counts for Fisher (1912)]
Cf. Physics, Medicine, Mathematics
Statistical curvature
Statistical model
Information geometry
Information metric
Mixture connection
Exponential connection
Rao (1949), Amari (1982)
m‐geodesic and e‐geodesic
m‐geodesic
e‐geodesic
A model $M$ is e-geodesic if $p, q \in M \ \Rightarrow\ x_t^{(\mathrm e)}(p,q) \in M \quad (t \in (0,1))$
A model $M$ is m-geodesic if $p, q \in M \ \Rightarrow\ x_t^{(\mathrm m)}(p,q) \in M \quad (t \in (0,1))$
Exponential model: $M^{(\mathrm e)} = \{\, p(x,\theta) = p_0(x)\exp\{\theta^{\mathrm T} t(x) - \kappa(\theta)\} \,\}$
Mean match model: $M^{(\mathrm m)} = \{\, p(x) : \mathrm E_p\{t(x)\} = \mathrm E_{p_0}\{t(x)\} \,\}$
[Diagram: densities p, q, r joined by m- and e-geodesics] Nagaoka-Amari (1982)
KL divergence (1951)
Kullback-Leibler divergence
AIC (Akaike, 1973)
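For reference, the standard definitions behind these names (not recovered from the slide itself):
$D_{\mathrm{KL}}(p,q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx \ \ge\ 0, \qquad \mathrm{AIC} = -2\log L(\hat\theta) + 2\,\dim(\theta).$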
MLE on Exponential family
Exponential family: $M^{(\mathrm e)} = \{\, p(x,\theta) = p_0(x)\exp\{\theta^{\mathrm T} t(x) - \kappa(\theta)\} \,\}$
Cumulant function: $\kappa(\theta) = \log \mathrm E_{p_0}\{\mathrm e^{\theta^{\mathrm T} t(x)}\}$
Mean parameter: $\eta(\theta) := \mathrm E_\theta\{t(x)\} = \dfrac{\partial}{\partial\theta}\kappa(\theta)$, and $\dfrac{\partial^2}{\partial\theta\,\partial\theta^{\mathrm T}}\kappa(\theta) = \mathrm{var}_\theta\{t(x)\}$
Log-likelihood function: $L(\theta) = \sum_{i=1}^{n}\log p(x_i,\theta) = n\{\theta^{\mathrm T}\bar t - \kappa(\theta)\}$, with $\bar t = \frac{1}{n}\sum_{i=1}^{n} t(x_i)$
Likelihood equation: $\eta(\hat\theta_{\mathrm{ML}}) = \bar t$
Mean match model: $M^{(\mathrm m)}(\eta) = \{\, p(x) : \mathrm E_p\{t(x)\} = \eta \,\}$
Maximum likelihood: $\max_\theta L(\theta) = n \max_\theta \{\theta^{\mathrm T}\bar t - \kappa(\theta)\} = n\,\kappa^*(\bar t)$
Barndorff-Nielsen (1978), Brown (1986)
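A tiny numeric illustration of the likelihood equation $\eta(\hat\theta_{\mathrm{ML}}) = \bar t$ (my own example, not from the slides): for the Poisson family in exponential form, $t(x) = x$ and $\kappa(\theta) = \mathrm e^{\theta}$, so $\eta(\theta) = \mathrm e^{\theta}$ and $\hat\theta_{\mathrm{ML}} = \log \bar t$:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.poisson(lam=3.0, size=1000)   # data with true theta = log 3
    theta_hat = np.log(x.mean())          # solves eta(theta) = exp(theta) = t-bar
    print(theta_hat, np.log(3.0))         # the two agree closely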
Pair of (model, estimation)
Model: $M^{(\mathrm e)}$
Estimation leaf: $M^{(\mathrm m)}$
Amari (1982)
Non‐exponential family
$f(x,\mu,\sigma,\nu) \propto \Bigl(1 + \dfrac{(x-\mu)^2}{\nu\sigma^2}\Bigr)^{-\frac{\nu+1}{2}}$
t‐distribution
Generalized Pareto distribution
Univariate
p‐variate
A unified view of the power exponential family
Gibbs‐Boltzmann‐Shannon
Josiah Willard Gibbs (1839-1903)
Claude E. Shannon (1916-2001)
Ludwig E. Boltzmann (1844-1906)
GBS-entropy: $H_{\mathrm{GBS}}(p) = -\int p(x)\log p(x)\,dx$
Log-likelihood: $\dfrac{1}{n}\sum_{i=1}^{n}\log p(x_i,\theta)\ \longrightarrow\ -H_{\mathrm{GBS}}(g,\, p(\cdot,\theta))$, the negative cross entropy against the data density $g$
Do we get a new estimator if we extend the GBS entropy?
Tsallis entropy: $S_q = \dfrac{k}{q-1}\Bigl(1 - \sum_{j=1}^{W} p_j^{\,q}\Bigr)$
Hill's diversity index in 1973:
Let $p_j$ be the relative frequency of a species, for $j = 1,\ldots,S$.
$D_a(p) = \Bigl(\sum_{j=1}^{S} p_j^{\,a}\Bigr)^{\frac{1}{1-a}}$
$\lim_{a\to 1}\log\{D_a(p)\} = -\sum_{j=1}^{S} p_j \log p_j = H(p)$
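A quick numeric check of this limit (the frequencies below are arbitrary illustrative values):

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])              # relative frequencies, S = 3
    H = -np.sum(p * np.log(p))                 # Shannon entropy H(p)
    for a in (0.9, 0.99, 0.999):
        log_D = np.log(np.sum(p ** a)) / (1.0 - a)
        print(a, log_D, H)                     # log D_a(p) -> H(p) as a -> 1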
Projective Entropy
γ-cross entropy
γ-diagonal entropy
Cf. Good (1972), Fujisawa-Eguchi (2008), Eguchi-Kato (2010)
γ-divergence
Remark: The diagonal entropy is equivalent to Tsallis entropy
Ferrari‐Yang (2010)
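For reference, the γ-cross entropy and γ-divergence of Fujisawa-Eguchi (2008) can be written in the projective (scale-invariant) form below; this is the published form as I understand it, not recovered from the slide:
$C_\gamma(g,f) = -\frac{1}{\gamma}\log\int g(x)\,f(x)^{\gamma}\,dx + \frac{1}{1+\gamma}\log\int f(x)^{1+\gamma}\,dx,$
$D_\gamma(g,f) = C_\gamma(g,f) - C_\gamma(g,g) \ \ge\ 0,$
with $D_\gamma(g, c f) = D_\gamma(g, f)$ for any $c > 0$, which is the scale invariance listed among the properties below.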
Three properties
Linearity
Scale invariance
Lower bound
Do these properties lead to uniqueness for C?
Theorem
Remark:
Proof
Shannon‐Khinchin axiom
Cf. Hanel and Thurner (2011)
Max γ-entropy
γ-entropy
Equal moment space
Max γ-entropy
γ-normal
Eguchi-Komori-Kato (2011)
γ-Estimation
Parametric model
γ-loss function
γ-estimator
Gaussian
γ-estimator
Remark
Escort normal family
Escort distribution
γ-estimation on the γ-normal model
γ-loss function
Pythagoras triangle
Loss decomposition
Pythagorean
MLE for normal model
Equal moment space
Max 0-entropy
Gaussian
MLE is the sample mean and variance
Likelihood
Teicher’s theorem (1961)
Let $g$ generate a location-scale family. The sample mean is the MLE of the location parameter
if and only if $g$ is Gaussian.
The γ-estimator leads to the γ-normal model
γ = 0 (Gaussian). Cf. Teicher (1961)
model / loss function
0-loss (log-likelihood) on the 0-model: MLE
γ-loss on the 0-model: M-estimator
γ-loss on the γ-model: γ-estimator
Emergence: (γ-estimator, γ′-model)
U-entropy and U-divergence. Cf. Eguchi-Kano (2001)
Let U be a convex and increasing function.
U-cross entropy, U-entropy, U-divergence
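As a reference point for these definitions (following the notation of Eguchi's U-divergence papers as I understand them; the slide's own displays are not recoverable), with $u = U'$ and $\xi = u^{-1}$:
U-cross entropy: $C_U(p, q) = \int \bigl\{U(\xi(q(x))) - p(x)\,\xi(q(x))\bigr\}\,dx$
U-entropy: $H_U(p) = C_U(p, p)$
U-divergence: $D_U(p, q) = C_U(p, q) - H_U(p) \ \ge\ 0$
The corresponding empirical U-loss for a model $p(x,\theta)$ is $L_U(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\xi(p(x_i,\theta)) + \int U(\xi(p(x,\theta)))\,dx$. For $U(s) = \exp(s)$ this reduces to the Kullback-Leibler divergence and the log-likelihood.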
Example 1. Boltzmann-Shannon entropy.
Example 2. Power entropy. Tsallis (1988), Basu et al (1998). Cf. Box-Cox (1964)
Example 3. Sigmoid entropy. Takenouchi-Eguchi (2004)
Tsallis entropy ($q = \beta + 1$)
max U‐entropy model
Let $p_0(x)$ be a pdf and let $t(x)$ be a statistic of dimension $d$.
Mean-equal space: $\Gamma(p_0) = \{\, p : \mathrm E_p\{t(X)\} = \mathrm E_{p_0}\{t(X)\} \,\}$
Consider the problem $\max_{p \in \Gamma(p_0)} H_U(p)$.
The Lagrangian $L(p,\theta,\tau) = H_U(p) + \int \{\theta^{\mathrm T} t(x) + \tau\}\{p(x) - p_0(x)\}\,dx$
leads to $\xi(p(x)) - \theta^{\mathrm T} t(x) = \text{const.}$
So $p(x) = U'(\theta^{\mathrm T} t(x) - \kappa_U(\theta))$; we call this the U-model.
U‐estimation
Note:
U‐estimator
Let $x_1,\ldots,x_n$ be from $g(x)$ and let $p(x,\theta)$ be a model function.
U‐loss function
Consistency
Asymptotic normality
U-method
U function
U-model
U-divergence, U-entropy, U-cross entropy
U-loss
MaxEnt / Empirical
U-estimator
(U-estimator, U-model)
(U-estimator, expo-model)
Pythagoras
Statistical analysis
Robustness
Moment estimate
model
loss function
U = exp: likelihood
U1‐loss
Exponential U1‐model
Robust M-estimate
MLE
Robust U1-estimator
Moment estimator
U1‐estimator under exp‐model
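To make the robustness column concrete, here is a minimal sketch of a power-type robust location estimator in the spirit of the U1-estimators above (the normal model with known σ = 1, the power β = 0.5, and the contaminated sample are all my illustrative choices): for the density-power loss, the μ-dependent part is $-\frac{1}{n}\sum_i \exp\{-\beta(x_i-\mu)^2/2\}$, which downweights outliers and recovers the MLE (the sample mean) as β → 0.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def power_loss(mu, x, beta):
        # mu-dependent part of the density-power (beta) loss, normal model
        return -np.mean(np.exp(-beta * (x - mu) ** 2 / 2.0))

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 10.0)])  # 5% outliers
    mu_hat = minimize_scalar(power_loss, args=(x, 0.5),
                             bounds=(-5.0, 5.0), method="bounded").x
    print(x.mean(), mu_hat)   # the mean is dragged toward 10; mu_hat stays near 0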
Minimax game
Cf. Grunwald and Dawid (2004)
$p^* = \operatorname{argmax}_{p \in \Gamma(p_0)} H_U(p)$, where $\Gamma(p_0)$ is the mean-equal space
Pattern recognition
Feature vector $x = (x_1,\ldots,x_p)$, class label $y \in \{-1,+1\}$
Classifier: $y = f(x)$
Training data: $D = \{(x_1,y_1),\ldots,(x_n,y_n)\}$
Statistical classifier: $f(x) = \operatorname{sign}(F(x))$
Set of weak learners
Decision stumps
Linear classifiers
ANNs, SVMs, k-NNs
Not stronger, but exhaustive!
1. Initial setting: $w_1(i) = 1/n \ \ (i = 1,\ldots,n)$, $F_0(x) = 0$
2. For $t = 1,\ldots,T$:
(a) $f^{(t)} = \operatorname{argmin}_{f \in \mathcal F}\ \epsilon_t(f)$, where $\epsilon_t(f) = \sum_{i=1}^{n} \mathrm I(y_i \ne f(x_i))\, w_t(i)$
(b) $\alpha_t = \dfrac{1}{2}\log\dfrac{1 - \epsilon_t(f^{(t)})}{\epsilon_t(f^{(t)})}$
(c) $w_{t+1}(i) = w_t(i)\exp(-\alpha_t\, y_i\, f^{(t)}(x_i))$
3. Output $\operatorname{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T} \alpha_t f^{(t)}(x)$
AdaBoost
Freund-Schapire (1997).
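A minimal runnable sketch of steps 1-3 (decision stumps as the weak learners; the stump search, the guard against zero error, and the weight renormalization are implementation choices of mine, not details from the slides):

    import numpy as np

    def stump_predict(X, j, s, sign):
        # weak learner f(x) = sign * (+1 if x_j > s else -1)
        return sign * np.where(X[:, j] > s, 1.0, -1.0)

    def fit_stump(X, y, w):
        # (a) choose the stump minimizing the weighted error eps_t(f)
        best = None
        for j in range(X.shape[1]):
            for s in np.unique(X[:, j]):
                for sign in (1.0, -1.0):
                    eps = np.sum(w * (stump_predict(X, j, s, sign) != y))
                    if best is None or eps < best[0]:
                        best = (eps, j, s, sign)
        return best

    def adaboost(X, y, T=50):
        n = len(y)
        w = np.full(n, 1.0 / n)                    # 1. w_1(i) = 1/n
        stumps, alphas = [], []
        for t in range(T):                         # 2. for t = 1, ..., T
            eps, j, s, sign = fit_stump(X, y, w)
            eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard the logarithm
            alphas.append(0.5 * np.log((1 - eps) / eps))                    # (b)
            stumps.append((j, s, sign))
            w = w * np.exp(-alphas[-1] * y * stump_predict(X, j, s, sign))  # (c)
            w = w / w.sum()                        # keep w_t a distribution
        return stumps, alphas

    def predict(X, stumps, alphas):
        # 3. sign(F_T(x)) with F_T(x) = sum_t alpha_t f_t(x)
        F = sum(a * stump_predict(X, j, s, sign)
                for a, (j, s, sign) in zip(alphas, stumps))
        return np.sign(F)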
Learning process
Training data $\{(x_1,y_1),\ldots,(x_n,y_n)\}$; weights $w_1(1),\ldots,w_1(n)$ are updated to $w_2(1),\ldots,w_2(n)$, and so on up to $w_T(1),\ldots,w_T(n)$, while weak classifiers $f^{(1)}(x), f^{(2)}(x),\ldots,f^{(T)}(x)$ are fitted in turn.
Final decision by $F_T(x) = \sum_{t=1}^{T} \alpha_t f^{(t)}(x)$
Learning curve
[Plot: training error rate (about 0.05-0.2) against iteration (up to 250)]
Stopping rule
Training data
Final decision function
$F_T(x) = \sum_{t=1}^{T} \alpha_t f^{(t)}(x)$
How do we find T?
10‐fold CV Error rate
$D = \{(x_i, y_i)\}_{i=1}^{n}$
[Diagram: the data are split into folds $1,\ldots,10$; each fold serves once as the validation set and the rest as training, giving error rates $\epsilon^{(1)},\ldots,\epsilon^{(10)}$]
Averaging by $\frac{1}{10}\sum_{k=1}^{10} \epsilon^{(k)}$
$F_T(x) = \sum_{t=1}^{T} \alpha_t f^{(t)}(x)$
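One way to carry out this selection in code (a sketch assuming scikit-learn is available; its AdaBoostClassifier stands in for the algorithm above, and staged_predict gives the decision after each round t):

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import KFold

    def cv_choose_T(X, y, T_max=250, n_splits=10, seed=0):
        # averaged 10-fold CV error for each number of rounds t = 1..T_max
        errs = np.zeros(T_max)
        for train, val in KFold(n_splits, shuffle=True,
                                random_state=seed).split(X):
            clf = AdaBoostClassifier(n_estimators=T_max).fit(X[train], y[train])
            for t, y_hat in enumerate(clf.staged_predict(X[val])):
                errs[t] += np.mean(y_hat != y[val]) / n_splits
        return int(np.argmin(errs)) + 1   # T minimizing the averaged CV error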
Update of exp‐loss
Exponential loss functional: $L_{\exp}(F) = \dfrac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(x_i)\}$
Consider one update as $F(x) \mapsto F(x) + \alpha f(x)$. Then
$L_{\exp}(F + \alpha f) = \dfrac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(x_i)\}\exp\{-\alpha y_i f(x_i)\} = \dfrac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(x_i)\}\bigl\{\mathrm e^{\alpha}\,\mathrm I(f(x_i) \ne y_i) + \mathrm e^{-\alpha}\,\mathrm I(f(x_i) = y_i)\bigr\}$
$= L_{\exp}(F)\bigl\{\epsilon(f)\,\mathrm e^{\alpha} + (1 - \epsilon(f))\,\mathrm e^{-\alpha}\bigr\}$
where $\epsilon(f) = \dfrac{\sum_{i=1}^{n} \mathrm I(f(x_i) \ne y_i)\exp\{-y_i F(x_i)\}}{n\, L_{\exp}(F)}$
Sequential minimization
$L_{\exp}(F + \alpha f) = L_{\exp}(F)\{\epsilon(f)\,\mathrm e^{\alpha} + (1 - \epsilon(f))\,\mathrm e^{-\alpha}\}$
By the arithmetic-geometric mean inequality, $\epsilon(f)\,\mathrm e^{\alpha} + (1 - \epsilon(f))\,\mathrm e^{-\alpha} \ \ge\ 2\sqrt{\epsilon(f)\{1 - \epsilon(f)\}}$,
with equality iff $\epsilon(f)\,\mathrm e^{\alpha} = (1 - \epsilon(f))\,\mathrm e^{-\alpha}$, that is, at
$\alpha_{\mathrm{opt}} = \dfrac{1}{2}\log\dfrac{1 - \epsilon(f)}{\epsilon(f)}$
AdaBoost = sequential minimization of exp-loss
(a) $f^{(t)} = \operatorname{argmin}_{f \in \mathcal F}\ \epsilon_t(f)$
(b) $\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb R}\ L_{\exp}(F_{t-1} + \alpha f^{(t)})$, i.e. $\alpha_t^{\mathrm{opt}} = \dfrac{1}{2}\log\dfrac{1 - \epsilon_t(f^{(t)})}{\epsilon_t(f^{(t)})}$
(c) $w_{t+1}(i) = w_t(i)\exp\{-\alpha_t y_i f^{(t)}(x_i)\}$
$\min_{\alpha \in \mathbb R} L_{\exp}(F_{t-1} + \alpha f^{(t)}) = 2\,L_{\exp}(F_{t-1})\sqrt{\epsilon_t(f^{(t)})\{1 - \epsilon_t(f^{(t)})\}}$
Bayes risk consistency
Expected loss functional: $L_{\exp}^{\mathrm I}(F) = \sum_{y = \pm 1}\int \exp\{-y F(x)\}\,dP(x,y) = \int \bigl\{\mathrm e^{-F(x)}\, p(+1 \mid x) + \mathrm e^{F(x)}\, p(-1 \mid x)\bigr\}\, p(x)\,dx$
Optimal discriminant function: $F_{\mathrm{opt}}(x) = \dfrac{1}{2}\log\dfrac{p(+1 \mid x)}{p(-1 \mid x)}$
$L_{\exp}^{\mathrm I}(F) - L_{\exp}^{\mathrm I}(F_{\mathrm{opt}}) = \int \bigl\{\mathrm e^{-F(x)}\, p(+1 \mid x) + \mathrm e^{F(x)}\, p(-1 \mid x) - 2\sqrt{p(+1 \mid x)\, p(-1 \mid x)}\bigr\}\, p(x)\,dx$
$= \int \bigl\{\mathrm e^{-F(x)/2}\sqrt{p(+1 \mid x)} - \mathrm e^{F(x)/2}\sqrt{p(-1 \mid x)}\bigr\}^2\, p(x)\,dx \ \ge\ 0$
A simulation
Feature space: $[-1,1] \times [-1,1]$
[Plot: decision boundary on the feature space]
Set of linear classifiers
Linear classifier
Random generation
Learning process
[Six panels showing the decision boundary as boosting proceeds]
Iter = 1, train err = 0.21; Iter = 13, train err = 0.18; Iter = 17, train err = 0.10;
Iter = 23, train err = 0.10; Iter = 31, train err = 0.095; Iter = 47, train err = 0.08
Final decision
[Two panels: contour of F(x); sign(F(x))]
U‐boost for classification
U-boost:
Let (x, y) be a variable with feature vector x and label y
Decision rule
Step 1.
Step 2.
Murata et al (2004)
U‐Boost for density estimation
$f^* = \operatorname{argmin}_{f \in \mathcal W_U^*} L_U(f)$
U-loss function: $L_U(f) = -\dfrac{1}{n}\sum_{i=1}^{n} f(x_i) + \int_{\mathbb R^d} U(f(x))\,dx$
Dictionary of density functions: $\mathcal W = \{\, g(x) : g(x) \ge 0,\ \int g(x)\,dx = 1 \,\}$
Learning space = U-model: $\mathcal W_U^* = \mathrm{co}(\xi(\mathcal W)) = \bigl\{\, \sum_k \pi_k\, \xi(g_k(x)) : g_k \in \mathcal W \,\bigr\}$
Goal: find $f^*$. Let $f(x,\pi) = \sum_k \pi_k\, \xi(g_k(x))$; then $f(x, (0,\ldots,0,1,0,\ldots,0)) = \xi(g_k(x))$.
U‐Boost algorithm
(A) Find $f_1 = \operatorname{argmin}_{g \in \mathcal W} L_U(\xi(g))$
(B) Update $f_{k+1} = (1 - \epsilon_{k+1})\, f_k + \epsilon_{k+1}\, \xi(g_{k+1})$, where
$(\epsilon_{k+1}, g_{k+1}) = \operatorname{argmin}_{\epsilon \in (0,1),\ g \in \mathcal W}\ L_U\bigl((1 - \epsilon) f_k + \epsilon\, \xi(g)\bigr)$
(C) Select $K$, and set $\hat f = (1 - \epsilon_K)\, f_{K-1} + \epsilon_K\, \xi(g_K)$
Example 2. Power entropy ($\beta$).
If $\beta = 0$: $f^*(x) \propto \exp\bigl\{\sum_k \pi_k \log g_k(x)\bigr\}$
If $\beta = 1$: $f^*(x) = \sum_k \pi_k\, g_k(x)$ (mixture model), Klemelä (2007)
Friedman et al (1984)
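For the β = 1 (mixture) case, update (B) reduces to greedily mixing in one dictionary density at a time under an L2-type loss; here is a minimal sketch (the Gaussian dictionary, grid integration, and step-size grid are my illustrative choices, not details from the talk):

    import numpy as np
    from scipy.stats import norm

    def boost_density(x, means, sigma=1.0, K=10,
                      eps_grid=np.linspace(0.05, 0.5, 10)):
        grid = np.linspace(x.min() - 4, x.max() + 4, 2000)
        dx = grid[1] - grid[0]
        G_data = [norm.pdf(x, m, sigma) for m in means]     # g at the sample
        G_grid = [norm.pdf(grid, m, sigma) for m in means]  # g on the grid
        # L2-type U-loss: integral of f^2 minus twice the empirical mean of f
        loss = lambda fd, fg: np.sum(fg ** 2) * dx - 2.0 * np.mean(fd)
        k0 = int(np.argmin([loss(d, g) for d, g in zip(G_data, G_grid)]))  # (A)
        fd, fg = G_data[k0].copy(), G_grid[k0].copy()
        for _ in range(K - 1):                              # (B) greedy updates
            best = None
            for d, g in zip(G_data, G_grid):
                for e in eps_grid:
                    cand = loss((1 - e) * fd + e * d, (1 - e) * fg + e * g)
                    if best is None or cand < best[0]:
                        best = (cand, d, g, e)
            _, d, g, e = best
            fd, fg = (1 - e) * fd + e * d, (1 - e) * fg + e * g
        return grid, fg                                     # density on the grid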
Learning?
Boosting is a forward stagewise algorithm
[Plot: true density]
Inner step in the convex hull
$\mathcal W^* = \mathrm{convex}(\mathcal W)$, with dictionary $\mathcal W = \{\, g(x,\theta) : \theta \,\}$
[Diagram: $f^*(x)$ lies inside the convex hull spanned by dictionary elements $g(x,\theta_1),\ldots,g(x,\theta_7)$]
$f^* = \operatorname{argmin}_{f \in \mathcal W^*} L_U(f)$
Goal: $\hat f(x) = \hat\epsilon_1\, \xi(\hat g_1(x)) + \cdots + \hat\epsilon_k\, \xi(\hat g_k(x))$
Loss decomposition
$L_U\bigl((1 - \epsilon) f_k + \epsilon\, \xi(g)\bigr) = (1 - \epsilon)\, L_U(f_k) + \epsilon\, L_U(\xi(g)) - \Delta_U(\epsilon, f_k, g)$
where $\Delta_U(\epsilon, f_k, g) = \int \bigl\{(1 - \epsilon)\, U(f_k(x)) + \epsilon\, U(\xi(g(x))) - U\bigl((1 - \epsilon) f_k(x) + \epsilon\, \xi(g(x))\bigr)\bigr\}\,dx$
Note 1. (1) $\Delta_U(\epsilon, f_k, g) \ge 0$; (2) $\Delta_U(\epsilon, f_k, g) = 0 \iff f_k = \xi(g)$ (a.e.)
Min problem: $\min_{g \in \mathcal W}\ \bigl\{\epsilon\, L_U(\xi(g)) - \Delta_U(\epsilon, f_k, g)\bigr\}$
Make $L_U(\xi(g))$ smaller and $\Delta_U(\epsilon, f_k, g)$ larger in $g$ simultaneously.
Non-asymptotic bound
Theorem (Naito and Eguchi, 2012). Assume that the data distribution has a density $p(x)$. Then
$\mathrm E_p\, D_U(p, \hat f_K) \ \le\ \mathrm{FA}(p, \mathcal W) + \mathrm{EE}(p, \mathcal W) + \mathrm{IE}(K)$
$\mathrm{FA}(p, \mathcal W) = \inf_{f \in \mathcal W_U^*} D_U(p, f)$ (functional approximation)
$\mathrm{EE}(p, \mathcal W) = 2\, \mathrm E_p \sup_{g \in \mathcal W}\Bigl|\dfrac{1}{n}\sum_{i=1}^{n} \xi(g(X_i)) - \mathrm E_p\, \xi(g(X))\Bigr|$ (estimation error)
$\mathrm{IE}(K) = \dfrac{b_U\, c_U}{K}$, where $b_U, c_U$ are constants (iteration effect)
Remark. Trade-off between $\mathrm{FA}(p, \mathcal W)$ and $\mathrm{EE}(p, \mathcal W)$.
Statistical applications
U-Boost, (kernel) U-PCA, U-SVM
U-ICA (mixture)
U-AUC
U-Boosting density
Robustness (outliers): (exp-model, U-loss)
Redescendency (hidden structure): (U-model, U-estimate)
U-cluster
Evolve (U-model, U-loss)
Drawback of Boosting
○
because
Early stopping (Zhang-Yu, 2005), Klemelä (2008)
Breiman (1996)
○
○
Meinshausen and Bühlmann (2011), Städler-Bühlmann-van de Geer (2010)
Poincaré conjecture 1904
Every simply connected, closed 3-manifold is homeomorphic to the 3-sphere.
Ricci flow by Richard Hamilton in 1981; Perelman 2002, 2003
Optimal control theory due to Pontryagin and Bellman
Optimal transport
Wasserstein space
Optimal transport
Theorem (Brenier, 1991)
Talagrand inequality
Log-Sobolev inequality
Optimal transport theory is being extended to general manifolds.
Geometers consider not a space, but a distribution family on the space.
C. Villani (2010): Fields Medal for work on optimal transport theory
Fisher's equation in 1937
Fisher-Kolmogorov equation
Ecology, physiology, combustion, crystallization, plasma physics, phase-transition problems; cf. Wikipedia (reaction-diffusion process)
Reaction–diffusion systems
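For reference, the Fisher-Kolmogorov (Fisher-KPP) equation is
$\frac{\partial u}{\partial t} = D\,\frac{\partial^2 u}{\partial x^2} + r\,u\,(1 - u),$
which admits travelling-wave solutions with speed $c \ge 2\sqrt{rD}$.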
Statistical inference for noisy nonlinear ecological dynamic systems
Statistical algorithm
Sufficient statistic
Ricker map
Sampling
Statistical parameter
Simulated curved normal model
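A minimal sketch of the Ricker-map simulator behind this example, following Wood (2010): the latent dynamics are $N_{t+1} = r N_t \mathrm e^{-N_t + e_t}$ with $e_t \sim N(0, \sigma_e^2)$ and observations $y_t \sim \mathrm{Poisson}(\phi N_t)$; the summary statistics below are illustrative placeholders, not Wood's exact set:

    import numpy as np

    def ricker(r, sigma_e, phi, T=100, N0=1.0, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        N = np.empty(T)
        N[0] = N0
        for t in range(T - 1):
            N[t + 1] = r * N[t] * np.exp(-N[t] + sigma_e * rng.normal())
        return rng.poisson(phi * N)        # observed counts y_1, ..., y_T

    def summaries(y):
        # summary statistics s(y); a Gaussian model for s(y) across simulations
        # yields the simulated ("synthetic") likelihood of (r, sigma_e, phi)
        return np.array([y.mean(), y.var(), np.mean(y[1:] * y[:-1]),
                         np.mean(y == 0)])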
Conclusions and Future directions
Two cultures
Optimal transport theory
Simulated likelihood
Simulation modeling
Monte Carlo search
MLE 1912
From log to power function
Dynamics and statistics
MCMC
Big data
Release Fisher’s mind control