Maximum likelihood theory: 100 years of statistics since its birth, and thoughts on future directions


Shinto Eguchi

Institute of Statistical Mathematics

SOKENDAI Summer Graduate School, 19-20 September 2012, ISM

Maximum likelihood theory: 100 years of statistics since its birth,

and thoughts on future directions

Shinto Eguchi

Institute of Statistical Mathematics

Summer graduate course (SOKENDAI), 19-20 September 2012, ISM

Maximum likelihood theory

from Ronald Fisher in 1912 

with discussion on future directions

Outline

MLE 100 years from Fisher 1912

Two cultures in statistics and AdaBoost

Information geometry

Poincaré conjecture and Optimal transport theory

Power entropy and divergence

Likelihood for equation model

R. A. Fisher

Ronald A. Fisher (1890-1962)

Statistics: analysis of variance, maximum likelihood estimation, design of experiments, randomization test, linear discriminant analysis

Population genetics: The Genetical Theory of Natural Selection, Wright-Fisher model, Fisher's equation, fundamental theorem of natural selection

R. A. Fisher Digital Archive, University of Adelaide

Maximum likelihood

MLE 1912-2012, Aldrich (1997)

Efficiency, sufficiency, Fisher information, …

Boltzmann-Shannon entropy, Kullback-Leibler divergence

Maximum-entropy distribution (exponential model)

Fisher (1912)

David Scott Technometrics (2001)

Asymptotic efficiency

Statistical model

Maximum likelihood estimator

Asymptotic normal

Confidence region: the coverage probability $\Pr\{\theta \in C_r(\boldsymbol x) \mid f(\cdot,\theta)\}$, where

$C_r(\boldsymbol x) = \{\theta : n(\hat\theta-\theta)^{\mathsf T} I(\hat\theta)(\hat\theta-\theta) \le r^2\}$

Estimative distribution

Asymptotically efficient
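To make the asymptotics concrete, here is a minimal numerical sketch of a Wald-type confidence region built from the MLE and the Fisher information; the exponential model, sample size, and radius r = 1.96 are illustrative assumptions rather than the slides' example.

```python
import numpy as np

# Sketch: Wald region C_r(x) = {theta : n (theta_hat - theta)^2 I(theta_hat) <= r^2}
# for the exponential model f(x, theta) = theta * exp(-theta x); illustrative choices only.
rng = np.random.default_rng(0)
n = 200
theta_true = 2.0                       # assumed true rate
x = rng.exponential(scale=1.0 / theta_true, size=n)

theta_hat = 1.0 / x.mean()             # MLE of the rate
fisher_info = 1.0 / theta_hat**2       # I(theta) = 1 / theta^2 for this model
r = 1.96                               # radius giving ~95% coverage in one dimension

half_width = r / np.sqrt(n * fisher_info)
print("MLE:", theta_hat)
print("Wald region:", (theta_hat - half_width, theta_hat + half_width))
```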

Most cited statisticians

[Chart: citation counts of the most cited statisticians; Fisher (1912)]
Cf. Physics, Medicine, Mathematics

Statistical curvature


Statistical model

Information geometry

Information metric

Mixture connection

Exponential connection

Rao (1949), Amari (1982)


m‐geodesic and e‐geodesic

m‐geodesic

e‐geodesic

A model $M$ is e-geodesic if $p, q \in M \;\Rightarrow\; p_t^{(e)}(x) \in M\ (t \in (0,1))$.

A model $M$ is m-geodesic if $p, q \in M \;\Rightarrow\; p_t^{(m)}(x) \in M\ (t \in (0,1))$.

Exponential model: $M^{(e)} = \{\,p(x,\theta) = p_0(x)\exp\{\theta^{\mathsf T} t(x) - \psi(\theta)\}\,\}$

Mean match model: $M^{(m)} = \{\,p(x) : \mathrm E_p\{t(x)\} = \mathrm E_{p_0}\{t(x)\}\,\}$

[Figure: densities p, q, r joined by m- and e-geodesics] Nagaoka-Amari (1982)

KL divergence (1951)

Kullback-Leibler divergence

AIC (Akaike, 1973)
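A small sketch connecting the two quantities just named: the Kullback-Leibler divergence between two normal densities in closed form, and AIC for a fitted Gaussian. The data-generating process and the candidate models are assumptions made up for illustration.

```python
import numpy as np

def kl_normal(mu0, s0, mu1, s1):
    """KL( N(mu0, s0^2) || N(mu1, s1^2) ), closed form."""
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

print("KL(N(0,1) || N(1,2)):", kl_normal(0, 1, 1, 2))

# AIC = -2 * max log-likelihood + 2 * (number of parameters), Akaike (1973)
rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=100)       # assumed data-generating process

def aic_gaussian(x, fit_mean=True):
    mu = x.mean() if fit_mean else 0.0
    s2 = np.mean((x - mu) ** 2)                    # MLE of the variance
    loglik = -0.5 * len(x) * (np.log(2 * np.pi * s2) + 1)
    k = 2 if fit_mean else 1
    return -2 * loglik + 2 * k

print("AIC, free mean :", aic_gaussian(x, True))
print("AIC, mean fixed:", aic_gaussian(x, False))
```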

MLE on Exponential family

Exponential family: $M^{(e)} = \{\,p(x,\theta) = p_0(x)\exp\{\theta^{\mathsf T} t(x) - \psi(\theta)\}\,\}$

Likelihood equation

Log-likelihood function: $L(\theta) = \sum_{i=1}^{n}\log p(x_i,\theta) = n\{\theta^{\mathsf T}\bar t - \psi(\theta)\}$, where $\bar t = \frac1n\sum_{i=1}^{n} t(x_i)$

Cumulant function: $\psi(\theta) = \log \mathrm E_{p_0}\{e^{\theta^{\mathsf T} t(x)}\}$

Mean parameter: $\eta(\theta) := \mathrm E_\theta\{t(x)\} = \frac{\partial}{\partial\theta}\psi(\theta)$, $\quad \frac{\partial^2}{\partial\theta\,\partial\theta^{\mathsf T}}\psi(\theta) = \mathrm{var}_\theta\{t(x)\}$

Likelihood equation: $\eta(\theta_{\mathrm{ML}}) = \bar t$, or $\theta_{\mathrm{ML}} = \eta^{-1}(\bar t)$

Mean match model: $M^{(m)}(\bar t) = \{\,p(x) : \mathrm E_p\{t(x)\} = \bar t\,\}$

Maximum likelihood: $\max_\theta L(\theta) = n\max_\theta\{\theta^{\mathsf T}\bar t - \psi(\theta)\} = n\,\psi^{*}(\bar t)$

Barndorff-Nielsen (1978), Brown (1986)
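A sketch of the likelihood equation $\eta(\theta_{\mathrm{ML}}) = \bar t$ for one concrete exponential family, the Bernoulli model with $t(x)=x$ and $\psi(\theta)=\log(1+e^{\theta})$; the model choice and the data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq

# Bernoulli model in natural form: p(x, theta) = exp{theta * x - psi(theta)}, x in {0, 1},
# with cumulant function psi(theta) = log(1 + e^theta) and mean parameter
# eta(theta) = psi'(theta) = e^theta / (1 + e^theta).
psi = lambda th: np.log1p(np.exp(th))
eta = lambda th: 1.0 / (1.0 + np.exp(-th))

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.7, size=500)        # assumed data
t_bar = x.mean()

# Likelihood equation: eta(theta_ML) = t_bar
theta_ml = brentq(lambda th: eta(th) - t_bar, -20, 20)
print("theta_ML (numerical)      :", theta_ml)
print("theta_ML (closed-form logit):", np.log(t_bar / (1 - t_bar)))
print("max L / n = psi*(t_bar)   :", theta_ml * t_bar - psi(theta_ml))
```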

Pair of (model, estimation)

[Diagram: the exponential model $M^{(e)}$ and the estimation leaves $M^{(m)}$] Amari (1982)

Non‐exponential family


$f(x,\mu,\sigma) \propto \bigl\{1 + \tfrac{(x-\mu)^2}{\nu\,\sigma^2}\bigr\}^{-(\nu+1)/2}$

t‐distribution 

Generalized Pareto distribution 

Univariate  

p‐variate  

A unified view of the power exponential family

Gibbs‐Boltzmann‐Shannon

Josiah Willard Gibbs (1839-1903)

Claude E. Shannon (1916-2001)

Ludwig E. Boltzmann (1844-1906)

Log-likelihood and GBS entropy: $\frac1n\sum_{i=1}^{n}\log p(x_i) \;\approx\; -H_{\mathrm{GBS}}(\,\cdot\,, p)$ (the empirical mean of the log density is the negative GBS cross entropy against the data distribution)

Do we obtain a new estimator if we extend the GBS entropy?

$S_q(p) = \dfrac{k}{q-1}\Bigl(1 - \sum_{j=1}^{W} p_j^{\,q}\Bigr)$

Hill’s diversity index in 1973

Let $p_j$ be the relative frequency of species $j$, for $j = 1, \dots, S$.

$D_a(p) = \Bigl(\sum_{j=1}^{S} p_j^{\,a}\Bigr)^{1/(1-a)}$

$\lim_{a\to 1}\log D_a(p) = \lim_{a\to 1}\dfrac{1}{1-a}\log\sum_{j=1}^{S} p_j^{\,a} = -\sum_{j=1}^{S} p_j\log p_j = H(p)$
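A quick numerical check of the limit above: $\log D_a(p)$ approaches the Shannon entropy $H(p)$ as $a \to 1$. The frequency vector is an assumption made up for illustration.

```python
import numpy as np

def hill_diversity(p, a):
    """Hill's diversity index D_a(p) = (sum_j p_j^a)^(1/(1-a)), a != 1."""
    return np.sum(p ** a) ** (1.0 / (1.0 - a))

p = np.array([0.5, 0.3, 0.1, 0.1])          # assumed relative frequencies of S = 4 species
shannon = -np.sum(p * np.log(p))            # H(p)

for a in (0.5, 0.9, 0.999, 2.0):
    print(f"a = {a}: log D_a = {np.log(hill_diversity(p, a)):.4f}")
print("Shannon entropy H(p) =", round(shannon, 4))   # log D_a -> H(p) as a -> 1
```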

Projective Entropy

γ-cross entropy

γ-diagonal entropy

Cf. Good (1972), Fujisawa-Eguchi (2008), Eguchi-Kato (2010)

γ-divergence

Remark: the γ-diagonal entropy is equivalent to the Tsallis entropy.

Ferrari-Yang (2010)

Three properties

Linearity  

Scale invariance

Lower bound

Do these three properties lead to uniqueness of C?

Theorem

Remark: 

Proof

Shannon‐Khinchin axiom

Cf. Hanel and Thurner (2011)

Max γ-entropy

γ-entropy

Equal moment space

Max γ-entropy

γ-normal

Eguchi-Komori-Kato (2011)

γ-Estimation

Parametric model

γ-loss function

γ-estimator

Gaussian

γ-estimator

Remark

Escort normal family

Escort distribution

γ-estimation on the γ-normal model

γ-loss function

Pythagorean triangle

Loss decomposition

Pythagorean relation

MLE for normal model

Equal moment space

Max 0-entropy

Gaussian

MLE is the sample mean and variance

Likelihood

Teicher's theorem (1961): let g be a location-scale family; the MLE is the sample mean and variance if and only if g is Gaussian.

The γ-estimator leads to the γ-normal model

γ = 0 (Gaussian); cf. Teicher (1961)

model: 0-model (Gaussian), γ-model

loss function: 0-loss (log-likelihood), γ-loss

The 0-loss (log-likelihood) gives the MLE; the γ-loss gives the γ-estimator, an M-estimator.

Emergence: (γ-estimator, γ′-model)

U-cross entropy

U is a convex and increasing function

U-entropy

where

U-entropy and U-divergence

Cf. Eguchi-Kano (2001)

U-divergence

Example 1. Boltzmann-Shannon entropy.

Example 2. Power entropy: Tsallis (1988), Basu et al. (1998); cf. Box-Cox (1964).

Example 3. Sigmoid entropy. Takenouchi-Eguchi (2004)

Tsallis entropy (q = β + 1)
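To make Example 2 concrete, here is a minimal sketch of robust estimation with the density power (β) loss of Basu et al. (1998) for the mean of a normal model with known scale; the contamination setting and the choice β = 0.5 are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Density power (beta) loss of Basu et al. (1998) for N(mu, 1); the integral term
# int f^{1+beta} dx does not depend on mu when the scale is fixed, so it is dropped here.
def beta_loss(mu, x, beta):
    return -np.mean(norm.pdf(x, loc=mu, scale=1.0) ** beta) / beta

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 95),
                    rng.normal(8.0, 1.0, 5)])          # 5% outliers (assumed contamination)

beta = 0.5                                             # illustrative tuning choice
mu_beta = minimize_scalar(beta_loss, args=(x, beta), bounds=(-5, 10), method="bounded").x
print("MLE (sample mean):", x.mean())                  # pulled toward the outliers
print("beta-estimator   :", mu_beta)                   # stays near 0
```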

max U‐entropy model

Let $p_0(x)$ be a pdf and let $t(x)$ be a statistic of dimension $d$.

Mean-equal space: $\Gamma(p_0) = \{\,p : \mathrm E_p\,t(X) = \mathrm E_{p_0}t(X)\,\}$

Consider the problem $\max_{p\in\Gamma(p_0)} H_U(p)$.

The Lagrangian $L(p,\theta,\tau)$, with multipliers $\theta$ for the moment constraint and $\tau$ for normalization, leads to $\xi(p(x)) = \theta^{\mathsf T} t(x) + \mathrm{const.}$ (here $u = U'$ and $\xi = u^{-1}$).

So $p(x) = u\bigl(\theta^{\mathsf T} t(x) - \kappa_U(\theta)\bigr)$, which we call the U-model.

U‐estimation

Note:

U‐estimator

Let $x_1, \dots, x_n$ be from $p(x)$ and let $p(x,\theta)$ be a model.

U‐loss function

Consistency

Asymptotic normality

U-method: U function

U-model

U-divergence, U-entropy, U-cross entropy

U-loss

MaxEnt, empirical version

U-estimator

(U-estimator, U-model)

(U-estimator, expo-model)

Pythagoras

Statistical analysis

Robustness 

Moment estimate

model: exponential model, U1-model

loss function: U = exp (likelihood), U1-loss

(exponential model, likelihood): MLE

(exponential model, U1-loss): U1-estimator under the exp-model, a robust M-estimate (robust U1-estimator)

(U1-model, U1-loss): moment estimator


Minimax game

Cf.  Grunwald and Dawid (2004)

where $p^{*} = \operatorname*{argmax}_{p\in\Gamma(p_0)} H_U(p)$

Pattern recognition

Feature vector $\boldsymbol x = (x_1, \dots, x_p)$, class label $y \in \{-1, +1\}$

Classifier $y = f(\boldsymbol x)$

Training data $D = \{(\boldsymbol x_1, y_1), \dots, (\boldsymbol x_n, y_n)\}$

Statistical classifier $f(\boldsymbol x) = \mathrm{sign}(F(\boldsymbol x))$

Set of weak learners

Decision stump

Linear classifiers

ANNs, SVMs, k-NNs

Not strong individually, but exhaustive!

AdaBoost (Freund-Schapire, 1997)

1. Initial setting: $w_1(i) = \frac1n\ (i = 1, \dots, n)$, $F_0(\boldsymbol x) = 0$.

2. For $t = 1, \dots, T$:

(a) $f^{(t)} = \operatorname*{argmin}_{f\in\mathcal F}\,\varepsilon_t(f)$, where $\varepsilon_t(f) = \sum_i \mathrm I(f(\boldsymbol x_i)\ne y_i)\,w_t(i)$

(b) $\alpha_t = \dfrac12\log\dfrac{1-\varepsilon_t(f^{(t)})}{\varepsilon_t(f^{(t)})}$

(c) $w_{t+1}(i) = w_t(i)\exp\bigl(-\alpha_t\,y_i f^{(t)}(\boldsymbol x_i)\bigr)$

3. Output $\mathrm{sign}(F_T(\boldsymbol x))$, where $F_T(\boldsymbol x) = \sum_{t=1}^{T}\alpha_t f^{(t)}(\boldsymbol x)$.
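A compact NumPy sketch of steps 1-3 above, using decision stumps as the weak learners; the stump dictionary, the synthetic data, and T = 50 are illustrative assumptions, not the slides' setting.

```python
import numpy as np

def fit_adaboost(X, y, T=50):
    """AdaBoost (Freund-Schapire, 1997) with decision stumps f(x) = s * sign(x_j - c)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                       # step 1: w_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(T):                            # step 2
        best = None
        for j in range(d):                        # (a) weak learner with smallest weighted error
            for c in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.sign(X[:, j] - c + 1e-12)
                    err = np.sum(w * (pred != y))
                    if best is None or err < best[0]:
                        best = (err, j, c, s, pred)
        err, j, c, s, pred = best
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)     # (b) coefficient alpha_t
        w = w * np.exp(-alpha * y * pred)         # (c) weight update
        w /= w.sum()
        stumps.append((j, c, s)); alphas.append(alpha)
    return stumps, alphas

def decision_function(X, stumps, alphas):
    """F_T(x) = sum_t alpha_t f_t(x); step 3 classifies by its sign."""
    F = np.zeros(len(X))
    for (j, c, s), a in zip(stumps, alphas):
        F += a * s * np.sign(X[:, j] - c + 1e-12)
    return F

# Tiny illustration on synthetic data (assumed, not from the slides)
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)   # circular true boundary
stumps, alphas = fit_adaboost(X, y, T=50)
print("training error:", np.mean(np.sign(decision_function(X, stumps, alphas)) != y))
```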

Learning process

[Diagram: training data $\{(\boldsymbol x_1,y_1),\dots,(\boldsymbol x_n,y_n)\}$; weights $w_1(1),\dots,w_1(n) \to w_2(1),\dots,w_2(n) \to \cdots \to w_T(1),\dots,w_T(n)$; weak classifiers $f^{(1)}(\boldsymbol x), f^{(2)}(\boldsymbol x), \dots, f^{(T)}(\boldsymbol x)$ with coefficients $\alpha_1, \dots, \alpha_T$]

Final decision by $F_T(\boldsymbol x) = \sum_{t=1}^{T}\alpha_t f^{(t)}(\boldsymbol x)$

Learning curve

[Figure: training error rate vs. iteration; error axis 0.05 to 0.2, iterations up to 250]

Stopping rule

Training data

Final decision function $F_T(\boldsymbol x) = \sum_{t=1}^{T}\alpha_t f^{(t)}(\boldsymbol x)$

How to find T?

10-fold CV error rate

$D = \{(\boldsymbol x_i, y_i)\}_{i=1}^{n}$

[Diagram: the data are split into folds $1, 2, \dots, 10$; for each $k$, fold $k$ is the validation set and the remaining folds are the training set]

Averaging by $\frac1{10}\sum_{k=1}^{10}\varepsilon^{(k)}$ for $F_T(\boldsymbol x) = \sum_{t=1}^{T}\alpha_t f^{(t)}(\boldsymbol x)$
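One way to carry out this 10-fold choice of T in practice; the sketch below uses scikit-learn's AdaBoostClassifier and its staged predictions, which is an assumed tooling choice (the toy data are also assumptions).

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)      # assumed toy data

T_max = 200
cv_err = np.zeros(T_max)
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = AdaBoostClassifier(n_estimators=T_max).fit(X[train_idx], y[train_idx])
    # staged_predict yields the prediction after each boosting round t = 1, ..., T_max
    errs = [np.mean(y_pred != y[val_idx]) for y_pred in clf.staged_predict(X[val_idx])]
    errs += [errs[-1]] * (T_max - len(errs))                 # pad if boosting stopped early
    cv_err += np.array(errs) / 10.0                          # averaging by (1/10) * sum_k

T_best = int(np.argmin(cv_err)) + 1
print("selected T:", T_best, " CV error:", cv_err[T_best - 1])
```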

Update of exp-loss

Exponential loss functional: $L_{\exp}(F) = \dfrac1n\sum_{i=1}^{n}\exp\{-y_i F(\boldsymbol x_i)\}$

Consider one update as $F(\boldsymbol x) \mapsto F(\boldsymbol x) + \alpha f(\boldsymbol x)$. Then

$L_{\exp}(F+\alpha f) = \dfrac1n\sum_{i=1}^{n}\exp\{-y_i F(\boldsymbol x_i)\}\exp\{-\alpha y_i f(\boldsymbol x_i)\} = L_{\exp}(F)\bigl\{\varepsilon(f)\,e^{\alpha} + (1-\varepsilon(f))\,e^{-\alpha}\bigr\}$,

where $\varepsilon(f) = \dfrac{\sum_{i=1}^{n}\mathrm I(f(\boldsymbol x_i)\ne y_i)\exp\{-y_i F(\boldsymbol x_i)\}}{n\,L_{\exp}(F)}$.

Sequential minimization

$L_{\exp}(F+\alpha f) = L_{\exp}(F)\bigl\{\varepsilon(f)\,e^{\alpha} + (1-\varepsilon(f))\,e^{-\alpha}\bigr\}$

By the arithmetic-geometric mean inequality, $\varepsilon(f)\,e^{\alpha} + (1-\varepsilon(f))\,e^{-\alpha} \ge 2\sqrt{\varepsilon(f)\{1-\varepsilon(f)\}}$, with equality iff $\varepsilon(f)\,e^{\alpha} = (1-\varepsilon(f))\,e^{-\alpha}$, that is,

$\alpha_{\mathrm{opt}}(f) = \dfrac12\log\dfrac{1-\varepsilon(f)}{\varepsilon(f)}$

AdaBoost = sequential minimization of the exp-loss

(a) $f^{(t)} = \operatorname*{argmin}_{f\in\mathcal F}\,\varepsilon_t(f)$

(b) $\alpha_t = \operatorname*{argmin}_{\alpha\in\mathbb R} L_{\exp}(F_{t-1}+\alpha f^{(t)}) = \dfrac12\log\dfrac{1-\varepsilon_t(f^{(t)})}{\varepsilon_t(f^{(t)})}$

(c) $w_{t+1}(i) = w_t(i)\exp\{-\alpha_t y_i f^{(t)}(\boldsymbol x_i)\}$

$\min_{\alpha\in\mathbb R} L_{\exp}(F_{t-1}+\alpha f^{(t)}) = L_{\exp}(F_{t-1})\cdot 2\sqrt{\varepsilon_t(f^{(t)})\{1-\varepsilon_t(f^{(t)})\}}$

Bayes risk consistency

Expected loss functional:

$\mathbb L_{\exp}(F) = \int \exp\{-yF(\boldsymbol x)\}\,dP(\boldsymbol x, y) = \int\bigl\{e^{-F(\boldsymbol x)}\,p(+1\mid\boldsymbol x) + e^{F(\boldsymbol x)}\,p(-1\mid\boldsymbol x)\bigr\}\,p(\boldsymbol x)\,d\boldsymbol x$

Optimal discriminant function:

$F_{\mathrm{opt}}(\boldsymbol x) = \dfrac12\log\dfrac{p(+1\mid\boldsymbol x)}{p(-1\mid\boldsymbol x)}$

$\mathbb L_{\exp}(F) - \mathbb L_{\exp}(F_{\mathrm{opt}}) = \int\bigl\{e^{-F(\boldsymbol x)}p(+1\mid\boldsymbol x) + e^{F(\boldsymbol x)}p(-1\mid\boldsymbol x) - 2\sqrt{p(+1\mid\boldsymbol x)\,p(-1\mid\boldsymbol x)}\bigr\}\,p(\boldsymbol x)\,d\boldsymbol x$

$\;= \int\bigl\{e^{-F(\boldsymbol x)/2}\sqrt{p(+1\mid\boldsymbol x)} - e^{F(\boldsymbol x)/2}\sqrt{p(-1\mid\boldsymbol x)}\bigr\}^2\,p(\boldsymbol x)\,d\boldsymbol x \;\ge\; 0$
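A small numerical check of the optimal discriminant function: at a fixed x with an assumed conditional probability p(+1|x) = 0.7, the pointwise minimizer of the conditional exp-risk agrees with half the log-odds.

```python
import numpy as np

p_plus = 0.7                                    # assumed conditional probability p(+1 | x)
F_grid = np.linspace(-3, 3, 100001)
risk = np.exp(-F_grid) * p_plus + np.exp(F_grid) * (1 - p_plus)

print("grid minimizer :", F_grid[np.argmin(risk)])
print("0.5 * log-odds :", 0.5 * np.log(p_plus / (1 - p_plus)))
```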

A simulation

[Figure: feature space $[-1,1]\times[-1,1]$ with the true decision boundary, and the set of randomly generated linear classifiers]

Learning process

[Figure: six snapshots of the boosted decision boundary over the iterations]

Iter = 1, train err = 0.21 Iter = 13, train err = 0.18 Iter = 17, train err = 0.10

Iter = 23, train err = 0.10 Iter = 31, train err = 0.095 Iter = 47, train err = 0.08
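A sketch that mimics this simulation: points in [-1,1]^2, a dictionary of randomly generated linear classifiers, and boosting by repeatedly picking the dictionary element with the smallest weighted error. The true boundary, dictionary size, and number of rounds are assumptions, so the printed error values will differ from the panels above.

```python
import numpy as np

rng = np.random.default_rng(6)
n, M, T = 200, 500, 50
X = rng.uniform(-1, 1, size=(n, 2))                       # feature space [-1,1] x [-1,1]
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)    # assumed true decision boundary

# Dictionary of random linear classifiers f(x) = sign(a . x + b)
A = rng.normal(size=(M, 2))
b = rng.uniform(-1, 1, size=M)
preds = np.sign(X @ A.T + b)                              # n x M matrix of predictions
preds[preds == 0] = 1

w = np.full(n, 1.0 / n)
F = np.zeros(n)
for t in range(1, T + 1):
    errs = (w[:, None] * (preds != y[:, None])).sum(axis=0)
    j = int(np.argmin(errs))                              # best weak classifier this round
    eps = np.clip(errs[j], 1e-12, 1 - 1e-12)
    alpha = 0.5 * np.log((1 - eps) / eps)
    F += alpha * preds[:, j]
    w *= np.exp(-alpha * y * preds[:, j]); w /= w.sum()
    if t in (1, 13, 17, 23, 31, 47):
        print(f"Iter = {t}, train err = {np.mean(np.sign(F) != y):.3f}")
```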

Final decision

[Figure: contour of $F(\boldsymbol x)$ and $\mathrm{sign}(F(\boldsymbol x))$ over the feature space]

U‐boost for classification

U-boost:

Let (x, y) be a variable with feature vector x and label y

Decision rule

Step 1.

Step 2.

Murata et al (2004)

U‐Boost for density estimation

Dictionary of density functions: $\mathcal W = \{\,g(\boldsymbol x) : g(\boldsymbol x)\ge 0,\ \int g(\boldsymbol x)\,d\boldsymbol x = 1\,\}$

U-loss function: $L_U(f) = -\dfrac1n\sum_{i=1}^{n}\xi(f(\boldsymbol x_i)) + \int U(\xi(f(\boldsymbol x)))\,d\boldsymbol x$, with $\xi = (U')^{-1}$

Learning space = U-model: $\mathcal W_U^{*} = \xi^{-1}\bigl(\mathrm{co}\{\xi(g) : g\in\mathcal W\}\bigr)$

Goal: find $f^{*} = \operatorname*{argmin}_{f\in\mathcal W_U^{*}} L_U(f)$

Let $f(\boldsymbol x,\boldsymbol\pi) = \xi^{-1}\bigl(\sum_k \pi_k\,\xi(g^{(k)}(\boldsymbol x))\bigr) \in \mathcal W_U^{*}$. Then $f(\boldsymbol x, (0,\dots,1,\dots,0)) = g^{(k)}(\boldsymbol x)$.

U‐Boost algorithm

(A) Find $f_1 = \operatorname*{argmin}_{g\in\mathcal W} L_U(g)$.

(B) Update $f_{k+1} = \xi^{-1}\bigl((1-\pi_{k+1})\,\xi(f_k) + \pi_{k+1}\,\xi(g_{k+1})\bigr)$ such that

$(g_{k+1},\pi_{k+1}) = \operatorname*{argmin}_{g\in\mathcal W,\ \pi\in(0,1)} L_U\bigl(\xi^{-1}((1-\pi)\,\xi(f_k) + \pi\,\xi(g))\bigr)$.

(C) Select $K$, and $\hat f = \xi^{-1}\bigl((1-\pi_K)\,\xi(f_{K-1}) + \pi_K\,\xi(g_K)\bigr)$.

Example 2. Power entropy (index β)

If $\beta \to 0$: $f^{*}(\boldsymbol x) \propto \exp\bigl(\sum_k \alpha_k \log g^{(k)}(\boldsymbol x)\bigr)$

If $\beta = 1$: $f^{*}(\boldsymbol x) = \sum_k \alpha_k\, g^{(k)}(\boldsymbol x)$; Klemelä (2007)

Friedman et al (1984)
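A sketch of the β = 1 case just described: stagewise density estimation over a Gaussian dictionary, where each round (B) picks a dictionary element g and a weight π that most decrease the β = 1 U-loss (an L2-type loss). The dictionary grid, data, and number of rounds are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2, 0.7, 150), rng.normal(1.5, 1.0, 150)])   # assumed data

# Dictionary W: Gaussian densities N(. ; m, s^2) on a grid of centers, fixed scale s
s = 0.8
centers = np.linspace(-5, 5, 41)

def cross_l2(m1, m2):
    """Closed form of int N(x; m1, s^2) N(x; m2, s^2) dx."""
    return norm.pdf(m1 - m2, scale=np.sqrt(2) * s)

def l2_loss(alphas, mus):
    """beta = 1 U-loss: 0.5 * int f^2 dx - (1/n) * sum_i f(x_i)."""
    a, m = np.array(alphas), np.array(mus)
    quad = 0.5 * np.sum(a[:, None] * a[None, :] * cross_l2(m[:, None], m[None, :]))
    fit = np.mean(np.sum(a[:, None] * norm.pdf(x[None, :], loc=m[:, None], scale=s), axis=0))
    return quad - fit

# (A) best single element, then (B) stagewise convex updates f <- (1 - pi) f + pi g
alphas = [1.0]
mus = [centers[np.argmin([l2_loss([1.0], [c]) for c in centers])]]
for k in range(10):                                        # K = 10 rounds (assumed)
    best = None
    for c in centers:
        for pi in (0.05, 0.1, 0.2, 0.3, 0.5):
            loss = l2_loss([a * (1 - pi) for a in alphas] + [pi], mus + [c])
            if best is None or loss < best[0]:
                best = (loss, c, pi)
    loss, c, pi = best
    alphas = [a * (1 - pi) for a in alphas] + [pi]
    mus = mus + [c]
    print(f"round {k+1}: loss = {loss:.4f}, added center {c:.2f} with weight {pi}")
```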

Learning?

Boosting is a forward stagewise algorithm

[Figure: true density on $[-4,4]\times[-4,4]$]

$\mathcal W$: dictionary

Inner step in the convex hull 

$\mathcal W = \{g(\boldsymbol x,\theta) : \theta\}$, $\mathcal W^{*} = \mathrm{convex}(\mathcal W)$

[Diagram: the estimate $f(\boldsymbol x)$ lies in the convex hull of dictionary elements $g(\boldsymbol x,\theta_1), \dots, g(\boldsymbol x,\theta_7)$]

$f^{*} = \operatorname*{argmin}_{f\in\mathcal W^{*}} L_U(f)$

Goal: $\hat f^{*}(\boldsymbol x) = \hat\alpha_1\, g(\boldsymbol x,\hat\theta_1) + \cdots + \hat\alpha_k\, g(\boldsymbol x,\hat\theta_k)$

Loss decomposition

$L_U\bigl(\xi^{-1}((1-\alpha)\,\xi(f_k) + \alpha\,\xi(g))\bigr) = (1-\alpha)L_U(f_k) + \alpha L_U(g) - \delta_U(f_k, g, \alpha)$, where

$\delta_U(f_k,g,\alpha) = \int\bigl\{(1-\alpha)\,U(\xi(f_k(\boldsymbol x))) + \alpha\,U(\xi(g(\boldsymbol x))) - U\bigl((1-\alpha)\,\xi(f_k(\boldsymbol x)) + \alpha\,\xi(g(\boldsymbol x))\bigr)\bigr\}\,d\boldsymbol x$

Note 1. (1) $\delta_U(f_k,g,\alpha) \ge 0$;  (2) $\delta_U(f_k,g,\alpha) = 0 \iff f_k = g$ (a.e.)

Min problem: $\min_{g\in\mathcal W}\{L_U(g) - \delta_U(f_k,g,\alpha)\}$: make $L_U(g)$ smaller and $\delta_U(f_k,g,\alpha)$ larger in $g$ simultaneously.

Non-asymptotic bound

Theorem (Naito and Eguchi, 2012). Assume that the data distribution has a density $p(\boldsymbol x)$. Then

$\mathrm E_p\,D_U(p,\hat f_K) \le \mathrm{FA}(p,\mathcal W_U^{*}) + \mathrm{EE}(p,\mathcal W) + \mathrm{IE}(K)$, where

$\mathrm{FA}(p,\mathcal W_U^{*}) = \inf_{f\in\mathcal W_U^{*}} D_U(p,f)$  (functional approximation),

$\mathrm{EE}(p,\mathcal W) = 2\,\mathrm E_p\Bigl\{\sup_{g\in\mathcal W}\Bigl|\dfrac1n\sum_{i=1}^{n}\xi(g(X_i)) - \mathrm E_p\,\xi(g(X))\Bigr|\Bigr\}$  (estimation error),

$\mathrm{IE}(K)$ is the iteration effect, determined by constants $b_U$, $c_U$ and the number of iterations $K$.

Remark: trade-off between $\mathrm{FA}(p,\mathcal W_U^{*})$ and $\mathrm{EE}(p,\mathcal W)$.

Statistical applications

(kernel) U-PCA, U-Boost, U-SVM

U-ICA (mixture)

U-AUC

U-Boosting density

Robustness (outliers): (exp-model, U-loss); Redescendency (hidden structure): (U-model, U-estimate)

U-cluster

Evolve (U-model, U-loss)

Drawback of Boosting

because

Early stopping: Zhang-Yu (2005), Klemelä (2008)

Breiman (1996)

Meinshausen and Bühlmann (2011), Städler-Bühlmann-van de Geer (2010)

Poincaré conjecture 1904

Every simply connected, closed 3-manifold is homeomorphic to the 3-sphere.

Ricci flow, by Richard Hamilton in 1981; Perelman 2002, 2003

Optimal control theory due to Pontryagin and Bellman

Optimal transport

Wasserstein space

Optimal transport

Theorem (Brenier, 1991)

Talagrand inequality

Log-Sobolev inequality

Optimal transport theory is being extended to general manifolds.

Geometers consider not the space itself, but a family of distributions on the space.

C. Villani (2010): Fields Medal for work on optimal transport theory
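To make these notions concrete in the simplest setting: on the real line the Brenier map is the monotone rearrangement, so the 2-Wasserstein distance between two samples reduces to matching sorted values. The two sample distributions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(0.0, 1.0, 1000)      # sample from the source distribution (assumed)
y = rng.normal(2.0, 1.5, 1000)      # sample from the target distribution (assumed)

# In one dimension the optimal (Brenier) map is monotone: sort both samples and pair them.
x_sorted, y_sorted = np.sort(x), np.sort(y)
W2_sq = np.mean((x_sorted - y_sorted) ** 2)

print("empirical W_2^2 :", W2_sq)
# For two univariate Gaussians, W_2^2( N(m0,s0^2), N(m1,s1^2) ) = (m0-m1)^2 + (s0-s1)^2
print("closed form     :", (0.0 - 2.0) ** 2 + (1.0 - 1.5) ** 2)
```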

Fisher's equation in 1937

Fisher-Kolmogorov equation

Ecology, physiology, combustion, crystallization, plasma physics, phase transition problems; cf. Wikipedia (reaction-diffusion systems)

Reaction–diffusion systems
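A minimal finite-difference sketch of the Fisher-Kolmogorov (Fisher-KPP) equation $u_t = D\,u_{xx} + r\,u(1-u)$, whose travelling front has the classical speed $2\sqrt{rD}$; the grid, coefficients, and initial condition are illustrative assumptions.

```python
import numpy as np

# Explicit finite differences for u_t = D u_xx + r u (1 - u) on [0, L] with crude no-flux ends.
D, r = 1.0, 1.0
L, nx, dx = 100.0, 1001, 0.1
dt = 0.2 * dx**2 / D                          # small step for stability of the explicit scheme
nsteps = 15000
x = np.linspace(0.0, L, nx)
u = np.where(x < 10.0, 1.0, 0.0)              # initial front on the left (assumed)

for _ in range(nsteps):
    lap = np.empty_like(u)
    lap[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
    lap[0], lap[-1] = lap[1], lap[-2]         # no-flux boundaries
    u = u + dt * (D * lap + r * u * (1 - u))

# The front travels at roughly the classical speed c = 2 * sqrt(r * D)
print("front position (u = 0.5):", x[np.argmin(np.abs(u - 0.5))])
print("expected ~ 10 + 2*sqrt(r*D)*t =", 10.0 + 2.0 * np.sqrt(r * D) * nsteps * dt)
```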

Statistical inference for noisy nonlinear ecological dynamic systems

Statistical algorithm

Sufficient statistic

Ricker map

Sampling

Statistical parameter

Simulated curved normal model
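A sketch of the Ricker-map setup of Wood (2010), "Statistical inference for noisy nonlinear ecological dynamic systems": simulate the noisy dynamics, take Poisson-sampled observations, and reduce them to a few summary statistics whose distribution is then modelled as (curved) normal to form a simulated likelihood. The parameter values and the particular statistics are illustrative assumptions.

```python
import numpy as np

def simulate_ricker(log_r=3.8, sigma=0.3, phi=10.0, T=100, N0=1.0, rng=None):
    """Noisy Ricker map N_{t+1} = r N_t exp(-N_t + e_t), observed as y_t ~ Poisson(phi N_t)."""
    if rng is None:
        rng = np.random.default_rng()
    N, y = N0, np.empty(T, dtype=int)
    for t in range(T):
        N = np.exp(log_r) * N * np.exp(-N + sigma * rng.normal())
        y[t] = rng.poisson(phi * N)
    return y

def summaries(y):
    """A few summary statistics of the kind fed into a simulated (synthetic) likelihood."""
    return np.array([y.mean(), y.std(), np.mean(y == 0),
                     np.corrcoef(y[:-1], y[1:])[0, 1]])

rng = np.random.default_rng(9)
y_obs = simulate_ricker(rng=rng)
print("observed summaries:", summaries(y_obs))

# Replicated simulations at the same parameter: their summaries would be modelled as a
# (curved) multivariate normal to form the simulated likelihood of the parameters.
sims = np.array([summaries(simulate_ricker(rng=rng)) for _ in range(200)])
print("simulated mean :", sims.mean(axis=0))
print("simulated cov  :", np.cov(sims.T).round(3))
```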

Conclusions and Future directions

Two cultures

Optimal transport theory

Simulated likelihood

Simulation modeling

Monte Carlo search

MLE 1912

From log to power function

Dynamics and statistics

MCMC

Big data

Release from Fisher's "mind control"
