Shinto Eguchi
The Institute of Statistical Mathematics
SOKENDAI Summer School, September 19-20, 2012, ISM
Maximum likelihood theory: one hundred years of statistics since its birth,
and a thought on future directions
Shinto Eguchi
Institute of Statistical Mathematics
Summer School at SOKENDAI (GUAS), 19-20 September 2012, ISM
Maximum likelihood theory
from Ronald Fisher in 1912
with discussion on future directions
Outline
MLE 100 years from Fisher 1912
Two cultures in statistics and AdaBoost
Information geometry
Poincaré conjecture and Optimal transport theory
Power entropy and divergence
Likelihood for equation model
R. A. Fisher
Ronald A. Fisher(1890-1962 )
Statistics: analysis of variance, maximum likelihood estimation, design of experiments, randomization test, linear discriminant analysis
Population genetics: Genetical Theory of Natural Selection, Wright-Fisher model, Fisher's equation, fundamental theorem of natural selection
R. A. Fisher digital archive, University of Adelaide
Maximum likelihood
MLE 1912-2012, Aldrich (1997)
Efficiency, sufficiency, Fisher information, ...
Boltzmann-Shannon entropy, Kullback-Leibler divergence
Max entropy distribution (exponential model)
Fisher (1912)
David Scott, Technometrics (2001)
Asymptotic efficiency
Statistical model
Maximum likelihood estimator
Asymptotic normality
Confidence region: $\Pr\{\theta \in C(r) \mid f(x,\theta)\} = 1 - \alpha$, where
$C(r) = \{\theta : (\theta - \hat\theta)^{\mathrm T}\, n I(\hat\theta)\, (\theta - \hat\theta) \le r^2\}$
Estimative distribution
Asymptotically efficient
Most cited statisticians
[Bar chart: citation counts for Fisher (1912)]
Cf. Physics, Medicine, Mathematics
Statistical curvature
Statistical model
Information geometry
Information metric
Mixture connection
Exponential connection
Rao (1949), Amari (1982)
m‐geodesic and e‐geodesic
m‐geodesic
e‐geodesic
A model $M$ is e-geodesic if $p, q \in M \ \Rightarrow\ x_t^{(\mathrm e)}(p,q) \in M \quad (t \in (0,1))$
A model $M$ is m-geodesic if $p, q \in M \ \Rightarrow\ x_t^{(\mathrm m)}(p,q) \in M \quad (t \in (0,1))$
Exponential model: $M^{(\mathrm e)} = \{\, p(x,\theta) = p_0(x)\exp\{\theta^{\mathrm T} t(x) - \kappa(\theta)\} \,\}$
Mean match model: $M^{(\mathrm m)} = \{\, p(x) : \mathrm E_p\{t(x)\} = \mathrm E_{p_0}\{t(x)\} \,\}$
[Diagram: densities p, q, r joined by m- and e-geodesics] Nagaoka-Amari (1982)
KL divergence (1951)
Kullback-Leibler divergence
AIC (Akaike, 1973)
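For reference, the standard definitions behind these names (not recovered from the slide itself):
$D_{\mathrm{KL}}(p,q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx \ \ge\ 0, \qquad \mathrm{AIC} = -2\log L(\hat\theta) + 2\,\dim(\theta).$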
MLE on Exponential family
Exponential family: $M^{(\mathrm e)} = \{\, p(x,\theta) = p_0(x)\exp\{\theta^{\mathrm T} t(x) - \kappa(\theta)\} \,\}$
Cumulant function: $\kappa(\theta) = \log \mathrm E_{p_0}\{\mathrm e^{\theta^{\mathrm T} t(x)}\}$
Mean parameter: $\eta(\theta) := \mathrm E_\theta\{t(x)\} = \dfrac{\partial}{\partial\theta}\kappa(\theta)$, and $\dfrac{\partial^2}{\partial\theta\,\partial\theta^{\mathrm T}}\kappa(\theta) = \mathrm{var}_\theta\{t(x)\}$
Log-likelihood function: $L(\theta) = \sum_{i=1}^{n}\log p(x_i,\theta) = n\{\theta^{\mathrm T}\bar t - \kappa(\theta)\}$, with $\bar t = \frac{1}{n}\sum_{i=1}^{n} t(x_i)$
Likelihood equation: $\eta(\hat\theta_{\mathrm{ML}}) = \bar t$
Mean match model: $M^{(\mathrm m)}(\eta) = \{\, p(x) : \mathrm E_p\{t(x)\} = \eta \,\}$
Maximum likelihood: $\max_\theta L(\theta) = n \max_\theta \{\theta^{\mathrm T}\bar t - \kappa(\theta)\} = n\,\kappa^*(\bar t)$
Barndorff-Nielsen (1978), Brown (1986)
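A tiny numeric illustration of the likelihood equation $\eta(\hat\theta_{\mathrm{ML}}) = \bar t$ (my own example, not from the slides): for the Poisson family in exponential form, $t(x) = x$ and $\kappa(\theta) = \mathrm e^{\theta}$, so $\eta(\theta) = \mathrm e^{\theta}$ and $\hat\theta_{\mathrm{ML}} = \log \bar t$:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.poisson(lam=3.0, size=1000)   # data with true theta = log 3
    theta_hat = np.log(x.mean())          # solves eta(theta) = exp(theta) = t-bar
    print(theta_hat, np.log(3.0))         # the two agree closely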
Pair of (model, estimation)
Model: $M^{(\mathrm e)}$
Estimation leaf: $M^{(\mathrm m)}$
Amari (1982)
Non‐exponential family
$f(x,\mu,\sigma,\nu) \propto \Bigl(1 + \dfrac{(x-\mu)^2}{\nu\sigma^2}\Bigr)^{-\frac{\nu+1}{2}}$
t‐distribution
Generalized Pareto distribution
Univariate
p‐variate
A unified view of the power exponential family
Gibbs‐Boltzmann‐Shannon
Josiah Willard Gibbs (1839-1903)
Claude E. Shannon (1916-2001)
Ludwig E. Boltzmann (1844-1906)
GBS-entropy: $H_{\mathrm{GBS}}(p) = -\int p(x)\log p(x)\,dx$
Log-likelihood: $\dfrac{1}{n}\sum_{i=1}^{n}\log p(x_i,\theta)\ \longrightarrow\ -H_{\mathrm{GBS}}(g,\, p(\cdot,\theta))$, the negative cross entropy against the data density $g$
Do we get a new estimator if we extend the GBS entropy?
Tsallis entropy: $S_q = \dfrac{k}{q-1}\Bigl(1 - \sum_{j=1}^{W} p_j^{\,q}\Bigr)$
Hill's diversity index in 1973:
Let $p_j$ be the relative frequency of a species, for $j = 1,\ldots,S$.
$D_a(p) = \Bigl(\sum_{j=1}^{S} p_j^{\,a}\Bigr)^{\frac{1}{1-a}}$
$\lim_{a\to 1}\log\{D_a(p)\} = -\sum_{j=1}^{S} p_j \log p_j = H(p)$
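A quick numeric check of this limit (the frequencies below are arbitrary illustrative values):

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])              # relative frequencies, S = 3
    H = -np.sum(p * np.log(p))                 # Shannon entropy H(p)
    for a in (0.9, 0.99, 0.999):
        log_D = np.log(np.sum(p ** a)) / (1.0 - a)
        print(a, log_D, H)                     # log D_a(p) -> H(p) as a -> 1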
Projective Entropy
γ-cross entropy
γ-diagonal entropy
Cf. Good (1972), Fujisawa-Eguchi (2008), Eguchi-Kato (2010)
γ-divergence
Remark: The diagonal entropy is equivalent to Tsallis entropy
Ferrari‐Yang (2010)
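For reference, the γ-cross entropy and γ-divergence of Fujisawa-Eguchi (2008) can be written in the projective (scale-invariant) form below; this is the published form as I understand it, not recovered from the slide:
$C_\gamma(g,f) = -\frac{1}{\gamma}\log\int g(x)\,f(x)^{\gamma}\,dx + \frac{1}{1+\gamma}\log\int f(x)^{1+\gamma}\,dx,$
$D_\gamma(g,f) = C_\gamma(g,f) - C_\gamma(g,g) \ \ge\ 0,$
with $D_\gamma(g, c f) = D_\gamma(g, f)$ for any $c > 0$, which is the scale invariance listed among the properties below.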
Three properties
Linearity
Scale invariance
Lower bound
Do these properties lead to uniqueness for C?
Theorem
Remark:
Proof
Shannon‐Khinchin axiom
Cf. Hanel and Thurner (2011)
Max γ-entropy
γ-entropy
Equal moment space
Max γ-entropy
γ-normal
Eguchi-Komori-Kato (2011)
γ-Estimation
Parametric model
γ-loss function
γ-estimator
Gaussian
γ-estimator
Remark
Escort normal family
Escort distribution
γ-estimation on the γ-normal model
γ-loss function
Pythagoras triangle
Loss decomposition
Pythagorean
MLE for normal model
Equal moment space
Max 0-entropy
Gaussian
MLE is the sample mean and variance
Likelihood
Teicher’s theorem (1961)
Let $g$ generate a location-scale family. The sample mean is the MLE of the location parameter
if and only if $g$ is Gaussian.
The γ-estimator leads to the γ-normal model
γ = 0 (Gaussian). Cf. Teicher (1961)
model / loss function
0-loss (log-likelihood) on the 0-model: MLE
γ-loss on the 0-model: M-estimator
γ-loss on the γ-model: γ-estimator
Emergence: (γ-estimator, γ′-model)
U-entropy and U-divergence. Cf. Eguchi-Kano (2001)
Let U be a convex and increasing function.
U-cross entropy, U-entropy, U-divergence
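As a reference point for these definitions (following the notation of Eguchi's U-divergence papers as I understand them; the slide's own displays are not recoverable), with $u = U'$ and $\xi = u^{-1}$:
U-cross entropy: $C_U(p, q) = \int \bigl\{U(\xi(q(x))) - p(x)\,\xi(q(x))\bigr\}\,dx$
U-entropy: $H_U(p) = C_U(p, p)$
U-divergence: $D_U(p, q) = C_U(p, q) - H_U(p) \ \ge\ 0$
The corresponding empirical U-loss for a model $p(x,\theta)$ is $L_U(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\xi(p(x_i,\theta)) + \int U(\xi(p(x,\theta)))\,dx$. For $U(s) = \exp(s)$ this reduces to the Kullback-Leibler divergence and the log-likelihood.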
Example 1. Boltzmann-Shannon entropy.
Example 2. Power entropy. Tsallis (1988), Basu et al (1998). Cf. Box-Cox (1964)
Example 3. Sigmoid entropy. Takenouchi-Eguchi (2004)
Tsallis entropy ($q = \beta + 1$)
max U‐entropy model
Let $p_0(x)$ be a pdf and let $t(x)$ be a statistic of dimension $d$.
Mean-equal space: $\Gamma(p_0) = \{\, p : \mathrm E_p\{t(X)\} = \mathrm E_{p_0}\{t(X)\} \,\}$
Consider the problem $\max_{p \in \Gamma(p_0)} H_U(p)$.
The Lagrangian $L(p,\theta,\tau) = H_U(p) + \int \{\theta^{\mathrm T} t(x) + \tau\}\{p(x) - p_0(x)\}\,dx$
leads to $\xi(p(x)) - \theta^{\mathrm T} t(x) = \text{const.}$
So $p(x) = U'(\theta^{\mathrm T} t(x) - \kappa_U(\theta))$; we call this the U-model.
U‐estimation
Note:
U‐estimator
Let $x_1,\ldots,x_n$ be from $g(x)$ and let $p(x,\theta)$ be a model function.
U‐loss function
Consistency
Asymptotic normality
U-method
U function
U-model
U-divergence, U-entropy, U-cross entropy
U-loss
MaxEnt / Empirical
U-estimator
(U-estimator, U-model)
(U-estimator, expo-model)
Pythagoras
Statistical analysis
Robustness
Moment estimate
model
loss function
U = exp: likelihood
U1‐loss
Exponential U1‐model
Robust M-estimate
MLE
Robust U1-estimator
Moment estimator
U1‐estimator under exp‐model
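To make the robustness column concrete, here is a minimal sketch of a power-type robust location estimator in the spirit of the U1-estimators above (the normal model with known σ = 1, the power β = 0.5, and the contaminated sample are all my illustrative choices): for the density-power loss, the μ-dependent part is $-\frac{1}{n}\sum_i \exp\{-\beta(x_i-\mu)^2/2\}$, which downweights outliers and recovers the MLE (the sample mean) as β → 0.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def power_loss(mu, x, beta):
        # mu-dependent part of the density-power (beta) loss, normal model
        return -np.mean(np.exp(-beta * (x - mu) ** 2 / 2.0))

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 10.0)])  # 5% outliers
    mu_hat = minimize_scalar(power_loss, args=(x, 0.5),
                             bounds=(-5.0, 5.0), method="bounded").x
    print(x.mean(), mu_hat)   # the mean is dragged toward 10; mu_hat stays near 0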
Minimax game
Cf. Grunwald and Dawid (2004)
$p^* = \operatorname{argmax}_{p \in \Gamma(p_0)} H_U(p)$, where $\Gamma(p_0)$ is the mean-equal space
Pattern recognition
Feature vector $x = (x_1,\ldots,x_p)$, class label $y \in \{-1,+1\}$
Classifier: $y = f(x)$
Training data: $D = \{(x_1,y_1),\ldots,(x_n,y_n)\}$
Statistical classifier: $f(x) = \operatorname{sign}(F(x))$
Set of weak learners
Decision stumps
Linear classifiers
ANNs, SVMs, k-NNs
Not stronger, but exhaustive!
1. Initial setting: $w_1(i) = 1/n \ \ (i = 1,\ldots,n)$, $F_0(x) = 0$
2. For $t = 1,\ldots,T$:
(a) $f^{(t)} = \operatorname{argmin}_{f \in \mathcal F}\ \epsilon_t(f)$, where $\epsilon_t(f) = \sum_{i=1}^{n} \mathrm I(y_i \ne f(x_i))\, w_t(i)$
(b) $\alpha_t = \dfrac{1}{2}\log\dfrac{1 - \epsilon_t(f^{(t)})}{\epsilon_t(f^{(t)})}$
(c) $w_{t+1}(i) = w_t(i)\exp(-\alpha_t\, y_i\, f^{(t)}(x_i))$
3. Output $\operatorname{sign}(F_T(x))$, where $F_T(x) = \sum_{t=1}^{T} \alpha_t f^{(t)}(x)$
AdaBoost
Freund-Schapire (1997).
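A minimal runnable sketch of steps 1-3 (decision stumps as the weak learners; the stump search, the guard against zero error, and the weight renormalization are implementation choices of mine, not details from the slides):

    import numpy as np

    def stump_predict(X, j, s, sign):
        # weak learner f(x) = sign * (+1 if x_j > s else -1)
        return sign * np.where(X[:, j] > s, 1.0, -1.0)

    def fit_stump(X, y, w):
        # (a) choose the stump minimizing the weighted error eps_t(f)
        best = None
        for j in range(X.shape[1]):
            for s in np.unique(X[:, j]):
                for sign in (1.0, -1.0):
                    eps = np.sum(w * (stump_predict(X, j, s, sign) != y))
                    if best is None or eps < best[0]:
                        best = (eps, j, s, sign)
        return best

    def adaboost(X, y, T=50):
        n = len(y)
        w = np.full(n, 1.0 / n)                    # 1. w_1(i) = 1/n
        stumps, alphas = [], []
        for t in range(T):                         # 2. for t = 1, ..., T
            eps, j, s, sign = fit_stump(X, y, w)
            eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard the logarithm
            alphas.append(0.5 * np.log((1 - eps) / eps))                    # (b)
            stumps.append((j, s, sign))
            w = w * np.exp(-alphas[-1] * y * stump_predict(X, j, s, sign))  # (c)
            w = w / w.sum()                        # keep w_t a distribution
        return stumps, alphas

    def predict(X, stumps, alphas):
        # 3. sign(F_T(x)) with F_T(x) = sum_t alpha_t f_t(x)
        F = sum(a * stump_predict(X, j, s, sign)
                for a, (j, s, sign) in zip(alphas, stumps))
        return np.sign(F)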
Learning process
Training data $\{(x_1,y_1),\ldots,(x_n,y_n)\}$; weights $w_1(1),\ldots,w_1(n)$ are updated to $w_2(1),\ldots,w_2(n)$, and so on up to $w_T(1),\ldots,w_T(n)$, while weak classifiers $f^{(1)}(x), f^{(2)}(x),\ldots,f^{(T)}(x)$ are fitted in turn.
Final decision by $F_T(x) = \sum_{t=1}^{T} \alpha_t f^{(t)}(x)$
Learning curve
[Plot: training error rate (about 0.05-0.2) against iteration (up to 250)]
Stopping rule
Training data
Final decision function
$F_T(x) = \sum_{t=1}^{T} \alpha_t f^{(t)}(x)$
How do we find T?
10‐fold CV Error rate
$D = \{(x_i, y_i)\}_{i=1}^{n}$
[Diagram: the data are split into folds $1,\ldots,10$; each fold serves once as the validation set and the rest as training, giving error rates $\epsilon^{(1)},\ldots,\epsilon^{(10)}$]
Averaging by $\frac{1}{10}\sum_{k=1}^{10} \epsilon^{(k)}$
$F_T(x) = \sum_{t=1}^{T} \alpha_t f^{(t)}(x)$
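One way to carry out this selection in code (a sketch assuming scikit-learn is available; its AdaBoostClassifier stands in for the algorithm above, and staged_predict gives the decision after each round t):

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import KFold

    def cv_choose_T(X, y, T_max=250, n_splits=10, seed=0):
        # averaged 10-fold CV error for each number of rounds t = 1..T_max
        errs = np.zeros(T_max)
        for train, val in KFold(n_splits, shuffle=True,
                                random_state=seed).split(X):
            clf = AdaBoostClassifier(n_estimators=T_max).fit(X[train], y[train])
            for t, y_hat in enumerate(clf.staged_predict(X[val])):
                errs[t] += np.mean(y_hat != y[val]) / n_splits
        return int(np.argmin(errs)) + 1   # T minimizing the averaged CV error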
Update of exp‐loss
Exponential loss functional: $L_{\exp}(F) = \dfrac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(x_i)\}$
Consider one update as $F(x) \mapsto F(x) + \alpha f(x)$. Then
$L_{\exp}(F + \alpha f) = \dfrac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(x_i)\}\exp\{-\alpha y_i f(x_i)\} = \dfrac{1}{n}\sum_{i=1}^{n} \exp\{-y_i F(x_i)\}\bigl\{\mathrm e^{\alpha}\,\mathrm I(f(x_i) \ne y_i) + \mathrm e^{-\alpha}\,\mathrm I(f(x_i) = y_i)\bigr\}$
$= L_{\exp}(F)\bigl\{\epsilon(f)\,\mathrm e^{\alpha} + (1 - \epsilon(f))\,\mathrm e^{-\alpha}\bigr\}$
where $\epsilon(f) = \dfrac{\sum_{i=1}^{n} \mathrm I(f(x_i) \ne y_i)\exp\{-y_i F(x_i)\}}{n\, L_{\exp}(F)}$
Sequential minimization
$L_{\exp}(F + \alpha f) = L_{\exp}(F)\{\epsilon(f)\,\mathrm e^{\alpha} + (1 - \epsilon(f))\,\mathrm e^{-\alpha}\}$
By the arithmetic-geometric mean inequality, $\epsilon(f)\,\mathrm e^{\alpha} + (1 - \epsilon(f))\,\mathrm e^{-\alpha} \ \ge\ 2\sqrt{\epsilon(f)\{1 - \epsilon(f)\}}$,
with equality iff $\epsilon(f)\,\mathrm e^{\alpha} = (1 - \epsilon(f))\,\mathrm e^{-\alpha}$, that is, at
$\alpha_{\mathrm{opt}} = \dfrac{1}{2}\log\dfrac{1 - \epsilon(f)}{\epsilon(f)}$
AdaBoost = sequential minimization of exp-loss
(a) $f^{(t)} = \operatorname{argmin}_{f \in \mathcal F}\ \epsilon_t(f)$
(b) $\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb R}\ L_{\exp}(F_{t-1} + \alpha f^{(t)})$, i.e. $\alpha_t^{\mathrm{opt}} = \dfrac{1}{2}\log\dfrac{1 - \epsilon_t(f^{(t)})}{\epsilon_t(f^{(t)})}$
(c) $w_{t+1}(i) = w_t(i)\exp\{-\alpha_t y_i f^{(t)}(x_i)\}$
$\min_{\alpha \in \mathbb R} L_{\exp}(F_{t-1} + \alpha f^{(t)}) = 2\,L_{\exp}(F_{t-1})\sqrt{\epsilon_t(f^{(t)})\{1 - \epsilon_t(f^{(t)})\}}$
Bayes risk consistency
Expected loss functional: $L_{\exp}^{\mathrm I}(F) = \sum_{y = \pm 1}\int \exp\{-y F(x)\}\,dP(x,y) = \int \bigl\{\mathrm e^{-F(x)}\, p(+1 \mid x) + \mathrm e^{F(x)}\, p(-1 \mid x)\bigr\}\, p(x)\,dx$
Optimal discriminant function: $F_{\mathrm{opt}}(x) = \dfrac{1}{2}\log\dfrac{p(+1 \mid x)}{p(-1 \mid x)}$
$L_{\exp}^{\mathrm I}(F) - L_{\exp}^{\mathrm I}(F_{\mathrm{opt}}) = \int \bigl\{\mathrm e^{-F(x)}\, p(+1 \mid x) + \mathrm e^{F(x)}\, p(-1 \mid x) - 2\sqrt{p(+1 \mid x)\, p(-1 \mid x)}\bigr\}\, p(x)\,dx$
$= \int \bigl\{\mathrm e^{-F(x)/2}\sqrt{p(+1 \mid x)} - \mathrm e^{F(x)/2}\sqrt{p(-1 \mid x)}\bigr\}^2\, p(x)\,dx \ \ge\ 0$
A simulation
Feature space: $[-1,1] \times [-1,1]$
[Plot: decision boundary on the feature space]
Set of linear classifiers
Linear classifier
Random generation
Learning process
[Six panels showing the decision boundary as boosting proceeds]
Iter = 1, train err = 0.21; Iter = 13, train err = 0.18; Iter = 17, train err = 0.10;
Iter = 23, train err = 0.10; Iter = 31, train err = 0.095; Iter = 47, train err = 0.08
Final decision
[Two panels: contour of F(x); sign(F(x))]
U‐boost for classification
U-boost:
Let (x, y) be a variable with feature vector x and label y
Decision rule
Step 1.
Step 2.
Murata et al (2004)
U‐Boost for density estimation
$f^* = \operatorname{argmin}_{f \in \mathcal W_U^*} L_U(f)$
U-loss function: $L_U(f) = -\dfrac{1}{n}\sum_{i=1}^{n} f(x_i) + \int_{\mathbb R^d} U(f(x))\,dx$
Dictionary of density functions: $\mathcal W = \{\, g(x) : g(x) \ge 0,\ \int g(x)\,dx = 1 \,\}$
Learning space = U-model: $\mathcal W_U^* = \mathrm{co}(\xi(\mathcal W)) = \bigl\{\, \sum_k \pi_k\, \xi(g_k(x)) : g_k \in \mathcal W \,\bigr\}$
Goal: find $f^*$. Let $f(x,\pi) = \sum_k \pi_k\, \xi(g_k(x))$; then $f(x, (0,\ldots,0,1,0,\ldots,0)) = \xi(g_k(x))$.
U‐Boost algorithm
(A) Find $f_1 = \operatorname{argmin}_{g \in \mathcal W} L_U(\xi(g))$
(B) Update $f_{k+1} = (1 - \epsilon_{k+1})\, f_k + \epsilon_{k+1}\, \xi(g_{k+1})$, where
$(\epsilon_{k+1}, g_{k+1}) = \operatorname{argmin}_{\epsilon \in (0,1),\ g \in \mathcal W}\ L_U\bigl((1 - \epsilon) f_k + \epsilon\, \xi(g)\bigr)$
(C) Select $K$, and set $\hat f = (1 - \epsilon_K)\, f_{K-1} + \epsilon_K\, \xi(g_K)$
Example 2. Power entropy ($\beta$).
If $\beta = 0$: $f^*(x) \propto \exp\bigl\{\sum_k \pi_k \log g_k(x)\bigr\}$
If $\beta = 1$: $f^*(x) = \sum_k \pi_k\, g_k(x)$ (mixture model), Klemelä (2007)
Friedman et al (1984)
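For the β = 1 (mixture) case, update (B) reduces to greedily mixing in one dictionary density at a time under an L2-type loss; here is a minimal sketch (the Gaussian dictionary, grid integration, and step-size grid are my illustrative choices, not details from the talk):

    import numpy as np
    from scipy.stats import norm

    def boost_density(x, means, sigma=1.0, K=10,
                      eps_grid=np.linspace(0.05, 0.5, 10)):
        grid = np.linspace(x.min() - 4, x.max() + 4, 2000)
        dx = grid[1] - grid[0]
        G_data = [norm.pdf(x, m, sigma) for m in means]     # g at the sample
        G_grid = [norm.pdf(grid, m, sigma) for m in means]  # g on the grid
        # L2-type U-loss: integral of f^2 minus twice the empirical mean of f
        loss = lambda fd, fg: np.sum(fg ** 2) * dx - 2.0 * np.mean(fd)
        k0 = int(np.argmin([loss(d, g) for d, g in zip(G_data, G_grid)]))  # (A)
        fd, fg = G_data[k0].copy(), G_grid[k0].copy()
        for _ in range(K - 1):                              # (B) greedy updates
            best = None
            for d, g in zip(G_data, G_grid):
                for e in eps_grid:
                    cand = loss((1 - e) * fd + e * d, (1 - e) * fg + e * g)
                    if best is None or cand < best[0]:
                        best = (cand, d, g, e)
            _, d, g, e = best
            fd, fg = (1 - e) * fd + e * d, (1 - e) * fg + e * g
        return grid, fg                                     # density on the grid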
Learning?
Boosting is a forward stagewise algorithm
[Plot: true density]
Inner step in the convex hull
$\mathcal W^* = \mathrm{convex}(\mathcal W)$, with dictionary $\mathcal W = \{\, g(x,\theta) : \theta \,\}$
[Diagram: $f^*(x)$ lies inside the convex hull spanned by dictionary elements $g(x,\theta_1),\ldots,g(x,\theta_7)$]
$f^* = \operatorname{argmin}_{f \in \mathcal W^*} L_U(f)$
Goal: $\hat f(x) = \hat\epsilon_1\, \xi(\hat g_1(x)) + \cdots + \hat\epsilon_k\, \xi(\hat g_k(x))$
Loss decomposition
$L_U\bigl((1 - \epsilon) f_k + \epsilon\, \xi(g)\bigr) = (1 - \epsilon)\, L_U(f_k) + \epsilon\, L_U(\xi(g)) - \Delta_U(\epsilon, f_k, g)$
where $\Delta_U(\epsilon, f_k, g) = \int \bigl\{(1 - \epsilon)\, U(f_k(x)) + \epsilon\, U(\xi(g(x))) - U\bigl((1 - \epsilon) f_k(x) + \epsilon\, \xi(g(x))\bigr)\bigr\}\,dx$
Note 1. (1) $\Delta_U(\epsilon, f_k, g) \ge 0$; (2) $\Delta_U(\epsilon, f_k, g) = 0 \iff f_k = \xi(g)$ (a.e.)
Min problem: $\min_{g \in \mathcal W}\ \bigl\{\epsilon\, L_U(\xi(g)) - \Delta_U(\epsilon, f_k, g)\bigr\}$
Make $L_U(\xi(g))$ smaller and $\Delta_U(\epsilon, f_k, g)$ larger in $g$ simultaneously.
Non-asymptotic bound
Theorem (Naito and Eguchi, 2012). Assume that the data distribution has a density $p(x)$. Then
$\mathrm E_p\, D_U(p, \hat f_K) \ \le\ \mathrm{FA}(p, \mathcal W) + \mathrm{EE}(p, \mathcal W) + \mathrm{IE}(K)$
$\mathrm{FA}(p, \mathcal W) = \inf_{f \in \mathcal W_U^*} D_U(p, f)$ (functional approximation)
$\mathrm{EE}(p, \mathcal W) = 2\, \mathrm E_p \sup_{g \in \mathcal W}\Bigl|\dfrac{1}{n}\sum_{i=1}^{n} \xi(g(X_i)) - \mathrm E_p\, \xi(g(X))\Bigr|$ (estimation error)
$\mathrm{IE}(K) = \dfrac{b_U\, c_U}{K}$, where $b_U, c_U$ are constants (iteration effect)
Remark. Trade-off between $\mathrm{FA}(p, \mathcal W)$ and $\mathrm{EE}(p, \mathcal W)$.
Statistical applications
U-Boost, (kernel) U-PCA, U-SVM
U-ICA (mixture)
U-AUC
U-Boosting density
Robustness (outliers): (exp-model, U-loss)
Redescendency (hidden structure): (U-model, U-estimate)
U-cluster
Evolve (U-model, U-loss)
Drawback of Boosting
○
because
Early stopping (Zhang-Yu, 2005), Klemelä (2008)
Breiman (1996)
○
○
Meinshausen and Bühlmann (2011), Städler-Bühlmann-van de Geer (2010)
Poincaré conjecture 1904
Every simply connected, closed 3-manifold is homeomorphic to the 3-sphere.
Ricci flow by Richard Hamilton in 1981; Perelman 2002, 2003
Optimal control theory due to Pontryagin and Bellman
Optimal transport
Wasserstein space
Optimal transport
Theorem (Brenier, 1991)
Talagrand inequality
Log-Sobolev inequality
Optimal transport theory is being extended to general manifolds.
Geometers consider not a space, but a distribution family on the space.
C. Villani (2010): Fields Medal for work on optimal transport theory
Fisher's equation in 1937
Fisher-Kolmogorov equation
Ecology, physiology, combustion, crystallization, plasma physics, phase-transition problems; cf. Wikipedia (reaction-diffusion process)
Reaction–diffusion systems
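For reference, the Fisher-Kolmogorov (Fisher-KPP) equation is
$\frac{\partial u}{\partial t} = D\,\frac{\partial^2 u}{\partial x^2} + r\,u\,(1 - u),$
which admits travelling-wave solutions with speed $c \ge 2\sqrt{rD}$.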
Statistical inference for noisy nonlinear ecological dynamic systems
Statistical algorithm
Sufficient statistic
Ricker map
Sampling
Statistical parameter
Simulated curved normal model
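A minimal sketch of the Ricker-map simulator behind this example, following Wood (2010): the latent dynamics are $N_{t+1} = r N_t \mathrm e^{-N_t + e_t}$ with $e_t \sim N(0, \sigma_e^2)$ and observations $y_t \sim \mathrm{Poisson}(\phi N_t)$; the summary statistics below are illustrative placeholders, not Wood's exact set:

    import numpy as np

    def ricker(r, sigma_e, phi, T=100, N0=1.0, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        N = np.empty(T)
        N[0] = N0
        for t in range(T - 1):
            N[t + 1] = r * N[t] * np.exp(-N[t] + sigma_e * rng.normal())
        return rng.poisson(phi * N)        # observed counts y_1, ..., y_T

    def summaries(y):
        # summary statistics s(y); a Gaussian model for s(y) across simulations
        # yields the simulated ("synthetic") likelihood of (r, sigma_e, phi)
        return np.array([y.mean(), y.var(), np.mean(y[1:] * y[:-1]),
                         np.mean(y == 0)])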
Conclusions and Future directions
Two cultures
Optimal transport theory
Simulated likelihood
Simulation modeling
Monte Carlo search
MLE 1912
From log to power function
Dynamics and statistics
MCMC
Big data
Release Fisher’s mind control