The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds by Roger Bilisoly, PhD, Department of Mathematical Sciences, Central Connecticut State University, September 2nd, 2011


Page 1: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds

8/4/2019 The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds

http://slidepdf.com/reader/full/the-intersection-of-statistics-with-geometry-information-and-riemannian-manifolds 1/52

The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds

by Roger Bilisoly, PhD

Department of Mathematical Sciences

Central Connecticut State University
September 2nd, 2011


Overview of Talk

1. Geometric view of 1st year statistics: summary statistics, t-test, two sample t-test, correlation

2. Statistical inference, Fisher information, exponential families of PDFs, and Riemannian manifolds

3. Sampling data from manifolds


Geometry of 1st Year Topics: Summary Statistics

• Idea: represent data as vectors in Rn, where n = sample size. Many basic statistical ideas can then be cast geometrically. Let's look at summary statistics:

• The Mean
 – Let data = {x1, x2, x3, …, xn}.
 – We want to summarize this data set by one number, say x.
 – So model = {x, x, x, …, x}.
 – Use the L2 norm to minimize the distance between data and model.
 – It is enough to minimize distance^2.
 – distance^2 = (data – model)^2 = Sum[(x – xi)^2, {i, 1, n}]
 – Solution: x = mean of the data (just solve D[distance^2, x] == 0), which is an example of least squares.

• The Standard Deviation
 – Use the L2 norm again on data and model.
 – Variance = SSE/df = Sum[(xi – m)^2]/(n – 1) = distance^2/(n – 1), which is an example of a sum of squares (SS).
 – We'll see more about SS in ANOVA and regression.
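The slide's computations are in Mathematica; the same least-squares idea can be checked numerically in Python. A minimal sketch (the data values below are made up for illustration, not from the talk):

```python
import numpy as np

# Illustrative data (an assumption for this sketch; any numbers work)
data = np.array([2.0, 3.0, 5.0, 7.0])

# Model {x, x, ..., x}: squared L2 distance between data and model, as a function of x
def sq_distance(x):
    return np.sum((data - x) ** 2)

# Minimizing distance^2 over a fine grid recovers the sample mean
grid = np.linspace(data.min(), data.max(), 100001)
x_best = grid[np.argmin([sq_distance(x) for x in grid])]

print(x_best, data.mean())  # the L2 minimizer is the mean
```

Solving D[distance^2, x] == 0 analytically gives the same answer; the grid search just makes the "mean = closest constant vector" picture concrete.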


Geometry of 1st Year Topics: Summary Statistics (Cont.)

• The Median
 – Let data = {x1, x2, x3, …, xn}.
 – Let model = {x, x, x, …, x}.
 – Use the L1 norm to minimize the distance between data and model.
 – The solution: x = median (n odd).
 – What about n even? Any x between the two middle order statistics minimizes the L1 distance, so the usual convention is to take their average.
• The Midrange
 – As above, but use the L∞ norm to minimize the distance between data and model.
 – The solution: x = midrange = (mini xi + maxi xi)/2.

• L2 has an inner product
 – This allows the definition of cos(θ).
 – As seen on the last slide, L2 arises in sums of squares and least squares, which makes it ubiquitous in statistics.
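A quick numerical check of the median and midrange claims, sketched in Python rather than the deck's Mathematica (the data values are an assumption for illustration):

```python
import numpy as np

# Illustrative data, n odd so the L1 minimizer (median) is unique
data = np.array([1.0, 2.0, 4.0, 8.0, 9.0])
grid = np.linspace(0.0, 10.0, 100001)

# L1 distance as a function of the constant model value x -> minimized at the median
l1 = [np.sum(np.abs(data - x)) for x in grid]
# L-infinity distance -> minimized at the midrange
linf = [np.max(np.abs(data - x)) for x in grid]

x_l1 = grid[np.argmin(l1)]      # ~4.0, the median
x_linf = grid[np.argmin(linf)]  # ~5.0, the midrange (1 + 9)/2
print(x_l1, x_linf)
```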


1 Sample T-Test

• A subsample of 10 human body temperatures was taken from: http://www.amstat.org/publications/jse/v4n2/datasets.shoemaker.html

• H0: μ = 98.6 °F vs. H1: μ ≠ 98.6 °F
 – temps = {97.2, 98.7, 98.4, 98.2, 98.4, 98.2, 97.1, 96.7, 98.3, 98.0}
 – model = {98.6, 98.6, 98.6, 98.6, 98.6, 98.6, 98.6, 98.6, 98.6, 98.6}
 – Let u = {1,1,1,1,1,1,1,1,1,1}/Sqrt[10], so |u| = 1.
 – Projection of temps onto model = (temps.u)u = {1,1,1,1,1,1,1,1,1,1}*97.92
 – error = temps – (temps.u)u = {-0.72, 0.78, 0.48, 0.28, 0.48, 0.28, -0.82, -1.22, 0.38, 0.08}
 – s = Sqrt[SSE/df] = Sqrt[Fold[Plus, 0, error^2]/9] = 0.671317

The TTEST Procedure

Statistics
              Lower CL           Upper CL    Lower CL             Upper CL
Variable   N      Mean    Mean       Mean     Std Dev    Std Dev   Std Dev    Std Err
temps     10     97.44   97.92       98.4      0.4618     0.6713    1.2256     0.2123

T-Tests
Variable   DF   t Value   Pr > |t|
temps       9     -3.20     0.0108
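The SAS numbers above follow directly from the projection picture. A sketch in Python (standing in for the talk's Mathematica/SAS):

```python
import math

temps = [97.2, 98.7, 98.4, 98.2, 98.4, 98.2, 97.1, 96.7, 98.3, 98.0]
mu0 = 98.6
n = len(temps)

# u = (1,...,1)/sqrt(n); the projection of temps onto u is the constant vector of the mean
mean = sum(temps) / n
error = [t - mean for t in temps]   # temps - (temps.u)u
sse = sum(e * e for e in error)
s = math.sqrt(sse / (n - 1))        # 0.671317, the Std Dev in the SAS output

t_stat = (mean - mu0) / (s / math.sqrt(n))
print(mean, s, t_stat)  # 97.92, 0.671317, -3.20 as in the SAS output
```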


Picture of T-Test

Vectors labeled in the figure: temps, (temps.u)u, and error.

Angle between temps and u: ArcCos[temps.u/Norm[temps]]/Pi*180 = 0.373°

• Human body temp data
 – temps = {97.2, 98.7, 98.4, 98.2, 98.4, 98.2, 97.1, 96.7, 98.3, 98.0}
 – model = {98.6, 98.6, 98.6, 98.6, 98.6, 98.6, 98.6, 98.6, 98.6, 98.6}
 – Projection of temps on model = (temps.u)u = {1,1,1,1,1,1,1,1,1,1}*97.92
 – error = temps – (temps.u)u = {-0.72, 0.78, 0.48, 0.28, 0.48, 0.28, -0.82, -1.22, 0.38, 0.08}
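The ArcCos line above is a one-liner to verify; in Python:

```python
import math

temps = [97.2, 98.7, 98.4, 98.2, 98.4, 98.2, 97.1, 96.7, 98.3, 98.0]
n = len(temps)

# cos of the angle between temps and the unit vector u = (1,...,1)/sqrt(n)
temps_dot_u = sum(temps) / math.sqrt(n)
norm_temps = math.sqrt(sum(t * t for t in temps))
angle_deg = math.degrees(math.acos(temps_dot_u / norm_temps))
print(angle_deg)  # about 0.373 degrees
```

The tiny angle reflects the fact that body temperatures are nearly constant relative to their overall magnitude.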


Geometry of Comparing Body Temperatures by Gender with ANOVA (Long Example)

• Sample of 5 male and 5 female body temperatures taken from: http://www.amstat.org/publications/jse/v4n2/datasets.shoemaker.html
• The standard parametric test is the 2-sample t-test.
• data = {98.7, 99.4, 98.8, 98.2, 99.2, 98.0, 99.1, 98.8, 98.0, 98.2}
 – The first 5 values are females, the last 5 are males.
• ANOVA table from SAS given below (PROC ANOVA).
• Geometric aspect here: the first two rows of the table correspond to orthogonal subspaces of R10.

The ANOVA Procedure
Dependent Variable: temp

                            Sum of
Source             DF      Squares    Mean Square    F Value    Pr > F
Model               1   0.48400000     0.48400000       2.06    0.1892
Error               8   1.88000000     0.23500000
Corrected Total     9   2.36400000

(The Error sum of squares in the table is the SSE.)
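The whole ANOVA table can be rebuilt from the projection picture. A Python sketch (in place of the deck's SAS/Mathematica):

```python
data = [98.7, 99.4, 98.8, 98.2, 99.2, 98.0, 99.1, 98.8, 98.0, 98.2]
females, males = data[:5], data[5:]

grand = sum(data) / 10
f_mean = sum(females) / 5
m_mean = sum(males) / 5

# Squared length of the projection onto the gender-contrast subspace
ss_model = 5 * (f_mean - grand) ** 2 + 5 * (m_mean - grand) ** 2  # 0.484
# Squared length of the residual (error subspace) component
sse = (sum((x - f_mean) ** 2 for x in females)
       + sum((x - m_mean) ** 2 for x in males))                   # 1.88

f_value = (ss_model / 1) / (sse / 8)
print(ss_model, sse, f_value)  # 0.484, 1.88, F about 2.06 as in PROC ANOVA
```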


A Basis for R10

The first row of basis represents the overall mean.
The second row of basis represents the contrast of males vs. females.
The last 8 rows span the error subspace.

Next we'll perform Gram-Schmidt via Mathematica's Orthogonalize[].


An Orthonormal Basis for R10 

As before, the first row of orthonormal represents the overall mean.

The second row of orthonormal represents the contrast of males vs. females.

The last 8 rows span the error subspace.
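The same orthonormalization can be sketched in Python, with QR factorization playing the role of Orthogonalize[] (the 8 fill vectors below are an arbitrary choice completing the spanning set, an assumption of this sketch):

```python
import numpy as np

# Rows: overall-mean direction, male-vs-female contrast, then 8 fill vectors
basis = np.vstack([
    np.ones(10),
    np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1], dtype=float),
    np.eye(10)[1:9],   # e2..e9; together with the first two rows these span R^10
])

# QR orthonormalizes the columns in order, Gram-Schmidt style
Q, _ = np.linalg.qr(basis.T)

data = np.array([98.7, 99.4, 98.8, 98.2, 99.2, 98.0, 99.1, 98.8, 98.0, 98.2])
# Squared length of the projection onto the (orthonormalized) contrast direction
ss_model = float(np.dot(data, Q[:, 1]) ** 2)
print(ss_model)  # 0.484, the Model sum of squares from the ANOVA table
```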


Projecting into the following three subspaces: mean, model, and error

Sample of 5 male and 5 female body temp’s taken from:

http://www.amstat.org/publications/jse/v4n2/datasets.shoemaker.html  

We have modeled:

yij = μ + genderi + errorij

where R10 is broken into three orthogonal subspaces, one for each term in the model.

y11 = 98.7 = 98.64 + 0.22 + (0.015 + 0.225 – 0.05 – 0.35), and so forth.


The ANOVA Table Represents an Orthogonal Decomposition

The ANOVA Procedure
Dependent Variable: temp

                            Sum of
Source             DF      Squares    Mean Square    F Value    Pr > F
Model               1   0.48400000     0.48400000       2.06    0.1892
Error               8   1.88000000     0.23500000
Corrected Total     9   2.36400000

ssModel = sum of the squares of the data projected into the model subspace.
Since the projection = {0.22, 0.22, 0.22, 0.22, 0.22, -0.22, -0.22, -0.22, -0.22, -0.22},
ssModel = 0.22^2 + 0.22^2 + … + (-0.22)^2 = 10*0.0484 = 0.484.


The Schematic Picture for the Gender Body Temperatures ANOVA Model

See Figure 6.3 on page 105 of Saville and Wood (1991) for another example of ANOVA. It also has examples of more complex experimental designs, regression, and ANCOVA.

Remember that:
data = {98.7, 99.4, 98.8, 98.2, 99.2, 98.0, 99.1, 98.8, 98.0, 98.2}
mean = {1,1,1,1,1,1,1,1,1,1}/Sqrt[10]
model = {1,1,1,1,1,-1,-1,-1,-1,-1}/Sqrt[10]

Vectors labeled in the schematic: data, (data.mean)mean, (data.model)model, and the error vector.


Regression of Log(Distance) of Planets vs. Rank

• Bode's Law
 – Claim that log(distance) can be predicted by rank.
 – Includes Ceres (now a dwarf planet).
 – Excludes Pluto (now a dwarf planet).


Least Squares Solution = Minimizing Distance(data, model) in R9

Figure labels: Log(Dist) data, x values, Model, Objective function.

Regression model: Log(distance) = a0 + a1 x
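The fit can be sketched in Python. The distances below are standard semi-major axes in AU for the nine bodies (Mercury through Neptune, with Ceres and without Pluto) — an assumption of this sketch, since the slide's numeric table is a figure:

```python
import math

# Assumed semi-major axes in AU (Mercury..Mars, Ceres, Jupiter..Neptune)
dist = [0.39, 0.72, 1.00, 1.52, 2.77, 5.20, 9.54, 19.2, 30.1]
rank = list(range(1, 10))
y = [math.log10(d) for d in dist]

n = len(y)
xbar = sum(rank) / n
ybar = sum(y) / n
sxy = sum((x - xbar) * (yi - ybar) for x, yi in zip(rank, y))
sxx = sum((x - xbar) ** 2 for x in rank)
syy = sum((yi - ybar) ** 2 for yi in y)

a1 = sxy / sxx             # slope in Log(distance) = a0 + a1 x
a0 = ybar - a1 * xbar
r = sxy / math.sqrt(sxx * syy)
print(a0, a1, r)           # r comes out near the slide's 0.9966
```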


2nd Point of View: Project Data onto Model in R9

Figure labels: Model direction, Log(Dist) data, Projection, Prediction.


Correlation of LogDist vs. Rank (Bode's Law Data)

R = 0.9966


Correlation = cos(θ) in R9

r = Σ (xi – x̄)(yi – ȳ) / ((n – 1) sx sy) = (1/(n – 1)) Σ [(xi – x̄)/sx] [(yi – ȳ)/sy]

Note that the RHS is almost the mean of the products of the z-scores of the data.

One can show that, as vectors in Rn, the z-score vectors have norm Sqrt[df] = Sqrt[n – 1], so r = the dot product in Rn of the corresponding unit vectors = cos(θ):

r = Σ [(xi – x̄)/(Sqrt[n – 1] sx)] [(yi – ȳ)/(Sqrt[n – 1] sy)] = cos(θ)


Apply r = cos(θ) to Bode's Law

Pearson correlation = (dot product of the z-scores)/(n – 1) = cos(θ)

θ = angle between the LogDist and rank vectors
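The r = cos(θ) identity is easy to verify numerically; this Python sketch reuses the same assumed Bode's-law distances as before:

```python
import math

# Assumed semi-major axes in AU, as in the earlier regression sketch
dist = [0.39, 0.72, 1.00, 1.52, 2.77, 5.20, 9.54, 19.2, 30.1]
x = list(range(1, 10))
y = [math.log10(d) for d in dist]

def unit_zscores(v):
    m = sum(v) / len(v)
    s = math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))
    # z-score vector has norm sqrt(n-1); divide by it to get a unit vector
    return [(vi - m) / (s * math.sqrt(len(v) - 1)) for vi in v]

ux, uy = unit_zscores(x), unit_zscores(y)
cos_theta = sum(a * b for a, b in zip(ux, uy))  # this IS the Pearson r
theta_deg = math.degrees(math.acos(cos_theta))
print(cos_theta, theta_deg)  # r near 0.9966, a small angle of a few degrees
```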


And so forth … 

• The above examples don't exhaust the connections between statistics and geometry.
 – For example, PCA rotates point clouds in n-dimensional space, where n = number of variables.
• But let us consider another geometric point of view …


Statistical Inference, Fisher Information, Exponential Families of PDFs, and Riemannian Manifolds

• It turns out that one can make deeper connections between statistics and geometry.
 – Probability distributions can be represented by differentiable manifolds.
 – Some types of data can be represented by differentiable manifolds.
• However, this connection requires some theory.
• Let's start with an overview to see how this connection is going to be made.


The Big Picture

• Hypothesis testing projects a data set onto a parameterized family of probability distribution functions (pdf's), say F.
 – Call the best fit f.
• A subset of F, say F0, is called the null hypothesis, H0.
• If f is close enough to F0, then we don't reject H0; otherwise we reject H0.

The above outline raises some questions:
 What families F are good to use?
 What subsets F0 are interesting?
 How do we project data onto F?
 How do we measure the distance between two pdf's?


1. Statistical Estimation

• One approach is to believe that the data are sampled from a pdf that is parameterized.
 – Example: one can assume the data come from a normal distribution with mean μ and variance σ^2.
• Statistics are functions of the sample that estimate parameters.
• What properties should a summary statistic have?
 – There is no agreement on this.
• Consider unbiasedness:
 – For a normal population, E(sample mean) = μ.
 – However, E(s) = c4(n)σ, where c4(n) ≈ 1 – 1/(4n).*
 – Let Xi ~ Binomial(n, p); then there is no unbiased estimator of 1/p.

*See http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation 
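The exact bias factor, from the Wikipedia article cited above, is c4(n) = Sqrt[2/(n – 1)] Γ(n/2)/Γ((n – 1)/2); a quick Python check of how close the 1 – 1/(4n) approximation is:

```python
import math

# Exact bias factor for the sample standard deviation of normal data:
# E(s) = c4(n) * sigma (formula from the cited Wikipedia article)
def c4(n):
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

for n in (5, 10, 100):
    print(n, c4(n), 1 - 1 / (4 * n))  # the approximation is close even for small n
```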


Maximum Likelihood (ML) Methods

• Here we'll focus on ML, which has nice mathematical properties and also performs well in practice.
• Assume that independent data come from the pdf f(x|θ).
 – θ = vector of parameters
 – xi ~ f(x|θ)
 – x = (x1, x2, …, xn) ~ ∏ f(xi|θ) = joint pdf
• Likelihood = L(θ|x) = ∏ f(xi|θ)
 – The likelihood is the joint pdf considered as a function of θ.
 – L(θ|x) can be viewed as a manifold, but more on this later.
• ML finds the θmax that maximizes L(θ|x).
 – θmax is the most likely value to produce the observed data.
• θmax is our estimate of the population parameters θ.


Example: Likelihood for Normal Human Body Temp.

Figure: the normal likelihood for the body temperature data (same data as before). The MLE is at the peak of the likelihood curve.


MLE: μ=97.92, σ=0.637 

For the normal distribution, the MLEs for μ and σ are the sample mean and the population standard deviation (i.e., the version that divides by n rather than n – 1).

Solving this numerically involves the geometry of hill-climbing. Gradient ascent iterates toward the maximum by moving in the direction of the gradient.

In practice, gradient ascent can converge slowly. Here a second-order method would converge much more quickly. See http://en.wikipedia.org/wiki/Gradient_descent


Good Properties of MLEs

• Below, θ is a vector of parameters for a family of pdf's.
• Theorem. If θmax is the MLE of θ, then g(θmax) is an MLE of g(θ).
• Theorem. If f(x|θ) is "well behaved," then θmax is consistent. That is, θmax → θtrue almost surely.
• Theorem. If f(x|θ) is "well behaved," then θmax is asymptotically normal: Sqrt[n](θmax – θtrue) → N(0, I^-1) as n → ∞, where I is the Fisher information matrix.
 – Because this normal distribution has mean 0, θmax is asymptotically unbiased.

See Wikipedia.org for more info: http://en.wikipedia.org/wiki/Maximum_likelihood 


2. Fisher Information

• The last slide said that Sqrt[n](θmax – θtrue) → N(0, I^-1), where I is the Fisher information. So what is I? Why is it "informative"? For simplicity, let θ be a scalar:

I(θ) = Var(∂/∂θ log f(X|θ))
     = E[(∂/∂θ log f(X|θ))^2]
     = –E[∂^2/∂θ^2 log f(X|θ)]
     = –∫ (∂^2/∂θ^2 log f(x|θ)) f(x|θ) dx


• As defined in information theory, H(X) is the entropy of the random variable X:

H(X) = –∫ f(x|θ) log f(x|θ) dx

• So I(X) = –∂^2/∂θ^2 H(X).
• However, why is H(X) a reasonable measure of information?
 – Let's consider the case of a discrete random variable, X.
 – Then H(X) = –Σi pi log(pi).
 – Information theory uses an axiomatic approach and shows that entropy gives a unique functional form. See Section 6 of Shannon (1948).
 – Statistical mechanics derives H(X) from counting microstates using the multinomial coefficient and Stirling's approximation.
   • log(C) = log(N!/∏i (Npi)!) ≈ –N Σi pi log(pi). See the section on entropy in Chapter 8 of Ambegaokar (1996).
   • Applying this to real physics (e.g., gases) requires ergodic theory.


Cramér-Rao Lower Bound (CRLB): Another Way to View I as Informative

• Let γ(X) be an unbiased estimator of g(θ).
• Let αi = ∂/∂θi E(γ(X)).
• Let Iij = –E(∂^2/∂θi ∂θj log f(x|θ)), where x = the sample vector.
 – The matrix I is called the Fisher information matrix.
• Then Var(γ(X)) ≥ αᵀ I^-1 α is the CRLB.
• Variance measures uncertainty: the lower the variance, the better the accuracy of the estimator γ(X) of g(θ).
 – The closer Var(γ(X)) is to the CRLB, the better γ(X) is.


Example of the CRLB: Exponential Distribution

• Let Xi ~ Exponential(θ), with pdf f(x|θ) = e^(-x/θ)/θ, and suppose we want to estimate θ.
• log f(x|θ) = -x/θ – log(θ)
• ∂/∂θ log f(x|θ) = x/θ^2 – 1/θ = (x – θ)/θ^2
• I1 = E[((x – θ)/θ^2)^2] = (1/θ^4) E[(x – θ)^2] = θ^2/θ^4 = 1/θ^2, which is the information contained in 1 data point. So I = nI1 = n/θ^2.
• Hence the CRLB = I^-1 = θ^2/n.
• Since the sample mean is unbiased with variance θ^2/n, it attains the bound and is the uniformly minimum variance unbiased estimator (UMVUE) of θ.
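The identity I1 = 1/θ^2 can be checked by Monte Carlo; a Python sketch (sample size and seed are arbitrary choices for this illustration):

```python
import random
import statistics

# Monte Carlo check that I_1 = E[((x - theta)/theta^2)^2] = 1/theta^2
# for the Exponential(theta) density e^(-x/theta)/theta (mean theta)
random.seed(42)
theta = 2.0
N = 200_000
scores = [((random.expovariate(1 / theta) - theta) / theta**2) ** 2
          for _ in range(N)]
i1_hat = statistics.fmean(scores)
print(i1_hat, 1 / theta**2)  # the estimate should be close to 0.25
```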


3. The Exponential Family of PDFs

• All the usual parametric distributions seen in a first-year stats course come from the exponential family of pdf's.
• This family has special properties, including special geometric properties.
• Define f(x|θ) = exp(Σi ηi(θ) Ti(x) – A(θ)) g(x), where the sum is over the parameters.
• Example: Binomial distribution
 – For n fixed, the Binomial(n, p) pmf is nCx p^x (1 – p)^(n – x)
   = nCx (p/(1 – p))^x (1 – p)^n = exp(x log(p/(1 – p)) + n log(1 – p)) nCx
 – So set T(x) = x, η(p) = log(p/(1 – p)), A(p) = -n log(1 – p), g(x) = nCx.
• Example: Normal distribution
 – We can write the iid normal pdf as a 2-parameter exponential family:
 – exp((μ/σ^2) Σ xi – (1/(2σ^2)) Σ xi^2 – nμ^2/(2σ^2)) · (1/(Sqrt[2π] σ))^n
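The binomial algebra above is easy to verify numerically; a Python check (n and p are arbitrary values chosen for the test):

```python
import math

# Check that nCx p^x (1-p)^(n-x) equals the exponential-family form
# exp(x*log(p/(1-p)) + n*log(1-p)) * nCx for every x
n, p = 12, 0.3
for x in range(n + 1):
    pmf = math.comb(n, x) * p**x * (1 - p) ** (n - x)
    expfam = (math.exp(x * math.log(p / (1 - p)) + n * math.log(1 - p))
              * math.comb(n, x))
    assert math.isclose(pmf, expfam)
print("binomial pmf matches its exponential-family form")
```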


Sufficient Statistics

• The natural parameterization of an exponential family pdf is:
 – f(x|η) = exp(Σi ηi Ti(x) – A(η)) g(x)   (*)
 – That is, the ηi themselves are the parameters.
• For a parametric family of pdf's parameterized by θ, a statistic T(X) is sufficient for θ if the distribution of X given T(X) is independent of θ.
 – The intuition is that T(X) is as informative as X with respect to the parameters θ.
• Theorem. For X with a full-rank exponential family pdf with natural parameterization (*), T = (T1, T2, …, Ts) is sufficient.
 – Σ xi is sufficient for p for iid binomial data.
 – Σ xi and Σ xi^2 are sufficient for (μ, σ^2) for iid normal data.


4. Combining 1, 2, and 3 to Create Riemannian Manifolds

• In part 1, we considered the likelihood L(θ|x) = ∏ f(xi|θ) as a function to optimize over θ given the data {xi}.
• In part 2, we introduced the Fisher information matrix with entries Iij = –E(∂^2/∂θi ∂θj log f(x|θ)).
• In part 3, we noted that the exponential family of pdf's is important: f(x|θ) = exp(Σi ηi(θ) Ti(x) – A(θ)) g(x).
• In this part, we consider likelihoods of exponential family pdf's as Riemannian manifolds, with the metric given by the Fisher information matrix.
• Why Riemannian manifolds?
 – The metric allows us to measure distances intrinsically.
 – Hypothesis testing boils down to measuring distances to competing hypotheses.
 – All the theory has been worked out.


Example: Extrinsic View of the Unit Sphere

Here we map (-π/2, π/2) × (0, 2π) → R3 using spherical coordinates: r = 1, latitude, and longitude.

The surface is 2-dimensional: (latitude, longitude), but the graphical representation is 3-dimensional.

The intrinsic point of view means imagining a 2-dimensional space where distances and areas are not Euclidean.


Example: Intrinsic View of the Unit Sphere

Sphere = [0, 2π] × [-π/2, π/2] = a closed rectangle.

Let v = latitude and u = longitude.

Metric for the sphere: ds^2 = E du^2 + 2F du dv + G dv^2 = Cos[v]^2 du^2 + dv^2.

Area element for the sphere: dA = Sqrt[EG – F^2] du dv = Cos[v] du dv.

What is the surface area of the sphere?
∫ (u = 0 to 2π) ∫ (v = -π/2 to π/2) cos(v) dv du = 4π

What is the length of the equator? The equator is defined by v = 0, so dv = 0:
∫ (u = 0 to 2π) du = 2π
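The area integral is a nice sanity check to run numerically; a midpoint-rule sketch in Python:

```python
import math

# Midpoint-rule integration of the area element dA = cos(v) du dv
# over u in [0, 2*pi), v in [-pi/2, pi/2]; the answer should be 4*pi
nu, nv = 400, 400
du, dv = 2 * math.pi / nu, math.pi / nv
area = 0.0
for i in range(nu):
    for j in range(nv):
        v = -math.pi / 2 + (j + 0.5) * dv
        area += math.cos(v) * du * dv
print(area, 4 * math.pi)
```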


Example of a Statistical Manifold

• The normal distribution's likelihood is
 – exp((μ/σ^2) Σ xi – (1/(2σ^2)) Σ xi^2 – nμ^2/(2σ^2)) · (1/(Sqrt[2π] σ))^n
• The manifold is a half-plane:
 – (μ, σ^2) ∈ (-∞, ∞) × (0, ∞)
• Metric = Fisher information (n = 1):
 – I(μ, σ) = (1/σ^2) [[1, 0], [0, 2]]
• Let var = σ^2. Then:
 – ds^2 = dμ^2/var + d(var)^2/(2 var^2)
 – This has constant Riemannian curvature.
 – See Skovgaard (1984).
• According to p. 553 of Gray (1993), this is called the generalized Poincaré metric on the upper half-plane.


5. Example 1, p. 1193, of Efron (1975): An Application of Statistical Curvature

Example 1 of Efron (1975): bivariate normal with a fixed coefficient of variation.

Let X ~ N(η, I), where η = (θ, (c/2)θ^2). Efron defines the statistical curvature γθ as a function of a matrix M built from the first two derivatives of η with respect to θ; the explicit formula, a function of c and θ, is given on p. 1193 of Efron (1975).


Sampling Data from Riemannian Manifolds

• Not all data is Euclidean!
• For instance, directional data isn't Euclidean.
 – Angles are cyclic: 350° + 20° = 10°.
• A small range of angles can be treated with the usual techniques.
 – Example: geological faults can be strongly aligned.
• A large range of angles requires a different approach.
 – Example: wind directions vary greatly over a large area.


http://earthquake.usgs.gov/earthquakes/recenteqscanv/FaultMaps/122-38.html 

Accessed 8/19/2011.

http://www.iwindsurf.com/windandwhere.iws?regionID=193&regionProductID=2&timeoffset=1 

Accessed 8/19/2011.

Two examples of directional data.


N. Fisher and Circular Data

• The following is from section 2.3.1 of Fisher (1993).
• Suppose data = {θ1, θ2, …, θn}, where the angles are measured in radians.
• He defines the mean direction, θmean, as follows:
 – Let C = Σ cos(θi) and S = Σ sin(θi).
 – Let R^2 = C^2 + S^2.
 – Let cos(θmean) = C/R and sin(θmean) = S/R.
 – Then θmean = arctan(S/C) for S > 0, C > 0, or
 – θmean = arctan(S/C) + π for C < 0, or
 – θmean = arctan(S/C) + 2π for S < 0, C > 0.
• The above computes the direction of the vector sum of the unit vectors (cos(θi), sin(θi)).
 – This makes this approach extrinsic.


Intrinsic Approach as stated in Krakowski (2002)

From section 4.1, page 47, of Krzysztof Krakowski's dissertation from the Department of Mathematics and Statistics at The University of Western Australia:

Let M be a complete Riemannian manifold with distance function d. Let Q be a finite sample of points from M.

Definition 4.1.1. Let ΨQ : M → R be the function

ΨQ(x) = (1/|Q|) Σ (over q in Q) d(x, q)^2

Define the Riemannian variance σ^2(Q) as the global minimum of ΨQ. The Riemannian mean is the set of points at which ΨQ = σ^2(Q). It can be shown that the Riemannian mean is always non-empty.


Application to the Movements of Sea Stars

• Let's compare Fisher's θmean to the Riemannian mean using the following angle data (in degrees):
 – {0, 1, 3, 3, 8, 13, 16, 18, 30, 31, 43, 45, 147, 298, 329, 332, 335, 340, 350, 354, 356, 357}
• θmean = 3.1°
• ListPlot of {Cos[θ], Sin[θ]} below.

Values are from data set B.11 on p. 245 of Fisher (1993).
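Fisher's mean direction for this data set is a few lines of Python:

```python
import math

# Sea-star movement directions in degrees (data set B.11 of Fisher (1993))
angles_deg = [0, 1, 3, 3, 8, 13, 16, 18, 30, 31, 43, 45, 147,
              298, 329, 332, 335, 340, 350, 354, 356, 357]

C = sum(math.cos(math.radians(a)) for a in angles_deg)
S = sum(math.sin(math.radians(a)) for a in angles_deg)
# atan2 handles Fisher's three quadrant cases in one call
theta_mean = math.degrees(math.atan2(S, C)) % 360
print(theta_mean)  # about 3.1 degrees, as on the slide
```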


Riemannian Mean

• For the Riemannian mean we minimize the following (angles in radians):
 – f[x_, data_] := Mean[Map[If[# > Pi, (2 Pi - #)^2, #^2] &, Abs[data - x]]]

The minimum is at 7.6° (about 0.134 radians), compared to 3.1° for the extrinsic mean.
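The Mathematica function above translates directly to Python; a grid-search sketch (the grid resolution is an arbitrary choice):

```python
import math

angles_deg = [0, 1, 3, 3, 8, 13, 16, 18, 30, 31, 43, 45, 147,
              298, 329, 332, 335, 340, 350, 354, 356, 357]
data = [math.radians(a) for a in angles_deg]

# Mean squared geodesic (arc-length) distance on the circle,
# mirroring the slide's f[x_, data_]
def mean_sq_dist(x):
    total = 0.0
    for t in data:
        d = abs(t - x) % (2 * math.pi)
        d = min(d, 2 * math.pi - d)
        total += d * d
    return total / len(data)

# Grid search for the minimizer (the Riemannian/intrinsic mean)
grid = [k * 2 * math.pi / 72000 for k in range(72000)]  # 0.005-degree steps
best = min(grid, key=mean_sq_dist)
print(math.degrees(best))  # near 7.6-7.7 degrees, vs. 3.1 for the extrinsic mean
```

The gap between the two means comes from the outlier at 147°, which the intrinsic distance penalizes along the circle rather than through it.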

Page 44: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds


Last Example: Applying Manifolds to Multivariate Data

• Suppose there is a multivariate data set, say body measurements of a sample of n men (see next slide).

• Goal: Analyze typical body shape.

• The vector of averages of each variable need not be a typical body shape.

• Let’s apply a Riemannian manifold approach based on

the work of David Kendall.

 – See Kendall (1984) and Kendall (1989).

Page 45: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds


Transforming the Data

• Since analyzing overall shape is the goal, for each male’s data vector, translate so that the mean is zero and rescale so that the variance is 1.
 – (x_i1, x_i2, …, x_ik) → (z_i1, z_i2, …, z_ik) s.t. Σ_j z_ij = 0 and Σ_j z_ij² = 1, for all i.
• Let z_i = (z_i1, z_i2, …, z_ik); then {z_i} is a sample of n points on the intersection of the (k-1)-sphere S^(k-1) and the plane Σ_j z_ij = 0, which is a (k-2)-sphere, S^(k-2).
• S^(k-2) is a manifold, and so we can say a typical body is given by the Riemannian mean, with variability measured by the Riemannian standard deviation.
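The transformation can be sketched in Python (not from the talk; the sample measurement vector is made up for illustration):

```python
import numpy as np

def to_preshape(x):
    """Center a measurement vector to mean 0, then scale it to unit norm.
    The result lies on the intersection of the unit sphere with the
    plane sum(z) = 0."""
    z = np.asarray(x, dtype=float)
    z = z - z.mean()             # sum_j z_j = 0
    z = z / np.linalg.norm(z)    # sum_j z_j^2 = 1
    return z

# Hypothetical neck/abdomen/thigh measurements (cm) for one subject.
z = to_preshape([38.0, 92.0, 59.0])
print(z.sum())           # ~0
print((z ** 2).sum())    # ~1
```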

Page 46: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds


We Take a 3-Variable Subset of the Full Data Set

We restrict ourselves to just the neck, abdomen and thigh measurements.

Page 47: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds


Plot of Transformed Variables vs. Plots of Original Variables

Since we started with k = 3 variables, the result is a circle (S^1), so we can use circular data methods for an approximate answer.

Two 3D scatterplots of the original measurements.

Page 48: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds


References

• Ambegaokar (1996). Reasoning about Luck: Probability and Its Uses in Physics, Cambridge.
• Bradley Efron (1975). “Defining the Curvature of a Statistical Problem (with Applications to Second Order Efficiency),” The Annals of Statistics, 3, 1189-1242.
• N. I. Fisher (1993). Statistical Analysis of Circular Data, Cambridge.
• Alfred Gray (1993). Modern Differential Geometry of Curves and Surfaces, CRC.
• David Kendall (1984). “Shape Manifolds, Procrustean Metrics, and Complex Projective Spaces,” Bulletin of the London Mathematical Society, 16, 81-121.
• David Kendall (1989). “A Survey of the Statistical Theory of Shape,” Statistical Science, 4, 87-120.
• Krzysztof Krakowski (2002). Geometrical Methods of Inference. Dissertation, Department of Mathematics and Statistics, The University of Western Australia.
• E. L. Lehmann (1983). Theory of Point Estimation, Wiley.
• David Saville and Graham Wood (1991). Statistical Methods: The Geometric Approach, Springer.
• C. E. Shannon (1948). “A Mathematical Theory of Communication,” The Bell System Technical Journal, 27, 379-423.
• Lene Skovgaard (1984). “A Riemannian Geometry of the Multivariate Normal Model,” Scandinavian Journal of Statistics, 11, 211-223.

Page 49: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds


References: Web Pages

• Normal human body temperature data
 – http://www.amstat.org/publications/jse/v4n2/datasets.shoemaker.html
 – http://en.wikipedia.org/wiki/Normal_human_body_temperature
• Bias of the sample standard deviation
 – http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation
• Gradient ascent
 – http://en.wikipedia.org/wiki/Gradient_descent
• Properties of MLEs
 – http://en.wikipedia.org/wiki/Maximum_likelihood
• Examples of directional data
 – http://earthquake.usgs.gov/earthquakes/recenteqscanv/FaultMaps/122-38.html
 – http://www.iwindsurf.com/windandwhere.iws?regionID=193&regionProductID=2&timeoffset=1
• Examples of circular distributions
 – http://en.wikipedia.org/wiki/Wrapped_normal_distribution
 – http://en.wikipedia.org/wiki/Von_Mises_distribution

Page 50: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds


Differential Geometry and Statistics … 

• Interest in this connection is old:
 – Dates back to a paper by Rao (1945), “Information and accuracy attainable in the estimation of statistical parameters.”
• Research took off in the late 1970s and 1980s:
 – Shun-ichi Amari and Ole Barndorff-Nielsen were active.
• Books have been written:
 – Differential Geometry in Statistical Inference by Amari (1987)
 – Differential Geometry and Statistics by Murray and Rice (1993)
• The field, called information geometry, is active now:
 – Methods of Information Geometry by Amari and Nagaoka (2007)
• BUT the results don’t seem to be applied to practical problems.

Page 51: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds


What Distribution is Best for Circular Data?

• Fisher (1993) suggests several distributions, but the following two are the most popular:

 – Wrapped normal distribution

 – von Mises distribution

Wrapped normal density:

 f(θ; μ, σ) = (1/(σ√(2π))) Σ_{k=−∞}^{∞} exp(−(θ − μ + 2πk)² / (2σ²))

von Mises density:

 f(θ; μ, κ) = exp(κ cos(θ − μ)) / (2π I₀(κ)), where I₀ is the modified Bessel function of order 0.

See http://en.wikipedia.org/wiki/Wrapped_normal_distribution and
http://en.wikipedia.org/wiki/Von_Mises_distribution.

However, neither of these has all the properties one expects from working with the normal distribution with Euclidean data.
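A quick numerical check of the two densities above (a sketch, not from the talk): each should integrate to 1 over [0, 2π). The wrapped-normal sum is truncated at |k| ≤ 10, which is ample for σ = 1, and I₀ is computed from its standard power series.

```python
import math

def wrapped_normal_pdf(theta, mu=0.0, sigma=1.0, kmax=10):
    """Wrapped normal density: sum the N(mu, sigma^2) density over 2*pi shifts."""
    c = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return c * sum(math.exp(-(theta - mu + 2 * math.pi * k) ** 2 / (2 * sigma ** 2))
                   for k in range(-kmax, kmax + 1))

def von_mises_pdf(theta, mu=0.0, kappa=1.0):
    """von Mises density with normalizing constant 2*pi*I_0(kappa)."""
    i0 = sum((kappa / 2) ** (2 * m) / math.factorial(m) ** 2 for m in range(30))
    return math.exp(kappa * math.cos(theta - mu)) / (2 * math.pi * i0)

# Riemann-sum check that each density integrates to ~1 over [0, 2*pi).
n = 10000
h = 2 * math.pi / n
wn_total = sum(wrapped_normal_pdf(i * h) for i in range(n)) * h
vm_total = sum(von_mises_pdf(i * h) for i in range(n)) * h
print(round(wn_total, 4), round(vm_total, 4))  # both ~1.0
```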

Page 52: The Intersection of Statistics with Geometry, Information, and Riemannian Manifolds


Computation of Fisher Information for the Normal Distribution