
Object Orie’d Data Analysis, Last Time

• Mildly Non-Euclidean Spaces

• Strongly Non-Euclidean Spaces
  – Tree spaces
  – No Tangent Plane

• Classification - Discrimination
  – Mean Difference (Centroid) method
  – Fisher Linear Discrimination
    • Graphical Viewpoint (nonparametric)
    • Maximum Likelihood Derivation (Gaussian based)

Classification - Discrimination

Background: Two Class (Binary) version:

Using “training data” from Class +1 and Class -1,

develop a “rule” for assigning new data to a Class

Canonical Example: Disease Diagnosis

• New Patients are “Healthy” or “Ill”

• Determined based on measurements

Classification - Discrimination

Important Distinction: Classification vs. Clustering

Classification: Class labels are known,
Goal: understand differences

Clustering:
Goal: Find class labels (to be similar)

Both are about clumps of similar data, but with much different goals

Classification - Discrimination

Important Distinction:

Classification vs. Clustering

Useful terminology:

Classification: supervised learning

Clustering: unsupervised learning

Fisher Linear Discrimination

Graphical Introduction (non-Gaussian):

Classical Discrimination

Above derivation of FLD was:

• Nonstandard

• Not in any textbooks(?)

• Nonparametric (don’t need Gaussian data)

• I.e. Used no probability distributions

• More Machine Learning than Statistics

Classical Discrimination

Summary of FLD vs. GLR:

• Tilted Point Clouds Data
  – FLD good
  – GLR good

• Donut Data
  – FLD bad
  – GLR good

• X Data
  – FLD bad
  – GLR OK, not great

Classical Conclusion: GLR generally better

(will see a different answer for HDLSS data)

Classical Discrimination

FLD Generalization II (Gen. I was GLR):
Different prior probabilities

Main idea: Give different weights to the 2 classes
• I.e. assume not a priori equally likely
• Development is “straightforward”
• Modified likelihood
• Change intercept in FLD
• Won’t explore further here

Classical Discrimination

FLD Generalization III:
Principal Discriminant Analysis

• Idea: FLD-like approach to > two classes

• Assumption: Class covariance matrices are the same (similar)
  (but not Gaussian, same situation as for FLD)

• Main idea: Quantify “location of classes” by their means $\mu_1, \mu_2, \ldots, \mu_k$

Classical Discrimination

Principal Discriminant Analysis (cont.)

Simple way to find “interesting directions” among the means: PCA on the set of means

i.e. Eigen-analysis of the “between class covariance matrix”:

$\Sigma_B = \frac{1}{k-1} \sum_{j=1}^{k} (\mu_j - M)(\mu_j - M)^t$, where $M = \frac{1}{k} \sum_{j=1}^{k} \mu_j$

Aside: can show: the overall covariance decomposes into a between-class part plus a within-class part, with weights involving $k$ and $n$

Classical Discrimination

Principal Discriminant Analysis (cont.)

But PCA only works like Mean Difference,

Expect can improve by taking covariance into account.

Blind application of the above ideas suggests eigen-analysis of: $\Sigma_w^{-1} \Sigma_B$

Classical Discrimination

Principal Discriminant Analysis (cont.)

There are:
• smarter ways to compute (a “generalized eigenvalue” problem, sketched below)
• other representations (this solves optimization problems)

Special case: 2 classes, reduces to standard FLD

Good reference for more: Section 3.8 of Duda, Hart & Stork (2001)
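A minimal sketch of this computation, assuming data rows in an array `X` with integer class labels `y` (the function name and the unweighted definition of the between-class covariance are illustrative choices, not from the slides); it uses the generalized eigenproblem route mentioned above rather than forming $\Sigma_w^{-1}\Sigma_B$ explicitly:

```python
# Minimal sketch of Principal Discriminant Analysis via the generalized
# eigenvalue problem Sigma_B v = lambda * Sigma_w v (assumed helper, not the
# slides' own code; requires Sigma_w to be invertible, i.e. non-HDLSS data).
import numpy as np
from scipy.linalg import eigh

def principal_discriminant_directions(X, y):
    classes = np.unique(y)
    k, d = len(classes), X.shape[1]

    means = np.array([X[y == c].mean(axis=0) for c in classes])   # k x d class means
    grand_mean = means.mean(axis=0)

    # "Between class covariance": PCA-style covariance of the class means
    Sigma_B = (means - grand_mean).T @ (means - grand_mean) / (k - 1)

    # Pooled within-class covariance
    Sigma_w = np.zeros((d, d))
    for c in classes:
        centered = X[y == c] - X[y == c].mean(axis=0)
        Sigma_w += centered.T @ centered
    Sigma_w /= (len(X) - k)

    # Generalized eigenproblem, equivalent to eigen-analysis of Sigma_w^{-1} Sigma_B
    eigvals, eigvecs = eigh(Sigma_B, Sigma_w)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
    return eigvals[order], eigvecs[:, order]     # columns = discriminant directions
```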

Classical Discrimination

Summary of Classical Ideas:

• Among “Simple Methods”
  – MD and FLD sometimes similar
  – Sometimes FLD better
  – So FLD is preferred

• Among Complicated Methods
  – GLR is best
  – So always use that

• Caution:
  – Story changes for HDLSS settings

HDLSS Discrimination

Recall main HDLSS issues:

• Sample Size, n < Dimension, d

• Singular covariance matrix

• So can’t use matrix inverse

• I.e. can’t standardize (sphere) the data (requires root inverse covariance)

• Can’t do classical multivariate analysis

HDLSS Discrimination

An approach to non-invertible covariances:
• Replace by generalized inverses
• Sometimes called pseudo inverses
• Note: there are several
• Here use the Moore-Penrose inverse
• As used by Matlab (pinv.m)
• Often provides useful results (but not always)

Recall Linear Algebra Review…

Recall Linear Algebra

Eigenvalue Decomposition:

For a (symmetric) square matrix $X_{d \times d}$

Find a diagonal matrix $D = \operatorname{diag}(\lambda_1, \ldots, \lambda_d)$

And an orthonormal matrix $B_{d \times d}$ (i.e. $B B^t = B^t B = I_{d \times d}$)

So that: $X B = B D$, i.e. $X = B D B^t$

Recall Linear Algebra (Cont.)

Eigenvalue Decomp. solves matrix problems:

• Inversion: $X^{-1} = B \operatorname{diag}(\lambda_1^{-1}, \ldots, \lambda_d^{-1})\, B^t$

• Square Root: $X^{1/2} = B \operatorname{diag}(\lambda_1^{1/2}, \ldots, \lambda_d^{1/2})\, B^t$

• $X$ is positive (nonnegative, i.e. semi) definite $\iff$ all $\lambda_i > 0$ ($\ge 0$)
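A minimal numerical illustration of these identities, using numpy’s symmetric eigendecomposition (the example matrix is arbitrary, chosen only to be symmetric positive definite):

```python
# Minimal sketch: inverse and square root of a symmetric positive definite
# matrix via its eigenvalue decomposition X = B diag(lambda) B^t.
import numpy as np

X = np.array([[4.0, 1.0],
              [1.0, 3.0]])                   # symmetric, positive definite

lam, B = np.linalg.eigh(X)                   # columns of B are eigenvectors

X_inv  = B @ np.diag(1.0 / lam)    @ B.T     # reciprocal eigenvalues
X_sqrt = B @ np.diag(np.sqrt(lam)) @ B.T     # square-root eigenvalues

assert np.allclose(X_inv, np.linalg.inv(X))
assert np.allclose(X_sqrt @ X_sqrt, X)
assert np.all(lam > 0)      # positive definite <=> all eigenvalues > 0
```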

Recall Linear Algebra (Cont.)

Moore-Penrose Generalized Inverse:

For $X = B \operatorname{diag}(\lambda_1, \ldots, \lambda_r, 0, \ldots, 0)\, B^t$,

where $\lambda_1 \ge \cdots \ge \lambda_r > 0 = \lambda_{r+1} = \cdots = \lambda_d$,

define the generalized inverse $X^{-} = B \operatorname{diag}(\lambda_1^{-1}, \ldots, \lambda_r^{-1}, 0, \ldots, 0)\, B^t$

Recall Linear Algebra (Cont.)

Easy to see this satisfies the definition of Generalized (Pseudo) Inverse:

• $X X^{-} X = X$

• $X^{-} X X^{-} = X^{-}$

• $X X^{-}$ symmetric

• $X^{-} X$ symmetric

Recall Linear Algebra (Cont.)

Moore-Penrose Generalized Inverse:

Idea: matrix inverse on non-null space of the corresponding linear transformation

Reduces to the ordinary inverse in the full rank case, i.e. for r = d, so could just always use this

Tricky aspect: “> 0 vs. = 0” and floating point arithmetic
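A minimal sketch of the same construction with numpy’s `pinv` (the analogue of Matlab’s `pinv.m`); the `rcond` threshold is the floating-point cutoff deciding which eigenvalues count as “> 0” versus “= 0” (the matrix below is an arbitrary rank-deficient example):

```python
# Minimal sketch: Moore-Penrose generalized inverse of a singular covariance-
# like matrix, plus a check of the four defining conditions listed above.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 10))
X = A.T @ A                                # 10 x 10 but only rank 3, so singular

X_pinv = np.linalg.pinv(X, rcond=1e-12)    # rcond: "small eigenvalue = zero" cutoff

assert np.allclose(X @ X_pinv @ X, X)
assert np.allclose(X_pinv @ X @ X_pinv, X_pinv)
assert np.allclose((X @ X_pinv).T, X @ X_pinv)      # X X^- symmetric
assert np.allclose((X_pinv @ X).T, X_pinv @ X)      # X^- X symmetric
```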

HDLSS Discrimination

Application of the Generalized Inverse to FLD:

Direction (Normal) Vector: $n_{FLD} = \hat\Sigma_w^{-}\,(\bar X^{(1)} - \bar X^{(2)})$

Intercept: $\tfrac{1}{2}\,(\bar X^{(1)} + \bar X^{(2)})$

Have replaced $\hat\Sigma_w^{-1}$ by $\hat\Sigma_w^{-}$
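A minimal sketch of this rule, assuming the two classes are given as n x d arrays of row vectors (function and variable names are illustrative, not the slides’ own code):

```python
# Minimal sketch of FLD with the within-class covariance replaced by its
# Moore-Penrose generalized inverse, so it still runs when d > n.
import numpy as np

def fld_pinv(X_plus, X_minus):
    """Return FLD normal vector and intercept (midpoint of the class means)."""
    mean_p, mean_m = X_plus.mean(axis=0), X_minus.mean(axis=0)
    Cp = X_plus - mean_p
    Cm = X_minus - mean_m
    # Pooled within-class covariance (singular whenever d > n_+ + n_- - 2)
    Sigma_w = (Cp.T @ Cp + Cm.T @ Cm) / (len(X_plus) + len(X_minus) - 2)
    n_fld = np.linalg.pinv(Sigma_w) @ (mean_p - mean_m)
    intercept = 0.5 * (mean_p + mean_m)
    return n_fld, intercept

def classify_fld(x, n_fld, intercept):
    # Assign Class +1 when the projection falls on the Class +1 side
    return +1 if n_fld @ (x - intercept) > 0 else -1
```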

HDLSS Discrimination

Toy Example: Increasing Dimension

$n_{+} = n_{-} = 20$ data vectors:

• Entry 1: Class +1: $N(2.2,\,1)$; Class -1: $N(-2.2,\,1)$

• Other Entries: $N(0,\,1)$

• All Entries Independent

Look through dimensions $d = 1, 2, \ldots, 1000$
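A minimal sketch of generating this toy example (the seed and function name are arbitrary choices):

```python
# Minimal sketch of the toy example: 20 vectors per class, entry 1 is
# N(+/-2.2, 1), and the other d-1 entries are independent N(0, 1) noise.
import numpy as np

def toy_classes(d, n=20, seed=0):
    rng = np.random.default_rng(seed)
    X_plus = rng.standard_normal((n, d))
    X_minus = rng.standard_normal((n, d))
    X_plus[:, 0] += 2.2      # Class +1: entry 1 ~ N( 2.2, 1)
    X_minus[:, 0] -= 2.2     # Class -1: entry 1 ~ N(-2.2, 1)
    return X_plus, X_minus

# e.g. sweep d = 1, 2, ..., 1000 and apply fld_pinv from the sketch above
X_plus, X_minus = toy_classes(d=50)
```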

HDLSS Discrimination

Increasing Dimension Example

[Figure: projections onto the optimal direction, onto the FLD direction, and onto both directions]

HDLSS Discrimination

Add a 2nd Dimension (noise)

[Figure: same projection onto the optimal direction; axes are the two directions, now showing 2 dimensions]

HDLSS Discrimination

Add a 3rd Dimension (noise)

[Figure: projection onto the 2-d subspace generated by the optimal direction and the FLD direction]

HDLSS Discrimination

Movie Through Increasing Dimensions (link)

HDLSS Discrimination

FLD in Increasing Dimensions:

• Low dimensions (d = 2-9):
  – Visually good separation
  – Small angle between FLD and Optimal
  – Good generalizability

• Medium Dimensions (d = 10-26):
  – Visual separation too good?!?
  – Larger angle between FLD and Optimal
  – Worse generalizability
  – Feel effect of sampling noise

HDLSS Discrimination

FLD in Increasing Dimensions:

• High Dimensions (d = 27-37):
  – Much worse angle
  – Very poor generalizability
  – But very small within class variation
  – Poor separation between classes
  – Large separation / variation ratio

HDLSS Discrimination

FLD in Increasing Dimensions:

• At HDLSS Boundary (d = 38):
  – 38 = degrees of freedom (need to estimate 2 class means)
  – Within class variation = 0 ?!?
  – Data pile up, on just two points
  – Perfect separation / variation ratio?
  – But only feels microscopic noise aspects, so likely not generalizable
  – Angle to optimal very large

HDLSS Discrimination

FLD in Increasing Dimensions:

• Just beyond HDLSS boundary (d = 39-70):

– Improves with higher dimension?!?

– Angle gets better

– Improving generalizability?

– More noise helps classification?!?

HDLSS Discrimination

FLD in Increasing Dimensions:

• Far beyond HDLSS boundary (d = 70-1000):
  – Quality degrades
  – Projections look terrible (populations overlap)
  – And generalizability falls apart as well
  – Math worked out by Bickel & Levina (2004)
  – Problem is estimation of the d x d covariance matrix

HDLSS Discrimination

Simple Solution: Mean Difference (Centroid) Method

• Recall not classically recommended
  – Usually no better than FLD
  – Sometimes worse

• But avoids estimation of covariance (see the sketch below)
• Means are very stable
• Doesn’t feel the HDLSS problem
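A minimal sketch of the Mean Difference (Centroid) rule in the same conventions as the FLD sketch above (no covariance estimate is needed at all):

```python
# Minimal sketch of the Mean Difference (Centroid) method: project onto the
# difference of the class means and cut at their midpoint.
import numpy as np

def mean_difference_rule(X_plus, X_minus):
    mean_p, mean_m = X_plus.mean(axis=0), X_minus.mean(axis=0)
    direction = mean_p - mean_m             # normal vector of separating plane
    intercept = 0.5 * (mean_p + mean_m)     # midpoint of the two means
    return direction, intercept

def classify_md(x, direction, intercept):
    return +1 if direction @ (x - intercept) > 0 else -1
```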

HDLSS Discrimination

Mean Difference (Centroid) Method

Same data, movie over dimensions (link)

HDLSS Discrimination

Mean Difference (Centroid) Method

• Far more stable over dimensions
• Because it is the likelihood ratio solution (for known variance Gaussians)
• Doesn’t feel the HDLSS boundary
• Eventually becomes too good?!?
  Widening gap between clusters?!?
• Careful: angle to optimal grows
• So lose generalizability

HDLSS data present some odd effects…

Maximal Data Piling

Strange FLD effect at HDLSS boundary:

Data Piling: For each class, all data project to a single value

Maximal Data Piling

What is happening?

• Hard to imagine

• Since our intuition is 3-dimensional

• Came from our ancestors…

Try to understand data piling with some simple examples

Maximal Data Piling

Simple example (Ahn & Marron 2005): $n_{+} = n_{-} = 2$ in $\mathbb{R}^3$

Let $\tilde H_{+}$ be the hyperplane:

• Generated by Class +1

• Which has dimension = 1

• I.e. the line containing the 2 points

Similarly, let $\tilde H_{-}$ be the hyperplane

• Generated by Class -1

Maximal Data Piling

Simple example: $n_{+} = n_{-} = 2$ in $\mathbb{R}^3$

Let $H_{+},\, H_{-}$ be

• Parallel shifts of $\tilde H_{+},\, \tilde H_{-}$

• So that they pass through the origin

• Still have dimension 1

• But now are subspaces

Maximal Data Piling

Simple example: $n_{+} = n_{-} = 2$ in $\mathbb{R}^3$ [figure]

Maximal Data Piling

Simple example: $n_{+} = n_{-} = 2$ in $\mathbb{R}^3$

Construction 1:

Consider the subspace generated by $H_{+}$ & $H_{-}$:

• Two dimensional

• Shown as cyan plane

Maximal Data Piling

Simple example: $n_{+} = n_{-} = 2$ in $\mathbb{R}^3$

Construction 1 (cont.):

Let $v_{MDP}$ be

• The direction orthogonal to that subspace

• One dimensional

• Makes Class +1 Data project to one point

• And Class -1 Data project to one point

• Called the Maximal Data Piling Direction

Maximal Data Piling

Simple example: $n_{+} = n_{-} = 2$ in $\mathbb{R}^3$

Construction 2:

Let $H_{+}^{\perp}$ & $H_{-}^{\perp}$ be

• The subspaces orthogonal to $H_{+}$ & $H_{-}$ (respectively)

• Projection onto $H_{+}^{\perp}$ collapses Class +1

• Projection onto $H_{-}^{\perp}$ collapses Class -1

• Both are 2-d (planes)

Maximal Data Piling

Simple example: $n_{+} = n_{-} = 2$ in $\mathbb{R}^3$

Construction 2 (cont.):

Let the intersection of $H_{+}^{\perp}$ & $H_{-}^{\perp}$ be $v_{MDP}$

• Same Maximal Data Piling Direction

• Projection collapses both Class +1 and Class -1

• Intersection of 2-d (planes) is a 1-d direction

Maximal Data Piling

General Case: $n_{+},\, n_{-}$ in $\mathbb{R}^d$, with $d \ge n_{+} + n_{-}$

Let $\tilde H_{+}$ & $\tilde H_{-}$ be

• The hyperplanes generated by the Classes

• Of dimensions $n_{+} - 1,\ n_{-} - 1$ (resp.)

Let $H_{+}$ & $H_{-}$ be

• Parallel subspaces

• I.e. shifts to the origin

• Of dimensions $n_{+} - 1,\ n_{-} - 1$ (resp.)

Maximal Data Piling

General Case: $n_{+},\, n_{-}$ in $\mathbb{R}^d$, with $d \ge n_{+} + n_{-}$

Let $H_{+}^{\perp}$ & $H_{-}^{\perp}$ be

• The orthogonal subspaces

• Of dimensions $d - n_{+} + 1,\ d - n_{-} + 1$ (resp.)

• Where
  – Projection in $H_{+}^{\perp}$ directions collapses Class +1
  – Projection in $H_{-}^{\perp}$ directions collapses Class -1

• Expect intersection of dimension $d - n_{+} - n_{-} + 2$

Maximal Data Piling

General Case: $n_{+},\, n_{-}$ in $\mathbb{R}^d$, with $d \ge n_{+} + n_{-}$

Can show (Ahn & Marron 2005):

• Most directions in the intersection collapse all data to 0

• But there is a direction, $v_{MDP}$,

• Where the Classes collapse to different points

• Unique in the subspace generated by the data

• Called the Maximal Data Piling Direction

Maximal Data Piling

Movie Through Increasing Dimensions (link)

Maximal Data Piling

MDP in Increasing Dimensions:

• Sub HDLSS dimensions (d = 1-37):
  – Looks similar to FLD?!?
  – Reason for this?

• At HDLSS Boundary (d = 38):
  – Again similar to FLD…

Maximal Data Piling

MDP in Increasing Dimensions:

• For HDLSS dimensions (d = 39-1000):
  – Always have data piling
  – Gap between the piles grows for larger n
  – Even though noise increases?
  – Angle (generalizability) first improves, d = 39-180
  – Then worsens, d = 200-1000
  – Eventually noise dominates
  – Trade-off is where the gap is near the optimal difference

Maximal Data Piling

How to compute $v_{MDP}$?

Can show (Ahn & Marron 2005): $v_{MDP} = \hat\Sigma^{-}\,(\bar X^{(1)} - \bar X^{(2)})$

Recall FLD formula: $v_{FLD} = \hat\Sigma_w^{-}\,(\bar X^{(1)} - \bar X^{(2)})$

Only difference is global vs. within class covariance estimates!
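A minimal sketch of the two formulas side by side (the global covariance uses all of the data pooled together; names are illustrative, not the slides’ code):

```python
# Minimal sketch comparing the formulas above: v_MDP uses the global covariance
# of the pooled data, v_FLD the pooled within-class covariance (generalized
# inverses, so both run in the HDLSS case).
import numpy as np

def mdp_and_fld_directions(X_plus, X_minus):
    mean_p, mean_m = X_plus.mean(axis=0), X_minus.mean(axis=0)
    diff = mean_p - mean_m

    X_all = np.vstack([X_plus, X_minus])
    Sigma_global = np.cov(X_all, rowvar=False)          # global covariance estimate

    Cp = X_plus - mean_p
    Cm = X_minus - mean_m
    Sigma_within = (Cp.T @ Cp + Cm.T @ Cm) / (len(X_all) - 2)

    v_mdp = np.linalg.pinv(Sigma_global) @ diff
    v_fld = np.linalg.pinv(Sigma_within) @ diff
    return v_mdp, v_fld
```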

Maximal Data Piling

Historical Note:

• Discovery of MDP

• Came from a programming error

• Forgetting to use the within class covariance

• In FLD…

Maximal Data Piling

Visual similarity of $v_{MDP}$ & $v_{FLD}$?

Can show (Ahn & Marron 2005), for d < n:

$\frac{v_{MDP}}{\|v_{MDP}\|} = \frac{v_{FLD}}{\|v_{FLD}\|}$

I.e. the directions are the same!

• How can this be?

• Note the lengths are different…

• Study from a transformation viewpoint
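A quick numerical check of this identity for d < n data, reusing `mdp_and_fld_directions` from the earlier sketch (the data are arbitrary Gaussians with shifted means):

```python
# Minimal check of v_MDP / ||v_MDP|| == v_FLD / ||v_FLD|| when d < n.
import numpy as np

rng = np.random.default_rng(1)
X_plus = rng.standard_normal((30, 5)) + 1.0     # d = 5, n = 60 > d
X_minus = rng.standard_normal((30, 5)) - 1.0

v_mdp, v_fld = mdp_and_fld_directions(X_plus, X_minus)
same_direction = np.allclose(v_mdp / np.linalg.norm(v_mdp),
                             v_fld / np.linalg.norm(v_fld))
print(same_direction)    # expect True: same direction, different lengths
```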

Maximal Data Piling

Recall the transformation view of FLD: [figure]

Maximal Data Piling

Include the corresponding MDP transformation: [figure]

Both give the Same Result!

Maximal Data Piling

Details: FLD separating plane normal vector; Within Class PC1, PC2; Global PC1, PC2 [figure legend]

Maximal Data Piling

Acknowledgement:

• This viewpoint

• I.e. insight into why FLD = MDP (for low dimensional data)

• Suggested by Daniel Peña

Maximal Data Piling

Fun example: rotate from PCA to MDP directions [figure]

Maximal Data Piling

MDP for other class labellings:

• Always exists

• Separation bigger for natural clusters

• Could be used for clustering (see the sketch after this list)
  – Consider all directions
  – Find the one that makes the largest gap
  – Very hard optimization problem
  – Over $2^n$ possible directions
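A minimal brute-force sketch of this clustering idea for a tiny HDLSS data set (the gap measure and all names are my own illustrative choices): try every nontrivial 2-class labelling and keep the one whose MDP projection separates the two piles the most.

```python
# Minimal sketch: brute-force search over all 2-class labellings of n points,
# scoring each labelling by the gap between its two MDP projection piles.
import numpy as np
from itertools import product

def mdp_gap(X, labels):
    mean_p = X[labels == +1].mean(axis=0)
    mean_m = X[labels == -1].mean(axis=0)
    v = np.linalg.pinv(np.cov(X, rowvar=False)) @ (mean_p - mean_m)
    proj = X @ (v / np.linalg.norm(v))
    return abs(proj[labels == +1].mean() - proj[labels == -1].mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 20))          # n = 6 points in d = 20 (HDLSS)

labellings = [np.array(lab) for lab in product([-1, +1], repeat=len(X))
              if 0 < sum(l == +1 for l in lab) < len(X)]   # skip one-class splits
best = max(labellings, key=lambda lab: mdp_gap(X, lab))    # 2^n - 2 candidates
```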

Maximal Data Piling

A point of terminology (Ahn & Marron 2005):

MDP is “maximal” in 2 senses:

1. # of data piled

2. Size of gap (within the subspace generated by the data)

Maximal Data Piling

Recurring, over-arching issue:

HDLSS space is a weird place

Kernel Embedding

Aizerman, Braverman and Rozoner (1964)

• Motivating idea: Extend the scope of linear discrimination, by adding nonlinear components to the data (embedding in a higher dimensional space)

• Better use of the name: nonlinear discrimination?

Kernel Embedding

Toy Examples:

In 1d, linear separation splits the domain $\{x : x \in \mathbb{R}^1\}$ into only 2 parts

Kernel Embedding

But in the “quadratic embedded domain” $\{(x, x^2) : x \in \mathbb{R}^1\}$,

linear separation can give 3 parts

Kernel Embedding

But in the quadratic embedded domain $\{(x, x^2) : x \in \mathbb{R}^1\}$,

linear separation can give 3 parts:

• original data space lies in a 1-d manifold

• very sparse region of $\mathbb{R}^2$

• curvature of the manifold gives better linear separation

• can have any 2 break points (2 points determine a line)

Kernel Embedding

Stronger effects for higher order polynomial embedding:

E.g. for the cubic embedding $\{(x, x^2, x^3) : x \in \mathbb{R}^1\}$,

linear separation can give 4 parts (or fewer)

Kernel Embedding

Stronger effects for higher order polynomial embedding:

• original space lies in a 1-d manifold, even sparser in $\mathbb{R}^3$

• higher-dimensional curvature gives improved linear separation

• can have any 3 break points (3 points determine a plane)?

• Note: relatively few “interesting separating planes”

Kernel Embedding

General View: for the original data matrix

$\begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{d1} & \cdots & x_{dn} \end{pmatrix}$

add rows:

$\begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{d1} & \cdots & x_{dn} \\ x_{11}^2 & \cdots & x_{1n}^2 \\ \vdots & & \vdots \\ x_{d1}^2 & \cdots & x_{dn}^2 \\ x_{11} x_{21} & \cdots & x_{1n} x_{2n} \\ \vdots & & \vdots \end{pmatrix}$

i.e. embed in a Higher Dimensional space
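A minimal sketch of this “add rows” embedding for the quadratic case, in the slide’s columns-as-data-vectors convention (the helper name is mine; squares of every coordinate plus all pairwise products are appended):

```python
# Minimal sketch of a hand-rolled quadratic embedding: stack the original
# d x n data matrix with rows of squared entries and pairwise products.
import numpy as np
from itertools import combinations

def quadratic_embed(X):
    """X is d x n (columns are data vectors); return the embedded matrix."""
    d, n = X.shape
    squares = X ** 2                                             # rows x_j^2
    cross = [X[i] * X[j] for i, j in combinations(range(d), 2)]  # rows x_i * x_j
    return np.vstack([X, squares] + cross)

X = np.random.default_rng(0).standard_normal((3, 5))   # d = 3, n = 5
X_embedded = quadratic_embed(X)                        # (3 + 3 + 3) x 5
```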

Kernel Embedding

Embedded Fisher Linear Discrimination:

Choose Class +1, for any $x_0 \in \mathbb{R}^d$, when:

$x_0^t\, \hat\Sigma_w^{-1}\,(\bar X^{(1)} - \bar X^{(2)}) > \tfrac{1}{2}\,(\bar X^{(1)} + \bar X^{(2)})^t\, \hat\Sigma_w^{-1}\,(\bar X^{(1)} - \bar X^{(2)})$

in the embedded space.

• image of the class boundaries in the original space is nonlinear

• allows more complicated class regions

• Can also do Gaussian Likelihood Ratio (or others)

• Compute the image by classifying points from the original space
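A minimal sketch tying the pieces together: embed both classes and a query point with the `quadratic_embed` sketch above, then apply the `fld_pinv` rule in the embedded space (both helpers are illustrative assumptions from the earlier sketches, not the slides’ code):

```python
# Minimal sketch of embedded FLD: a linear rule in the embedded space gives a
# nonlinear class boundary back in the original space.
import numpy as np

def embedded_fld_classify(x0, X_plus, X_minus):
    # quadratic_embed works on d x n matrices; fld_pinv on n x d row-data
    E_plus = quadratic_embed(X_plus.T).T
    E_minus = quadratic_embed(X_minus.T).T
    e0 = quadratic_embed(x0.reshape(-1, 1)).ravel()
    n_fld, intercept = fld_pinv(E_plus, E_minus)
    return +1 if n_fld @ (e0 - intercept) > 0 else -1
```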
