181
1 UNC, Stat & OR ??? Place ??? ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina July 5, 2022

1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

Embed Size (px)

DESCRIPTION

3 UNC, Stat & OR Statistics - Mathematics Relationship Mathematical Statistics: Validation of existing methods Asymptotics (n  ∞) & Taylor expansion Comparison of existing methods (requires hard math, but really “accounting”???)

Citation preview

Page 1: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

11

UNC, Stat & OR

??? Place ?????? Place ???

Object Oriented Data Analysis

J. S. MarronDept. of Statistics and Operations

Research, University of North CarolinaMay 3, 2023

Page 2: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

22

UNC, Stat & OR

Interdisciplinary RelationshipInterdisciplinary Relationship

How does:

StatisticsRelate to:

Mathematics?(probability, optimization, geometry, …)

Page 3: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

33

UNC, Stat & OR

Statistics - Mathematics RelationshipStatistics - Mathematics Relationship

Mathematical Statistics: Validation of existing methods Asymptotics (n ∞) & Taylor

expansion Comparison of existing methods

(requires hard math, butreally “accounting”???)

Page 4: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

44

UNC, Stat & OR

Statistics - Mathematics RelationshipStatistics - Mathematics Relationship

Suggested New Relationship:Put Mathematics to work

toGenerate New Statistical Ideas/Approaches

(publishable in the Ann. Stat.???)

Page 5: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

55

UNC, Stat & OR

Personal Opinions on Mathematical Personal Opinions on Mathematical StatisticsStatistics

What is Mathematical Statistics? Validation of existing methods Asymptotics (n ∞) & Taylor

expansion Comparison of existing methods

(requires hard math, butreally “accounting”???)

Page 6: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

66

UNC, Stat & OR

Personal Opinions on Mathematical Personal Opinions on Mathematical StatisticsStatistics

What could Mathematical Statistics be? Basis for invention of new methods Complicated data mathematical ideas Do we value creativity? Since we don’t do this, others do…

(where are the $$$s???)

Page 7: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

77

UNC, Stat & OR

Personal Opinions on Mathematical Personal Opinions on Mathematical StatisticsStatistics

Since we don’t do this, others do… Pattern Recognition Artificial Intelligence Neural Nets Data Mining Machine Learning ???

Page 8: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

88

UNC, Stat & OR

Personal Opinions on Mathematical Personal Opinions on Mathematical StatisticsStatistics

Possible Litmus Test:Creative Statistics

Clinical Trials Viewpoint:Worst Imaginable Idea

Mathematical Statistics Viewpoint:???

Page 9: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

99

UNC, Stat & OR

Object Oriented Data Analysis, IObject Oriented Data Analysis, I

What is the “atom” of a statistical analysis? 1st Course: Numbers Multivariate Analysis Course : Vectors Functional Data Analysis: Curves More generally: Data Objects

Page 10: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1010

UNC, Stat & OR

Functional Data Analysis, IFunctional Data Analysis, I

Curves as Data ObjectsImportant Duality:Curve Space Point Cloud SpaceIllustrate with Travis Gaydos Graphics 2 dim’al curves (easy to visualize)

Page 11: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1111

UNC, Stat & OR

Functional Data Analysis, Toy EG IFunctional Data Analysis, Toy EG I

Page 12: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1212

UNC, Stat & OR

Functional Data Analysis, Toy EG IIFunctional Data Analysis, Toy EG II

Page 13: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1313

UNC, Stat & OR

Functional Data Analysis, Toy EG IIIFunctional Data Analysis, Toy EG III

Page 14: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1414

UNC, Stat & OR

Functional Data Analysis, Toy EG IVFunctional Data Analysis, Toy EG IV

Page 15: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1515

UNC, Stat & OR

Functional Data Analysis, Toy EG VFunctional Data Analysis, Toy EG V

Page 16: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1616

UNC, Stat & OR

Functional Data Analysis, Toy EG VIFunctional Data Analysis, Toy EG VI

Page 17: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1717

UNC, Stat & OR

Functional Data Analysis, Toy EG VIIFunctional Data Analysis, Toy EG VII

Page 18: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1818

UNC, Stat & OR

Functional Data Analysis, Toy EG VIIIFunctional Data Analysis, Toy EG VIII

Page 19: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

1919

UNC, Stat & OR

Functional Data Analysis, Toy EG IXFunctional Data Analysis, Toy EG IX

Page 20: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2020

UNC, Stat & OR

Functional Data Analysis, Toy EG XFunctional Data Analysis, Toy EG X

Page 21: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2121

UNC, Stat & OR

Functional Data Analysis, 10-d Toy EG 1Functional Data Analysis, 10-d Toy EG 1

Page 22: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2222

UNC, Stat & OR

Functional Data Analysis, 10-d Toy EG 1Functional Data Analysis, 10-d Toy EG 1

Page 23: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2323

UNC, Stat & OR

Functional Data Analysis, 10-d Toy EG 2Functional Data Analysis, 10-d Toy EG 2

Page 24: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2424

UNC, Stat & OR

Functional Data Analysis, 10-d Toy EG 2Functional Data Analysis, 10-d Toy EG 2

Page 25: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2525

UNC, Stat & OR

Object Oriented Data Analysis, IObject Oriented Data Analysis, I

What is the “atom” of a statistical analysis? 1st Course: Numbers Multivariate Analysis Course : Vectors Functional Data Analysis: Curves More generally: Data Objects

Page 26: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2626

UNC, Stat & OR

Object Oriented Data Analysis, IIObject Oriented Data Analysis, II

Examples: Medical Image Analysis

Images as Data Objects? Shape Representations as Objects

Micro-arrays for Gene Expression Just multivariate analysis?

Page 27: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2727

UNC, Stat & OR

Object Oriented Data Analysis, IIIObject Oriented Data Analysis, III

Typical Goals: Understanding population variation

Visualization Principal Component Analysis +

Discrimination (a.k.a. Classification) Time Series of Data Objects

Page 28: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2828

UNC, Stat & OR

Object Oriented Data Analysis, IVObject Oriented Data Analysis, IV

Major Statistical Challenge, I:High Dimension Low Sample Size (HDLSS) Dimension d >> sample size n “Multivariate Analysis” nearly useless

Can’t “normalize the data” Land of Opportunity for Statisticians

Need for “creative statisticians”

Page 29: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

2929

UNC, Stat & OR

Object Oriented Data Analysis, VObject Oriented Data Analysis, V

Major Statistical Challenge, II: Data may live in non-Euclidean space

Lie Group / Symmet’c Spaces (manifold data)

Trees/Graphs as data objects Interesting Issues:

What is “the mean” (pop’n center)? How do we quantify “pop’n variation”?

Page 30: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3030

UNC, Stat & OR

Statistics in Image Analysis, IStatistics in Image Analysis, I

First Generation Problems: Denoising Segmentation Registration

(all about single images)

Page 31: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3131

UNC, Stat & OR

Statistics in Image Analysis, IIStatistics in Image Analysis, II

Second Generation Problems: Populations of Images

Understanding Population Variation Discrimination (a.k.a. Classification)

Complex Data Structures (& Spaces) HDLSS Statistics

Page 32: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3232

UNC, Stat & OR

HDLSS Statistics in Imaging

Why HDLSS (High Dim, Low Sample Size)?

Complex 3-d Objects Hard to Represent Often need d = 100’s of parameters

Complex 3-d Objects Costly to Segment Often have n = 10’s cases

Page 33: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3333

UNC, Stat & OR

Medical Imaging – A Challenging Medical Imaging – A Challenging ExampleExample

Male Pelvis Bladder – Prostate – Rectum How do they move over time (days)? Critical to Radiation Treatment

(cancer) Work with 3-d CT Very Challenging to Segment

Find boundary of each object? Represent each Object?

Page 34: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3434

UNC, Stat & OR

Male Pelvis – Raw DataMale Pelvis – Raw Data

One CT Slice(in 3d image)

Coccyx(Tail Bone)

RectumBladder

Page 35: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3535

UNC, Stat & OR

Male Pelvis – Raw DataMale Pelvis – Raw Data

Bladder:

manual segmentation

Slice by slice

Reassembled

Page 36: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3636

UNC, Stat & OR

Male Pelvis – Raw DataMale Pelvis – Raw Data

Bladder:

Slices:Reassembled in 3d

How to represent?

Thanks: Ja-Yeon Jeong

Page 37: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3737

UNC, Stat & OR

Object RepresentationObject Representation

Landmarks (hard to find) Boundary Rep’ns (no

correspondence) Medial representations

Find “skeleton” Discretize as “atoms” called M-reps

Page 38: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3838

UNC, Stat & OR

3-d m-reps3-d m-reps

Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong)

• Medial Atoms provide “skeleton”• Implied Boundary from “spokes” “surface”

Page 39: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

3939

UNC, Stat & OR

3-d m-reps3-d m-reps

M-rep model fitting• Easy, when starting from binary (blue)• But very expensive (30 – 40 minutes technician’s

time)• Want automatic approach• Challenging, because of poor contrast, noise, …• Need to borrow information across training sample• Use Bayes approach: prior & likelihood

posterior• ~Conjugate Gaussians, but there are issues:

• Major HLDSS challenges• Manifold aspect of data

Page 40: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4040

UNC, Stat & OR

Illuminating ViewpointIlluminating Viewpoint

Object Space Feature Space

Focus here oncollection ofdata objects

Here conceptualize population structure

via “point clouds”

Page 41: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4141

UNC, Stat & OR

Personal HDLSS Viewpoint: Data

Images (cases) are “Points”

In Feature SpaceFeatures are Axes

Data set is “Point Clouds”

Use Proj’ns to visualize

Page 42: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4242

UNC, Stat & OR

Personal HDLSS Viewpoint: PCA

Rotated Axes

Often Insightful

One set of Dir’ns

Others Useful, too

Page 43: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4343

UNC, Stat & OR

Cornea Data, I

Images as data ~42 Cornea ImagesOuter surface of eyeHeat map of curvature (in radial direction)Hard to understand “population structure”

Page 44: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4444

UNC, Stat & OR

Cornea Data, II

PC 1Starts at Pop’n MeanOverall CurvatureVertical AstigmatismCorrelated!Gaussian ProjectionsVisualization: Can’t Overlay

(so use movie)

Page 45: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4545

UNC, Stat & OR

Cornea Data, III

PC 2Horrible Outlier!

(present in data)But look only in center:

Steep at top -- bottomWant Robust PCAFor HDLSS data ???

Page 46: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4646

UNC, Stat & OR

Cornea Data, IV

Robust PC 2No outlier impactSee top – bottom variationProjections now Gaussian

Page 47: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4747

UNC, Stat & OR

PCA for m-reps, I

Major issue: m-reps live in(locations, radius and angles)

E.g. “average” of: = ???

Natural Data Structure is:Lie Groups ~ Symmetric spaces

(smooth, curved manifolds)

)2()3(3 SOSO

359,358,3,2

Page 48: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4848

UNC, Stat & OR

PCA for m-reps, II

PCA on non-Euclidean spaces?(i.e. on Lie Groups / Symmetric Spaces)

T. Fletcher: Principal Geodesic Analysis

Idea: replace “linear summary of data”With “geodesic summary of data”…

Page 49: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

4949

UNC, Stat & OR

PGA for m-reps, Bladder-Prostate-Rectum

Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3

(analysis by Ja Yeon Jeong)

Page 50: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5050

UNC, Stat & OR

PGA for m-reps, Bladder-Prostate-Rectum

Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3

(analysis by Ja Yeon Jeong)

Page 51: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5151

UNC, Stat & OR

PGA for m-reps, Bladder-Prostate-Rectum

Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3

(analysis by Ja Yeon Jeong)

Page 52: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5252

UNC, Stat & OR

HDLSS Classification (i.e. HDLSS Classification (i.e. Discrimination)Discrimination)

Background: Two Class (Binary) version:

Using “training data” from Class +1, and from Class -1

Develop a “rule” for assigning new data to a Class

Canonical Example: Disease Diagnosis New Patients are “Healthy” or “Ill” Determined based on measurements

Page 53: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5353

UNC, Stat & OR

HDLSS Classification (Cont.)HDLSS Classification (Cont.)

Ineffective Methods: Fisher Linear Discrimination Gaussian Likelihood Ratio

Less Useful Methods: Nearest Neighbors Neural Nets

(“black boxes”, no “directions” or intuition)

Page 54: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5454

UNC, Stat & OR

HDLSS Classification (Cont.)HDLSS Classification (Cont.)

Currently Fashionable Methods: Support Vector Machines Trees Based Approaches

New High Tech Method Distance Weighted Discrimination

(DWD) Specially designed for HDLSS data Avoids “data piling” problem of SVM Solves more suitable optimization problem

Page 55: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5555

UNC, Stat & OR

HDLSS Classification (Cont.)HDLSS Classification (Cont.)

Currently Fashionable Methods:

Trees Based ApproachesSupport Vector Machines:

Page 56: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5656

UNC, Stat & OR

Kernel Embedding IdeaKernel Embedding Idea

Aizerman, Braverman, Rozoner (1964)

Make data linearly separableby embedding in

higher dimensional space

Page 57: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5757

UNC, Stat & OR

Kernel Embedding IdeaKernel Embedding Idea

Linearly separableby embedding inhigher dimensions

Page 58: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5858

UNC, Stat & OR

Kernel Embedding IdeaKernel Embedding Idea

Linearly separableby embedding inhigher dimensions

Page 59: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

5959

UNC, Stat & OR

Kernel Embedding IdeaKernel Embedding Idea

Linearly separableby embedding inhigher dimensions

Page 60: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6060

UNC, Stat & OR

Kernel Embedding IdeaKernel Embedding Idea

Linearly separableby embedding inhigher dimensions

Page 61: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6161

UNC, Stat & OR

Kernel Embedding IdeaKernel Embedding Idea

Linearly separableby embedding inhigher dimensions

Distributional Assumptions

in Embedded Space?

ǁ˅

Support Vector Machine

Page 62: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6262

UNC, Stat & OR

HDLSS Classification (Cont.)HDLSS Classification (Cont.)

Comparison of Linear Methods (toy data):

Optimal DirectionExcellent, but need dir’n in dim = 50

Maximal Data Piling (J. Y. Ahn, D. Peña)Great separation, but generalizability???

Support Vector MachineMore separation, gen’ity, but some data

piling?Distance Weighted Discrimination

Avoids data piling, good gen’ity, Gaussians?

50,20,2.2,, 21,1 dnnINd

Page 63: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6363

UNC, Stat & OR

Distance Weighted DiscriminationDistance Weighted Discrimination

Maximal Data Piling

Page 64: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6464

UNC, Stat & OR

Distance Weighted DiscriminationDistance Weighted Discrimination

Based on Optimization Problem:

More precisely work in appropriate penalty for violations

Optimization Method (Michael Todd): Second Order Cone Programming Still Convex gen’tion of quadratic

prog’ing Fast greedy solution Can use existing software

n

i ibw r1,

1min

Page 65: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6565

UNC, Stat & OR

Simulation ComparisonSimulation Comparison

E.G. Above Gaussians:Wide array of dim’sSVM Subst’ly worseMD – Bayes OptimalDWD close to MD

Page 66: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6666

UNC, Stat & OR

Simulation ComparisonSimulation Comparison

E.G. Outlier Mixture: Disaster for MD SVM & DWD much

more solid Dir’ns are “robust” SVM & DWD similar

Page 67: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6767

UNC, Stat & OR

Simulation ComparisonSimulation Comparison

E.G. Wobble Mixture: Disaster for MD SVM less good DWD slightly betterNote: All methods

come together for larger d ???

Page 68: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6868

UNC, Stat & OR

DWD Bias Adjustment for MicroarraysDWD Bias Adjustment for Microarrays

Microarray data: Simult. Measur’ts of “gene

expression” Intrinsically HDLSS

Dimension d ~ 1,000s – 10,000s Sample Sizes n ~ 10s – 100s

My view: Each array is “point in cloud”

Page 69: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

6969

UNC, Stat & OR

DWD Batch and Source AdjustmentDWD Batch and Source Adjustment

For Perou’s Stanford Breast Cancer Data Analysis in Benito, et al (2004)

Bioinformaticshttps://genome.unc.edu/pubsup/dwd/

Adjust for Source Effects Different sources of mRNA

Adjust for Batch Effects Arrays fabricated at different times

Page 70: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7070

UNC, Stat & OR

DWD Adj: Raw Breast Cancer dataDWD Adj: Raw Breast Cancer data

Page 71: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7171

UNC, Stat & OR

DWD Adj: Source ColorsDWD Adj: Source Colors

Page 72: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7272

UNC, Stat & OR

DWD Adj: Batch ColorsDWD Adj: Batch Colors

Page 73: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7373

UNC, Stat & OR

DWD Adj: Biological Class ColorsDWD Adj: Biological Class Colors

Page 74: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7474

UNC, Stat & OR

DWD Adj: Biological Class Colors & DWD Adj: Biological Class Colors & SymbolsSymbols

Page 75: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7575

UNC, Stat & OR

DWD Adj: Biological Class SymbolsDWD Adj: Biological Class Symbols

Page 76: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7676

UNC, Stat & OR

DWD Adj: Source ColorsDWD Adj: Source Colors

Page 77: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7777

UNC, Stat & OR

DWD Adj: PC 1-2 & DWD directionDWD Adj: PC 1-2 & DWD direction

Page 78: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7878

UNC, Stat & OR

DWD Adj: DWD Source AdjustmentDWD Adj: DWD Source Adjustment

Page 79: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

7979

UNC, Stat & OR

DWD Adj: Source Adj’d, PCA viewDWD Adj: Source Adj’d, PCA view

Page 80: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8080

UNC, Stat & OR

DWD Adj: Source Adj’d, Class ColoredDWD Adj: Source Adj’d, Class Colored

Page 81: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8181

UNC, Stat & OR

DWD Adj: Source Adj’d, Batch ColoredDWD Adj: Source Adj’d, Batch Colored

Page 82: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8282

UNC, Stat & OR

DWD Adj: Source Adj’d, 5 PCsDWD Adj: Source Adj’d, 5 PCs

Page 83: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8383

UNC, Stat & OR

DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWDDWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD

Page 84: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8484

UNC, Stat & OR

DWD Adj: S. & B1,2 vs. 3 AdjustedDWD Adj: S. & B1,2 vs. 3 Adjusted

Page 85: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8585

UNC, Stat & OR

DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCsDWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs

Page 86: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8686

UNC, Stat & OR

DWD Adj: S. & B Adj’d, B1 vs. 2 DWDDWD Adj: S. & B Adj’d, B1 vs. 2 DWD

Page 87: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8787

UNC, Stat & OR

DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’dDWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d

Page 88: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8888

UNC, Stat & OR

DWD Adj: S. & B Adj’d, 5 PC viewDWD Adj: S. & B Adj’d, 5 PC view

Page 89: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

8989

UNC, Stat & OR

DWD Adj: S. & B Adj’d, 4 PC viewDWD Adj: S. & B Adj’d, 4 PC view

Page 90: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9090

UNC, Stat & OR

DWD Adj: S. & B Adj’d, Class ColorsDWD Adj: S. & B Adj’d, Class Colors

Page 91: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9191

UNC, Stat & OR

DWD Adj: S. & B Adj’d, Adj’d PCADWD Adj: S. & B Adj’d, Adj’d PCA

Page 92: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9292

UNC, Stat & OR

DWD Bias Adjustment for Microarrays

Effective for Batch and Source Adj. Also works for cross-platform Adj.

E.g. cDNA & Affy Despite literature claiming contrary

“Gene by Gene” vs. “Multivariate” views

Funded as part of caBIG“Cancer BioInformatics Grid”

“Data Combination Effort” of NCI

Page 93: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9393

UNC, Stat & OR

Interesting Benchmark Data SetInteresting Benchmark Data Set

NCI 60 Cell Lines Interesting benchmark, since same cells Data Web available:

http://discover.nci.nih.gov/datasetsNature2000.jsp

Both cDNA and Affymetrix Platforms

8 Major cancer subtypes

Use DWD now for visualization

Page 94: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9494

UNC, Stat & OR

NCI 60: Fully Adjusted Data, NCI 60: Fully Adjusted Data, Leukemia ClusterLeukemia Cluster

LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266LEUK.SR

Page 95: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9595

UNC, Stat & OR

NCI 60: Views using DWD Dir’ns (focus on NCI 60: Views using DWD Dir’ns (focus on biology)biology)

Page 96: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9696

UNC, Stat & OR

Why not adjust by means?

DWD is complicated: value added? Xuxin Liu example… Key is sizes of biological subtypes Differing ratio trips up mean But DWD more robust

(although still not perfect)

Page 97: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9797

UNC, Stat & OR

Twiddle ratios of subtypes

Page 98: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9898

UNC, Stat & OR

DWD in Face Recognition, I

Face Images as Data(with M. Benito & D. Peña)Registered using landmarksMale – Female Difference?Discrimination Rule?

Page 99: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

9999

UNC, Stat & OR

DWD in Face Recognition, II

DWD Direction Good separation Images “make

sense” Garbage at ends?

(extrapolation effects?)

Page 100: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

100100

UNC, Stat & OR

DWD in Face Recognition, III

Interesting summary: Jump between means

(in DWD direction) Clear separation of

Maleness vs. Femaleness

Page 101: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

101101

UNC, Stat & OR

DWD in Face Recognition, IV

Fun Comparison: Jump between means

(in SVM direction) Also distinguishesMaleness vs. Femaleness But not as well as

DWD

Page 102: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

102102

UNC, Stat & OR

DWD in Face Recognition, V

Analysis of difference: Project onto normals SVM has “small gap” (feels noise artifacts?) DWD “more informative” (feels real structure?)

Page 103: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

103103

UNC, Stat & OR

DWD in Face Recognition, VI

Current Work: Focus on “drivers”:

(regions of interest) Relation to Discr’n? Which is “best”? Lessons for human

perception?

Page 104: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

104104

UNC, Stat & OR

Time Series of Curves

Chemical Spectra, evolving over time(with J. Wendelberger & E. Kober)

Mortality curves changing in time(with Andres Alonzo)

Page 105: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

105105

UNC, Stat & OR

Discrimination for m-reps

Classification for Lie Groups – Symm. SpacesS. K. Sen, S. Joshi & M. Foskey

What is “separating plane” (for SVM-DWD)?

Page 106: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

106106

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view

Page 107: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

107107

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view

Page 108: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

108108

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view

Page 109: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

109109

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view

Page 110: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

110110

UNC, Stat & OR

Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view

Blood vessel tree dataBlood vessel tree data

Page 111: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

111111

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view

Page 112: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

112112

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Now look over many people (data objects)Structure of population (understand variation?)PCA in strongly non-Euclidean Space???

, ... ,,

Page 113: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

113113

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Possible focus of analysis:• Connectivity structure only (topology)• Location, size, orientation of segments• Structure within each vessel segment

, ... ,,

Page 114: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

114114

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Present Focus:Topology only

Already challenging Later address others Then add attributes To tree nodes And extend analysis

Page 115: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

115115

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

The tree team: Very Interdsciplinary Neurosurgery: Bullitt, Ladha

Statistics: Wang, Marron

Optimization: Aydin, Pataki

Page 116: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

116116

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Recall from above:Marron’s brain: Focus on back Connectivity (topology) only

Page 117: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

117117

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Present Focus: Topology only Raw data as trees Marron’s reduced tree Back tree only

Page 118: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

118118

UNC, Stat & OR

Blood vessel tree dataBlood vessel tree data

Topology onlyE.g. Back TreesFull PopulationStudy as movieUnderstand variation?

Page 119: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

119119

UNC, Stat & OR

Strongly Non-Euclidean Strongly Non-Euclidean SpacesSpaces

Statistics on Population of Tree-Structured Data Objects?

• Mean???• Analog of PCA???

Strongly non-Euclidean, since:• Space of trees not a linear space• Not even approximately linear

(no tangent plane)

Page 120: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

120120

UNC, Stat & OR

Mildly Non-Euclidean Mildly Non-Euclidean SpacesSpaces

Useful View of Manifold Data: Tangent Space

Center:Frechét Mean

Reason forterminology“mildly nonEuclidean”

Page 121: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

121121

UNC, Stat & OR

Strongly Non-Euclidean Strongly Non-Euclidean SpacesSpaces

Mean of Population of Tree-Structured Data Objects?

Natural approach: Fréchet mean

Requires a metric (distance)on tree space

n

ii

xxXdX

1

2,minarg

Page 122: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

122122

UNC, Stat & OR

Strongly Non-Euclidean Strongly Non-Euclidean SpacesSpaces

PCA on Tree Space?• Recall Conventional PCA:• Directions that explain structure in

data

• Data are points in point cloud• 1-d and 2-d projections allow insights

about population structure

Page 123: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

123123

UNC, Stat & OR

Illust’n of PCA View: PC1 Illust’n of PCA View: PC1 ProjectionsProjections

Page 124: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

124124

UNC, Stat & OR

Illust’n of PCA View: Projections on PC1,2 Illust’n of PCA View: Projections on PC1,2 planeplane

Page 125: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

125125

UNC, Stat & OR

Source Batch Adj: PC 1-3 & DWD Source Batch Adj: PC 1-3 & DWD directiondirection

Page 126: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

126126

UNC, Stat & OR

Source Batch Adj: DWD Source Source Batch Adj: DWD Source AdjustmentAdjustment

Page 127: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

127127

UNC, Stat & OR

Strongly Non-Euclidean Strongly Non-Euclidean SpacesSpaces

PCA on Tree Space?Key Idea (Jim Ramsay):• Replace 1-d subspace

that best approximates data• By 1-d representation

that best approximates dataWang and Marron (2007) define notion of

Treeline (in structure space)

Page 128: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

128128

UNC, Stat & OR

Strongly Non-Euclidean Strongly Non-Euclidean SpacesSpaces

PCA on Tree Space: Treeline • Best 1-d representation of dataBasic idea:• From some starting tree• Grow only in 1 “direction”

Page 129: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

129129

UNC, Stat & OR

Strongly Non-Euclidean Strongly Non-Euclidean SpacesSpaces

PCA on Tree Space: Treeline • Best 1-d representation of dataProblem: Hard to compute• In particular: to solve optimization problemWang and Marron (2007)• Maximum 4 vessel trees• Hard to tackle serious trees

(e.g. blood vessel trees)

Page 130: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

130130

UNC, Stat & OR

Strongly Non-Euclidean Strongly Non-Euclidean SpacesSpaces

PCA on Tree Space: Treeline Problem: Hard to computeSolution: Burcu Aydin & Gabor Pataki

(linear time algorithm)(based on clever

“reformulation” of problem)

Page 131: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

131131

UNC, Stat & OR

PCA for blood vessel tree PCA for blood vessel tree datadata

PCA on Tree Space: Treelines Interesting to compare:• Population of Left Trees• Population of Right Trees• Population of Back TreesAnd to study 1st, 2nd, 3rd & 4th treelines

Page 132: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

132132

UNC, Stat & OR

PCA for blood vessel tree PCA for blood vessel tree datadata

Study“Directions”

1, 2, 3, 4For sub-populations

B, L, R(interpret

later)

Page 133: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

133133

UNC, Stat & OR

PCA for blood vessel tree PCA for blood vessel tree datadata

Notes on Treeline Directions:• PC1 always to left• BACK has most variation to right

(PC2)• LEFT has more varia’n to 2nd level

(PC2)• RIGHT has more var’n to 1st level

(PC2)See these in the data?

Page 134: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

134134

UNC, Stat & OR

PCA for blood vessel tree PCA for blood vessel tree datadata

Notes:PC1 – all leftPC2:BACK - right LEFT 2nd levRIGHT 1st levSee these??

Page 135: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

135135

UNC, Stat & OR

Strongly Non-Euclidean Strongly Non-Euclidean SpacesSpaces

PCA on Tree Space: Treeline Next represent data as projections• Define as closest point in tree line

(same as Euclidean PCA)• Have corresponding score

(length of projection along line)• And analog of residual

(distance from data point to projection)

Page 136: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

136136

UNC, Stat & OR

PCA for blood vessel tree PCA for blood vessel tree datadata

Individual (each PC separately) Scores Plot

Page 137: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

137137

UNC, Stat & OR

PCA for blood vessel tree PCA for blood vessel tree datadata

Data Analytic Goals: Age, Gender

See these?

No…

Page 138: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

138138

UNC, Stat & OR

PCA for blood vessel tree PCA for blood vessel tree datadata

Directly study age PC scoresPC1 + PC2- Thickness Not Sig’t- Descendants Left Sig’t

Page 139: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

139139

UNC, Stat & OR

Upcoming New ApproachUpcoming New Approach

Replace Tree-Lines by Tree-Curves:

Page 140: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

140140

UNC, Stat & OR

Upcoming New ApproachUpcoming New Approach

Projections on Tree-Curves:

Page 141: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

141141

UNC, Stat & OR

Preliminary Tree-Curve Preliminary Tree-Curve ResultsResults

First Correlation

OfStructure

To Age!

(BackTrees)

Page 142: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

142142

UNC, Stat & OR

Preliminary Tree-Curve Preliminary Tree-Curve ResultsResults

But does not appeareverywhere

(LeftTrees)

Findinglocality!

Page 143: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

143143

UNC, Stat & OR

HDLSS Asymptotics

Why study asymptotics?

Page 144: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

144144

UNC, Stat & OR

HDLSS Asymptotics

Why study asymptotics? An interesting (naïve) quote:“I don’t look at asymptotics, because I don’t have an infinite sample size”

Page 145: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

145145

UNC, Stat & OR

HDLSS Asymptotics

Why study asymptotics? An interesting (naïve) quote:

“I don’t look at asymptotics, because I don’t have an infinite sample size”

Suggested perspective:Asymptotics are a tool for finding simple

structure underlying complex entities

Page 146: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

146146

UNC, Stat & OR

HDLSS Asymptotics

Which asymptotics? n ∞ (classical, very widely done) d ∞ ??? Sensible? Follow typical “sampling process”? Say anything, as noise level increases???

Page 147: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

147147

UNC, Stat & OR

HDLSS Asymptotics

Which asymptotics? n ∞ & d ∞ n >> d: a few results around

(still have classical info in data) n ~ d: random matrices (Iain J., et al)

(nothing classically estimable) HDLSS asymptotics: n fixed, d ∞

Page 148: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

148148

UNC, Stat & OR

HDLSS Asymptotics

HDLSS asymptotics: n fixed, d ∞ Follow typical “sampling process”?

Page 149: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

149149

UNC, Stat & OR

HDLSS Asymptotics

HDLSS asymptotics: n fixed, d ∞ Follow typical “sampling process”? Microarrays: # genes bounded Proteomics, SNPs, …

A moot point, from perspective: Asymptotics are a tool for finding simple

structure underlying complex entities

Page 150: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

150150

UNC, Stat & OR

HDLSS Asymptotics

HDLSS asymptotics: n fixed, d ∞ Say anything, as noise level increases???

Page 151: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

151151

UNC, Stat & OR

HDLSS Asymptotics

HDLSS asymptotics: n fixed, d ∞ Say anything, as noise level increases???

Yes, there exists simple, perhapssurprising, underlying structure

Page 152: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

152152

UNC, Stat & OR

HDLSS Asymptotics: Simple Paradoxes, I

For dim’al “Standard Normal” dist’n:

Euclidean Distance to Origin (as ):

- Data lie roughly on surface of sphere of radius - Yet origin is point of “highest density”??? - Paradox resolved by:

“density w. r. t. Lebesgue Measure”

d

d

dd

d

INZ

ZZ ,0~

1

)1(pOdZ

d

Page 153: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

153153

UNC, Stat & OR

HDLSS Asymptotics: Simple Paradoxes, II

For dim’al “Standard Normal” dist’n: indep. of

Euclidean Dist. between and (as ):Distance tends to non-random constant:

Can extend to Where do they all go???

(we can only perceive 3 dim’ns)

d

d

dd INZ ,0~2

)1(221 pOdZZ

1Z

1Z 2Z

nZZ ,...,1

Page 154: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

154154

UNC, Stat & OR

HDLSS Asymptotics: Simple Paradoxes, III

For dim’al “Standard Normal” dist’n: indep. of

High dim’al Angles (as ):

- -“Everything is orthogonal”??? - Where do they all go???

(again our perceptual limitations) - Again 1st order structure is non-random

d

d

dd INZ ,0~2

)(90, 2/121

dOZZAngle p

1Z

Page 155: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

155155

UNC, Stat & OR

HDLSS Asy’s: Geometrical Representation, I

Assume , let Study Subspace Generated by

Dataa. Hyperplane through 0, of

dimension b. Points are “nearly

equidistant to 0”, & dist c. Within plane, can “rotate

towards Unit Simplex”d. All Gaussian data sets

are“near Unit Simplex Vertices”!!!

“Randomness” appears only in rotation of simplex

n

d ddn INZZ ,0~,...,1

d

d

With P. Hall & A. Neeman

Page 156: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

156156

UNC, Stat & OR

HDLSS Asy’s: Geometrical Representation, II

Assume , let

Study Hyperplane Generated by Data

a. dimensional hyperplane

b. Points are pairwise equidistant, dist

c. Points lie at vertices of “regular hedron”

d. Again “randomness in data” is only in rotation

e. Surprisingly rigid structure in data?

1n

d ddn INZZ ,0~,...,1

d2d~

n

Page 157: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

157157

UNC, Stat & OR

HDLSS Asy’s: Geometrical Representation, III

Simulation View: shows “rigidity after rotation”

Page 158: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

158158

UNC, Stat & OR

HDLSS Asy’s: Geometrical Representation, III

Straightforward Generalizations: non-Gaussian data: only need moments non-independent: use “mixing conditions” (with P. Hall & A. Neeman)

Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi)

Mixing Condition on Stand’d & Permuted Var’s

(with S. Jung)

All based on simple “Laws of Large Numbers”

Page 159: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

159159

UNC, Stat & OR

2nd Paper on HDLSS Asymptotics

Ahn, Marron, Muller & Chi (2007) Biometrika Assume 2nd Moments (and Gaussian) Assume no eigenvalues too large in sense:

For assume i.e. (min possible)

(much weaker than previous mixing conditions…)

d

jj

d

jj

d1

2

2

1

)(1 do 1 d

Page 160: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

160160

UNC, Stat & OR

HDLSS Asy’s: Geometrical Representation, IV

Explanation of Observed (Simulation) Behavior:“everything similar for very high d”

2 popn’s are 2 simplices (i.e. regular n-hedrons)

All are same distance from the other class i.e. everything is a support vector i.e. all sensible directions show “data piling” so “sensible methods are all nearly the same” Including 1 - NN

Page 161: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

161161

UNC, Stat & OR

HDLSS Asy’s: Geometrical Representation, V

Further Consequences of Geometric Representation

1. Inefficiency of DWD for uneven sample size(motivates “weighted version”, work in progress)

2. DWD more “stable” than SVM(based on “deeper limiting distributions”)(reflects intuitive idea “feeling sampling

variation”)(something like “mean vs. median”)

3. 1-NN rule inefficiency is quantified.

Page 162: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

162162

UNC, Stat & OR

HDLSS Math. Stat. of PCA, I

Consistency & Strong Inconsistency:Spike Covariance Model (Johnstone & Paul)For Eigenvalues: 1st Eigenvector:

How good are empirical versions,as estimates?

1,,1, ,,2,1 dddd d

1u

1,,1 ˆ,ˆ,,ˆ uddd

Page 163: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

163163

UNC, Stat & OR

HDLSS Math. Stat. of PCA, II

Consistency (big enough spike):For ,

Strong Inconsistency (spike not big enough):For ,

1

0ˆ, 11 uuAngle

1

011 90ˆ, uuAngle

Page 164: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

164164

UNC, Stat & OR

HDLSS Math. Stat. of PCA, III

Consistency of eigenvalues?

Eigenvalues Inconsistent But known distribution Unless as well

nn

dL

d

2

,1,1̂

n

Page 165: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

165165

UNC, Stat & OR

HDLSS Work in Progress, I

Batch Adjustment: Xuxin LiuRecall Intuition from above: Key is sizes of biological subtypes Differing ratio trips up mean But DWD more robust

Mathematics behind this?

Page 166: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

166166

UNC, Stat & OR

Liu: Twiddle ratios of subtypes

Page 167: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

167167

UNC, Stat & OR

HDLSS Data Combo Mathematics

Xuxin Liu Dissertation Results: Simple Unbalanced Cluster Model Growing at rate as Answers depend on

Visualization of setting….

d d

Page 168: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

168168

UNC, Stat & OR

HDLSS Data Combo Mathematics

Page 169: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

169169

UNC, Stat & OR

HDLSS Data Combo Mathematics

Page 170: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

170170

UNC, Stat & OR

HDLSS Data Combo Mathematics

Asymptotic Results (as ):

For , DWD Consistent

Angle(DWD,Truth)

For , DWD Strongly Inconsistent

Angle(DWD,Truth)

d

21

21

0

090

Page 171: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

171171

UNC, Stat & OR

HDLSS Data Combo Mathematics

Asymptotic Results (as ):

For , PAM Inconsistent

Angle(PAM,Truth)

For , DWD Strongly Inconsistent

Angle(PAM,Truth)

d

21

21

0 rC

090

Page 172: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

172172

UNC, Stat & OR

HDLSS Data Combo Mathematics

Value of , for sample size ratio :

, only when

Otherwise for , PAM Inconsistent

Verifies intuitive idea in strong way

rC

22

1cos2

1

r

rCr

0rC

r

1r

1r

Page 173: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

173173

UNC, Stat & OR

HDLSS Work in Progress, II

Canonical Correlations: Myung Hee Lee Results similar to those for those for

PCA Singular values inconsistent But directions converge under a much

milder spike assumption.

Page 174: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

174174

UNC, Stat & OR

HDLSS Work in Progress, III

Conditions for Geo. Rep’n & PCA Consist.:John Kent example:

Can only say: not deterministic

Conclude: need some flavor of mixing

dddddd ININX *100,021,0

21~

212/1212/1

2/1

..10

..)(

pwdpwd

dOX p

Page 175: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

175175

UNC, Stat & OR

HDLSS Work in Progress, III

Conditions for Geo. Rep’n & PCA Consist.:Conclude: need some flavor of mixing

Challenge: Classical mixing conditionsrequire notion of time ordering

Not always clear, e.g. microarrays

Page 176: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

176176

UNC, Stat & OR

HDLSS Work in Progress, III

Conditions for Geo. Rep’n & PCA Consist.:Sungkyu Jung Condition: whereDefine:Assume: Ǝ a permutation, So that is ρ-mixing

ddX ,0~ tdddd UU

dtddd XUZ 2/1

d

ddZ

Page 177: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

177177

UNC, Stat & OR

HDLSS Deep Open Problem

In PCA Consistency: Strong Inconsistency - spike Consistency - spike

What happens at boundary ( )???

1

1

1

Page 178: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

178178

UNC, Stat & OR

The Future of HDLSS Asymptotics?

1. Address your favorite statistical problem…2. HDLSS versions of classical optimality results?3. Continguity Approach (~Random Matrices)4. Rates of convergence?5. Improved Discrimination Methods?

It is early days…

Page 179: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

179179

UNC, Stat & OR

The Future of Geometrical Representation?

HDLSS version of “optimality” results? “Contiguity” approach? Params depend on

d? Rates of Convergence? Improvements of DWD?(e.g. other functions of distance than inverse)

It is still early days …

Page 180: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

180180

UNC, Stat & OR

Some Carry Away Lessons

Atoms of the Analysis: Object Oriented Viewpoint: Object Space Feature Space DWD is attractive for HDLSS classification “Randomness” in HDLSS data is only in rotations

(Modulo rotation, have constant simplex shape) How to put HDLSS asymptotics to work?

Page 181: 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January

181181

UNC, Stat & OR

Object Oriented Data AnalysisObject Oriented Data Analysis

Potential Future Opportunity:

OODA SAMSI Program2010-2011

Interested in joining? Let’s talk