1
UNC, Stat & OR
??? Place ???
Object Oriented Data Analysis
J. S. Marron
Dept. of Statistics and Operations Research, University of North Carolina
May 3, 2023
2
Interdisciplinary Relationship
How does:
Statistics
Relate to:
Mathematics?
(probability, optimization, geometry, …)
3
Statistics - Mathematics Relationship
Mathematical Statistics:
Validation of existing methods
Asymptotics (n → ∞) & Taylor expansion
Comparison of existing methods
(requires hard math, but really “accounting”???)
4
Statistics - Mathematics Relationship
Suggested New Relationship:
Put Mathematics to work to
Generate New Statistical Ideas/Approaches
(publishable in the Ann. Stat.???)
5
Personal Opinions on Mathematical Statistics
What is Mathematical Statistics?
Validation of existing methods
Asymptotics (n → ∞) & Taylor expansion
Comparison of existing methods
(requires hard math, but really “accounting”???)
6
Personal Opinions on Mathematical Statistics
What could Mathematical Statistics be?
Basis for invention of new methods
Complicated data → mathematical ideas
Do we value creativity?
Since we don’t do this, others do…
(where are the $$$s???)
7
Personal Opinions on Mathematical Statistics
Since we don’t do this, others do…
Pattern Recognition
Artificial Intelligence
Neural Nets
Data Mining
Machine Learning
???
8
Personal Opinions on Mathematical Statistics
Possible Litmus Test: Creative Statistics
Clinical Trials Viewpoint: Worst Imaginable Idea
Mathematical Statistics Viewpoint: ???
9
Object Oriented Data Analysis, I
What is the “atom” of a statistical analysis?
1st Course: Numbers
Multivariate Analysis Course: Vectors
Functional Data Analysis: Curves
More generally: Data Objects
10
Functional Data Analysis, I
Curves as Data Objects
Important Duality: Curve Space ↔ Point Cloud Space
Illustrate with Travis Gaydos Graphics
2 dim’al curves (easy to visualize)
11
Functional Data Analysis, Toy EG I
12
Functional Data Analysis, Toy EG II
13
Functional Data Analysis, Toy EG III
14
Functional Data Analysis, Toy EG IV
15
Functional Data Analysis, Toy EG V
16
Functional Data Analysis, Toy EG VI
17
Functional Data Analysis, Toy EG VII
18
Functional Data Analysis, Toy EG VIII
19
Functional Data Analysis, Toy EG IX
20
Functional Data Analysis, Toy EG X
21
Functional Data Analysis, 10-d Toy EG 1
22
Functional Data Analysis, 10-d Toy EG 1
23
Functional Data Analysis, 10-d Toy EG 2
24
Functional Data Analysis, 10-d Toy EG 2
25
Object Oriented Data Analysis, I
What is the “atom” of a statistical analysis?
1st Course: Numbers
Multivariate Analysis Course: Vectors
Functional Data Analysis: Curves
More generally: Data Objects
26
Object Oriented Data Analysis, II
Examples:
Medical Image Analysis
Images as Data Objects?
Shape Representations as Objects
Micro-arrays for Gene Expression
Just multivariate analysis?
27
Object Oriented Data Analysis, III
Typical Goals:
Understanding population variation
Visualization
Principal Component Analysis +
Discrimination (a.k.a. Classification)
Time Series of Data Objects
28
Object Oriented Data Analysis, IV
Major Statistical Challenge, I:
High Dimension Low Sample Size (HDLSS)
Dimension d >> sample size n
“Multivariate Analysis” nearly useless
Can’t “normalize the data”
Land of Opportunity for Statisticians
Need for “creative statisticians”
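One way to see why the data cannot be “normalized” when d >> n: the d × d sample covariance has rank at most n − 1, so it cannot be inverted for sphering. The sketch below is illustrative only; the function name and the Gaussian toy data are assumptions, not part of the talk.

```python
import numpy as np

def sample_covariance_rank(n, d, seed=0):
    """Rank of the d x d sample covariance of n Gaussian points in dimension d."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))   # rows are cases, columns are features
    S = np.cov(X, rowvar=False)       # d x d sample covariance (mean removed)
    return int(np.linalg.matrix_rank(S))

if __name__ == "__main__":
    # HDLSS case: n = 5 cases in d = 20 dimensions, so S is singular
    print(sample_covariance_rank(n=5, d=20))
    # Classical case: n = 10 cases in d = 3 dimensions, so S has full rank
    print(sample_covariance_rank(n=10, d=3))
```

Centering the data uses up one degree of freedom, which is why the rank is n − 1 rather than n in the HDLSS case.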
29
Object Oriented Data Analysis, V
Major Statistical Challenge, II:
Data may live in non-Euclidean space
Lie Group / Symmet’c Spaces (manifold data)
Trees/Graphs as data objects
Interesting Issues:
What is “the mean” (pop’n center)?
How do we quantify “pop’n variation”?
30
Statistics in Image Analysis, I
First Generation Problems:
Denoising
Segmentation
Registration
(all about single images)
31
Statistics in Image Analysis, II
Second Generation Problems:
Populations of Images
Understanding Population Variation
Discrimination (a.k.a. Classification)
Complex Data Structures (& Spaces)
HDLSS Statistics
32
HDLSS Statistics in Imaging
Why HDLSS (High Dim, Low Sample Size)?
Complex 3-d Objects Hard to Represent
Often need d = 100’s of parameters
Complex 3-d Objects Costly to Segment
Often have n = 10’s cases
33
Medical Imaging – A Challenging Example
Male Pelvis
Bladder – Prostate – Rectum
How do they move over time (days)?
Critical to Radiation Treatment (cancer)
Work with 3-d CT
Very Challenging to Segment
Find boundary of each object?
Represent each Object?
34
Male Pelvis – Raw Data
One CT Slice (in 3d image)
Coccyx (Tail Bone)
Rectum
Bladder
35
Male Pelvis – Raw Data
Bladder:
manual segmentation
Slice by slice
Reassembled
36
Male Pelvis – Raw Data
Bladder:
Slices: Reassembled in 3d
How to represent?
Thanks: Ja-Yeon Jeong
37
Object Representation
Landmarks (hard to find)
Boundary Rep’ns (no correspondence)
Medial representations
Find “skeleton”
Discretize as “atoms” called M-reps
38
3-d m-reps
Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong)
• Medial Atoms provide “skeleton”
• Implied Boundary from “spokes” → “surface”
39
3-d m-reps
M-rep model fitting
• Easy, when starting from binary (blue)
• But very expensive (30 – 40 minutes technician’s time)
• Want automatic approach
• Challenging, because of poor contrast, noise, …
• Need to borrow information across training sample
• Use Bayes approach: prior & likelihood → posterior
• ~Conjugate Gaussians, but there are issues:
• Major HDLSS challenges
• Manifold aspect of data
40
Illuminating Viewpoint
Object Space ↔ Feature Space
Focus here on collection of data objects
Here conceptualize population structure via “point clouds”
41
Personal HDLSS Viewpoint: Data
Images (cases) are “Points” In Feature Space
Features are Axes
Data set is “Point Clouds”
Use Proj’ns to visualize
42
Personal HDLSS Viewpoint: PCA
Rotated Axes
Often Insightful
One set of Dir’ns
Others Useful, too
43
Cornea Data, I
Images as data
~42 Cornea Images
Outer surface of eye
Heat map of curvature (in radial direction)
Hard to understand “population structure”
44
Cornea Data, II
PC 1
Starts at Pop’n Mean
Overall Curvature
Vertical Astigmatism
Correlated!
Gaussian Projections
Visualization: Can’t Overlay (so use movie)
45
Cornea Data, III
PC 2
Horrible Outlier! (present in data)
But look only in center:
Steep at top -- bottom
Want Robust PCA
For HDLSS data ???
46
Cornea Data, IV
Robust PC 2
No outlier impact
See top – bottom variation
Projections now Gaussian
47
PCA for m-reps, I
Major issue: m-reps live in $\mathbb{R}^3 \times \mathbb{R}^+ \times SO(3) \times SO(2)$
(locations, radius and angles)
E.g. “average” of: $2^\circ, 3^\circ, 358^\circ, 359^\circ$ = ???
Natural Data Structure is:
Lie Groups ~ Symmetric spaces
(smooth, curved manifolds)
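The angle example can be made concrete. The sketch below contrasts the ordinary arithmetic mean of the four angles with one common manifold-aware choice, the mean direction of the corresponding unit vectors on the circle. Using this particular mean is an assumption made for illustration; the talk itself only poses the question.

```python
import math

def naive_mean_deg(angles):
    """Ordinary arithmetic mean, ignoring that 359 degrees is next to 0 degrees."""
    return sum(angles) / len(angles)

def circular_mean_deg(angles):
    """Mean direction: average the unit vectors, then read off the angle."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

if __name__ == "__main__":
    angles = [2.0, 3.0, 358.0, 359.0]
    print(naive_mean_deg(angles))     # points in the opposite direction to the data
    print(circular_mean_deg(angles))  # close to 0.5 degrees, inside the data
```

The arithmetic mean lands near 180°, far from every data point, which is the sense in which Euclidean summaries fail on angle data.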
48
PCA for m-reps, II
PCA on non-Euclidean spaces?
(i.e. on Lie Groups / Symmetric Spaces)
T. Fletcher: Principal Geodesic Analysis
Idea: replace “linear summary of data”
With “geodesic summary of data”…
49
PGA for m-reps, Bladder-Prostate-Rectum
Bladder – Prostate – Rectum, 1 person, 17 days
Panels: PG 1, PG 2, PG 3
(analysis by Ja Yeon Jeong)
52
HDLSS Classification (i.e. Discrimination)
Background: Two Class (Binary) version:
Using “training data” from Class +1, and from Class -1
Develop a “rule” for assigning new data to a Class
Canonical Example: Disease Diagnosis
New Patients are “Healthy” or “Ill”
Determined based on measurements
53
HDLSS Classification (Cont.)
Ineffective Methods:
Fisher Linear Discrimination
Gaussian Likelihood Ratio
Less Useful Methods:
Nearest Neighbors
Neural Nets
(“black boxes”, no “directions” or intuition)
54
HDLSS Classification (Cont.)
Currently Fashionable Methods:
Support Vector Machines
Trees Based Approaches
New High Tech Method
Distance Weighted Discrimination (DWD)
Specially designed for HDLSS data
Avoids “data piling” problem of SVM
Solves more suitable optimization problem
55
HDLSS Classification (Cont.)
Currently Fashionable Methods:
Trees Based Approaches
Support Vector Machines:
56
Kernel Embedding Idea
Aizerman, Braverman, Rozonoer (1964)
Make data linearly separable by embedding in higher dimensional space
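A minimal sketch of the embedding idea, under assumed toy data: one-dimensional points whose “inner” class sits between two groups of the “outer” class cannot be split by any single threshold, but become separable after the hypothetical embedding x → (x, x²).

```python
def embed(x):
    """Hypothetical polynomial embedding of a 1-d point into 2-d."""
    return (x, x * x)

def separable_by_threshold(values, labels):
    """True if one threshold on a single coordinate splits the two classes."""
    pos = [v for v, lab in zip(values, labels) if lab == +1]
    neg = [v for v, lab in zip(values, labels) if lab == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)

# Toy data (an assumption for illustration): class -1 in the middle, class +1 outside
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [+1, +1, -1, -1, -1, +1, +1]

if __name__ == "__main__":
    print(separable_by_threshold(xs, ys))                 # original 1-d data
    second_coord = [embed(x)[1] for x in xs]
    print(separable_by_threshold(second_coord, ys))       # after embedding
```

In the embedded plane a horizontal line (a threshold on the x² coordinate) separates the classes, which is a linear rule in the higher dimensional space.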
57
Kernel Embedding Idea
Linearly separable by embedding in higher dimensions
61
Kernel Embedding Idea
Linearly separable by embedding in higher dimensions
Distributional Assumptions in Embedded Space?
⇓
Support Vector Machine
62
HDLSS Classification (Cont.)
Comparison of Linear Methods (toy data):
Optimal Direction
Excellent, but need dir’n in dim = 50
Maximal Data Piling (J. Y. Ahn, D. Peña)
Great separation, but generalizability???
Support Vector Machine
More separation, gen’ity, but some data piling?
Distance Weighted Discrimination
Avoids data piling, good gen’ity, Gaussians?
Toy data: $N_d(\mu, I)$, $\mu_1 = 2.2$, $n_1 = n_2 = 20$, $d = 50$
63
Distance Weighted Discrimination
Maximal Data Piling
64
Distance Weighted Discrimination
Based on Optimization Problem:
$\min_{w,b} \sum_{i=1}^{n} \frac{1}{r_i}$
More precisely: work in appropriate penalty for violations
Optimization Method (Michael Todd):
Second Order Cone Programming
Still Convex gen’tion of quadratic prog’ing
Fast greedy solution
Can use existing software
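To make the objective concrete, here is a hedged sketch that only evaluates a DWD-style criterion for a given direction; it does not solve the second order cone program. It assumes the residual is r_i = y_i (w·x_i + b) + ξ_i with slack ξ_i ≥ 0 charged at cost C per unit, which is the usual way to “work in a penalty for violations”; the names `dwd_objective` and `C` are illustrative.

```python
import math

def dwd_objective(X, y, w, b, C=100.0):
    """Sum of 1/r_i + C*xi_i, with each slack xi_i >= 0 chosen optimally.

    Assumes r_i = y_i * (w . x_i + b) + xi_i and a unit-norm direction w.
    """
    norm = math.sqrt(sum(wj * wj for wj in w))
    w = [wj / norm for wj in w]                 # rescale w to unit length
    threshold = 1.0 / math.sqrt(C)              # margin above which no slack helps
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        if margin >= threshold:
            total += 1.0 / margin               # slack is zero
        else:
            # minimizing 1/(margin + s) + C*s over s >= 0 gives s = threshold - margin
            total += 2.0 * math.sqrt(C) - C * margin
    return total

if __name__ == "__main__":
    X = [[2.0, 0.0], [-2.0, 0.0]]
    y = [+1, -1]
    print(dwd_objective(X, y, w=[1.0, 0.0], b=0.0, C=1.0))  # direction that separates
    print(dwd_objective(X, y, w=[0.0, 1.0], b=0.0, C=1.0))  # uninformative direction
```

Because every point contributes through 1/r_i, all of the data influence the direction, in contrast with a rule driven only by the closest points.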
65
Simulation Comparison
E.G. Above Gaussians:
Wide array of dim’s
SVM Subst’ly worse
MD – Bayes Optimal
DWD close to MD
66
Simulation Comparison
E.G. Outlier Mixture:
Disaster for MD
SVM & DWD much more solid
Dir’ns are “robust”
SVM & DWD similar
67
Simulation Comparison
E.G. Wobble Mixture:
Disaster for MD
SVM less good
DWD slightly better
Note: All methods come together for larger d ???
68
DWD Bias Adjustment for Microarrays
Microarray data:
Simult. Measur’ts of “gene expression”
Intrinsically HDLSS
Dimension d ~ 1,000s – 10,000s
Sample Sizes n ~ 10s – 100s
My view: Each array is “point in cloud”
69
DWD Batch and Source Adjustment
For Perou’s Stanford Breast Cancer Data
Analysis in Benito, et al (2004) Bioinformatics
https://genome.unc.edu/pubsup/dwd/
Adjust for Source Effects
Different sources of mRNA
Adjust for Batch Effects
Arrays fabricated at different times
70
DWD Adj: Raw Breast Cancer data
71
DWD Adj: Source Colors
72
DWD Adj: Batch Colors
73
DWD Adj: Biological Class Colors
74
DWD Adj: Biological Class Colors & Symbols
75
DWD Adj: Biological Class Symbols
76
DWD Adj: Source Colors
77
DWD Adj: PC 1-2 & DWD direction
78
DWD Adj: DWD Source Adjustment
79
DWD Adj: Source Adj’d, PCA view
80
DWD Adj: Source Adj’d, Class Colored
81
DWD Adj: Source Adj’d, Batch Colored
82
DWD Adj: Source Adj’d, 5 PCs
83
DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD
84
DWD Adj: S. & B1,2 vs. 3 Adjusted
85
DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs
86
DWD Adj: S. & B Adj’d, B1 vs. 2 DWD
87
DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d
88
DWD Adj: S. & B Adj’d, 5 PC view
89
DWD Adj: S. & B Adj’d, 4 PC view
90
DWD Adj: S. & B Adj’d, Class Colors
91
DWD Adj: S. & B Adj’d, Adj’d PCA
92
DWD Bias Adjustment for Microarrays
Effective for Batch and Source Adj.
Also works for cross-platform Adj.
E.g. cDNA & Affy
Despite literature claiming contrary
“Gene by Gene” vs. “Multivariate” views
Funded as part of caBIG
“Cancer BioInformatics Grid”
“Data Combination Effort” of NCI
93
Interesting Benchmark Data Set
NCI 60 Cell Lines
Interesting benchmark, since same cells
Data Web available:
http://discover.nci.nih.gov/datasetsNature2000.jsp
Both cDNA and Affymetrix Platforms
8 Major cancer subtypes
Use DWD now for visualization
94
NCI 60: Fully Adjusted Data, Leukemia Cluster
LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR
95
NCI 60: Views using DWD Dir’ns (focus on biology)
96
Why not adjust by means?
DWD is complicated: value added?
Xuxin Liu example…
Key is sizes of biological subtypes
Differing ratio trips up mean
But DWD more robust
(although still not perfect)
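The point about subtype ratios can be checked with simple arithmetic. The sketch below uses an assumed two-dimensional toy setting (not the Xuxin Liu example itself): subtypes differ along axis 2, the true batch effect is a shift along axis 1, and each batch mean is the count-weighted average of its subtype centers.

```python
def batch_mean(n_a, n_b, subtype_a, subtype_b, shift):
    """Mean of a batch with n_a cases at subtype_a and n_b at subtype_b, all moved by shift."""
    n = n_a + n_b
    return tuple((n_a * a + n_b * b) / n + s
                 for a, b, s in zip(subtype_a, subtype_b, shift))

def mean_difference(counts1, counts2):
    """Difference of batch means; the true batch effect is (1, 0) by construction."""
    sub_a, sub_b = (0.0, -1.0), (0.0, 1.0)   # biological subtypes differ along axis 2
    shift = (1.0, 0.0)                       # batch effect along axis 1 only
    m1 = batch_mean(counts1[0], counts1[1], sub_a, sub_b, (0.0, 0.0))
    m2 = batch_mean(counts2[0], counts2[1], sub_a, sub_b, shift)
    return tuple(q - p for p, q in zip(m1, m2))

if __name__ == "__main__":
    print(mean_difference((10, 10), (10, 10)))  # same subtype ratio in both batches
    print(mean_difference((10, 10), (15, 5)))   # differing subtype ratio
```

With equal ratios the mean difference recovers the batch shift; with differing ratios it picks up a spurious component along the biological direction, so subtracting it would distort the biology.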
97
Twiddle ratios of subtypes
98
DWD in Face Recognition, I
Face Images as Data
(with M. Benito & D. Peña)
Registered using landmarks
Male – Female Difference?
Discrimination Rule?
99
DWD in Face Recognition, II
DWD Direction
Good separation
Images “make sense”
Garbage at ends?
(extrapolation effects?)
100
DWD in Face Recognition, III
Interesting summary:
Jump between means (in DWD direction)
Clear separation of Maleness vs. Femaleness
101
DWD in Face Recognition, IV
Fun Comparison:
Jump between means (in SVM direction)
Also distinguishes Maleness vs. Femaleness
But not as well as DWD
102
DWD in Face Recognition, V
Analysis of difference: Project onto normals
SVM has “small gap” (feels noise artifacts?)
DWD “more informative” (feels real structure?)
103
DWD in Face Recognition, VI
Current Work:
Focus on “drivers”: (regions of interest)
Relation to Discr’n?
Which is “best”?
Lessons for human perception?
104
Time Series of Curves
Chemical Spectra, evolving over time
(with J. Wendelberger & E. Kober)
Mortality curves changing in time
(with Andres Alonzo)
105
Discrimination for m-reps
Classification for Lie Groups – Symm. Spaces
S. K. Sen, S. Joshi & M. Foskey
What is “separating plane” (for SVM-DWD)?
106
Blood vessel tree data
Marron’s brain:
Segmented from MRA
Reconstruct trees in 3d
Rotate to view
112
Blood vessel tree data
Now look over many people (data objects)
Structure of population (understand variation?)
PCA in strongly non-Euclidean Space???
113
Blood vessel tree data
Possible focus of analysis:
• Connectivity structure only (topology)
• Location, size, orientation of segments
• Structure within each vessel segment
114
Blood vessel tree data
Present Focus: Topology only
Already challenging
Later address others
Then add attributes
To tree nodes
And extend analysis
115
Blood vessel tree data
The tree team: Very Interdisciplinary
Neurosurgery: Bullitt, Ladha
Statistics: Wang, Marron
Optimization: Aydin, Pataki
116
Blood vessel tree data
Recall from above:
Marron’s brain:
Focus on back
Connectivity (topology) only
117
Blood vessel tree data
Present Focus: Topology only
Raw data as trees
Marron’s reduced tree
Back tree only
118
Blood vessel tree data
Topology only
E.g. Back Trees
Full Population
Study as movie
Understand variation?
119
Strongly Non-Euclidean Spaces
Statistics on Population of Tree-Structured Data Objects?
• Mean???
• Analog of PCA???
Strongly non-Euclidean, since:
• Space of trees not a linear space
• Not even approximately linear (no tangent plane)
120
Mildly Non-Euclidean Spaces
Useful View of Manifold Data: Tangent Space
Center: Fréchet Mean
Reason for terminology “mildly non-Euclidean”
121
Strongly Non-Euclidean Spaces
Mean of Population of Tree-Structured Data Objects?
Natural approach: Fréchet mean
$\bar{X} = \arg\min_{x} \sum_{i=1}^{n} d(X_i, x)^2$
Requires a metric (distance) on tree space
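A minimal sketch of the Fréchet mean for objects where only a distance is available. Two things here are assumptions made for illustration: the search is restricted to a finite candidate set, and each tree is encoded as the set of its node labels with the size of the symmetric difference as the metric.

```python
def frechet_mean(data, candidates, dist):
    """Candidate minimizing the sum of squared distances to the data objects."""
    return min(candidates, key=lambda x: sum(dist(xi, x) ** 2 for xi in data))

def node_set_distance(tree_a, tree_b):
    """Number of nodes present in exactly one of the two trees (assumed metric)."""
    return len(tree_a ^ tree_b)

# Toy population: each tree is a frozenset of node labels (hypothetical encoding)
trees = [
    frozenset({1}),
    frozenset({1, 2}),
    frozenset({1, 2, 3}),
    frozenset({1, 2}),
]

if __name__ == "__main__":
    center = frechet_mean(trees, set(trees), node_set_distance)
    print(sorted(center))
```

No addition or scaling of trees is ever needed, only the metric, which is why this notion of center survives in spaces that are not linear.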
122
Strongly Non-Euclidean Spaces
PCA on Tree Space?
• Recall Conventional PCA:
• Directions that explain structure in data
• Data are points in point cloud
• 1-d and 2-d projections allow insights about population structure
123
Illust’n of PCA View: PC1 Projections
124
Illust’n of PCA View: Projections on PC1,2 plane
125
Source Batch Adj: PC 1-3 & DWD direction
126
Source Batch Adj: DWD Source Adjustment
127
Strongly Non-Euclidean Spaces
PCA on Tree Space?
Key Idea (Jim Ramsay):
• Replace 1-d subspace that best approximates data
• By 1-d representation that best approximates data
Wang and Marron (2007) define notion of Treeline (in structure space)
128
Strongly Non-Euclidean Spaces
PCA on Tree Space: Treeline
• Best 1-d representation of data
Basic idea:
• From some starting tree
• Grow only in 1 “direction”
129
Strongly Non-Euclidean Spaces
PCA on Tree Space: Treeline
• Best 1-d representation of data
Problem: Hard to compute
• In particular: to solve optimization problem
Wang and Marron (2007)
• Maximum 4 vessel trees
• Hard to tackle serious trees (e.g. blood vessel trees)
130
Strongly Non-Euclidean Spaces
PCA on Tree Space: Treeline
Problem: Hard to compute
Solution: Burcu Aydin & Gabor Pataki
(linear time algorithm)
(based on clever “reformulation” of problem)
131
PCA for blood vessel tree data
PCA on Tree Space: Treelines
Interesting to compare:
• Population of Left Trees
• Population of Right Trees
• Population of Back Trees
And to study 1st, 2nd, 3rd & 4th treelines
132
PCA for blood vessel tree data
Study “Directions” 1, 2, 3, 4
For sub-populations B, L, R
(interpret later)
133
PCA for blood vessel tree data
Notes on Treeline Directions:
• PC1 always to left
• BACK has most variation to right (PC2)
• LEFT has more varia’n to 2nd level (PC2)
• RIGHT has more var’n to 1st level (PC2)
See these in the data?
134
PCA for blood vessel tree data
Notes:
PC1 – all left
PC2:
BACK - right
LEFT 2nd lev
RIGHT 1st lev
See these??
135
Strongly Non-Euclidean Spaces
PCA on Tree Space: Treeline
Next represent data as projections
• Define as closest point in tree line (same as Euclidean PCA)
• Have corresponding score (length of projection along line)
• And analog of residual (distance from data point to projection)
136
PCA for blood vessel tree data
Individual (each PC separately) Scores Plot
137
PCA for blood vessel tree data
Data Analytic Goals: Age, Gender
See these?
No…
138
PCA for blood vessel tree data
Directly study age PC scores
PC1 + PC2
- Thickness Not Sig’t
- Descendants Left Sig’t
139
Upcoming New Approach
Replace Tree-Lines by Tree-Curves:
140
Upcoming New Approach
Projections on Tree-Curves:
141
Preliminary Tree-Curve Results
First Correlation Of Structure To Age!
(Back Trees)
142
Preliminary Tree-Curve Results
But does not appear everywhere
(Left Trees)
Finding locality!
143
HDLSS Asymptotics
Why study asymptotics?
144
HDLSS Asymptotics
Why study asymptotics?
An interesting (naïve) quote:
“I don’t look at asymptotics, because I don’t have an infinite sample size”
145
HDLSS Asymptotics
Why study asymptotics?
An interesting (naïve) quote:
“I don’t look at asymptotics, because I don’t have an infinite sample size”
Suggested perspective:
Asymptotics are a tool for finding simple structure underlying complex entities
146
HDLSS Asymptotics
Which asymptotics?
n → ∞ (classical, very widely done)
d → ∞ ???
Sensible?
Follow typical “sampling process”?
Say anything, as noise level increases???
147
HDLSS Asymptotics
Which asymptotics?
n → ∞ & d → ∞
n >> d: a few results around (still have classical info in data)
n ~ d: random matrices (Iain J., et al) (nothing classically estimable)
HDLSS asymptotics: n fixed, d → ∞
148
HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
Follow typical “sampling process”?
149
HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
Follow typical “sampling process”?
Microarrays: # genes bounded
Proteomics, SNPs, …
A moot point, from perspective:
Asymptotics are a tool for finding simple structure underlying complex entities
150
HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
Say anything, as noise level increases???
151
HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
Say anything, as noise level increases???
Yes, there exists simple, perhaps surprising, underlying structure
152
HDLSS Asymptotics: Simple Paradoxes, I
For $d$ dim’al “Standard Normal” dist’n: $Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$): $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of “highest density”???
- Paradox resolved by: “density w. r. t. Lebesgue Measure”
153
HDLSS Asymptotics: Simple Paradoxes, II
For $d$ dim’al “Standard Normal” dist’n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go???
(we can only perceive 3 dim’ns)
154
HDLSS Asymptotics: Simple Paradoxes, III
For $d$ dim’al “Standard Normal” dist’n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$):
- $\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- “Everything is orthogonal”???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random
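The three paradoxes are easy to see in simulation. The sketch below (function name and choice of d are illustrative assumptions) draws two independent standard normal vectors and reports the norm of one divided by √d, their distance divided by √(2d), and the angle between them in degrees.

```python
import numpy as np

def hdlss_summaries(d, seed=0):
    """Norm / sqrt(d), distance / sqrt(2d), and angle in degrees for two N_d(0, I_d) draws."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(d)
    norm_ratio = np.linalg.norm(z1) / np.sqrt(d)
    dist_ratio = np.linalg.norm(z1 - z2) / np.sqrt(2 * d)
    cos_angle = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
    angle_deg = np.degrees(np.arccos(cos_angle))
    return float(norm_ratio), float(dist_ratio), float(angle_deg)

if __name__ == "__main__":
    for d in (10, 1000, 100000):
        print(d, hdlss_summaries(d))
```

As d grows, the two ratios settle near 1 and the angle settles near 90°, so the first order structure is essentially deterministic.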
155
HDLSS Asy’s: Geometrical Representation, I
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
a. Hyperplane through 0, of dimension $n$
b. Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$
c. Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
With P. Hall & A. Neeman
156
HDLSS Asy’s: Geometrical Representation, II
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
a. $n - 1$ dimensional hyperplane
b. Points are pairwise equidistant, dist $\sim \sqrt{2d}$
c. Points lie at vertices of “regular $n$-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?
157
HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”
158
HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
non-Gaussian data: only need moments
non-independent: use “mixing conditions” (with P. Hall & A. Neeman)
Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi)
Mixing Condition on Stand’d & Permuted Var’s (with S. Jung)
All based on simple “Laws of Large Numbers”
159
2nd Paper on HDLSS Asymptotics
Ahn, Marron, Muller & Chi (2007) Biometrika
Assume 2nd Moments (and Gaussian)
Assume no eigenvalues too large in sense:
For $\varepsilon = \frac{\left(\sum_{j=1}^{d} \lambda_j\right)^2}{d \sum_{j=1}^{d} \lambda_j^2}$ assume $\varepsilon^{-1} = o(d)$, i.e. $\varepsilon \gg \frac{1}{d}$ (min possible)
(much weaker than previous mixing conditions…)
160
HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior:
“everything similar for very high d”
2 popn’s are 2 simplices (i.e. regular n-hedrons)
All are same distance from the other class
i.e. everything is a support vector
i.e. all sensible directions show “data piling”
so “sensible methods are all nearly the same”
Including 1 - NN
161
HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size
(motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM
(based on “deeper limiting distributions”)
(reflects intuitive idea “feeling sampling variation”)
(something like “mean vs. median”)
3. 1-NN rule inefficiency is quantified.
162
HDLSS Math. Stat. of PCA, I
Consistency & Strong Inconsistency:
Spike Covariance Model (Johnstone & Paul)
For Eigenvalues: $\lambda_{1,d} = d^{\alpha}, \lambda_{2,d} = 1, \ldots, \lambda_{d,d} = 1$
1st Eigenvector: $u_1$
How good are empirical versions, $\hat{\lambda}_{1,d}, \ldots, \hat{\lambda}_{d,d}, \hat{u}_1$, as estimates?
163
HDLSS Math. Stat. of PCA, II
Consistency (big enough spike):
For $\alpha > 1$, $\mathrm{Angle}(u_1, \hat{u}_1) \to 0$
Strong Inconsistency (spike not big enough):
For $\alpha < 1$, $\mathrm{Angle}(u_1, \hat{u}_1) \to 90^\circ$
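A small simulation can illustrate the two regimes. This is a sketch under assumptions: Gaussian data with known mean 0 (so no centering is done), the spike placed along the first coordinate axis so that u_1 = e_1 with variance d^α, and the sample eigenvector taken from the SVD of the n × d data matrix.

```python
import numpy as np

def first_pc_angle_deg(d, n, alpha, seed=0):
    """Angle in degrees between u_1 = e_1 and the first sample PC direction."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))            # rows are cases
    X[:, 0] *= np.sqrt(float(d) ** alpha)      # spike: variance d^alpha along e_1
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    cos_angle = min(1.0, abs(vt[0, 0]))        # |<u_1, u_1_hat>|, guarded against rounding
    return float(np.degrees(np.arccos(cos_angle)))

if __name__ == "__main__":
    for alpha in (0.3, 1.5):
        print(alpha, first_pc_angle_deg(d=2000, n=20, alpha=alpha))
```

For a spike exponent above 1 the estimated direction stays close to u_1 even with n = 20, while for an exponent below 1 it is close to orthogonal, matching the dichotomy stated on the slide.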
164
HDLSS Math. Stat. of PCA, III
Consistency of eigenvalues?
$\frac{\hat{\lambda}_{1,d}}{\lambda_{1,d}} \xrightarrow{L} \frac{\chi^2_n}{n}$
Eigenvalues Inconsistent
But known distribution
Unless $n \to \infty$ as well
165
HDLSS Work in Progress, I
Batch Adjustment: Xuxin Liu
Recall Intuition from above:
Key is sizes of biological subtypes
Differing ratio trips up mean
But DWD more robust
Mathematics behind this?
166
Liu: Twiddle ratios of subtypes
167
HDLSS Data Combo Mathematics
Xuxin Liu Dissertation Results:
Simple Unbalanced Cluster Model
Growing at rate $d^{\alpha}$ as $d \to \infty$
Answers depend on $\alpha$
Visualization of setting….
168
HDLSS Data Combo Mathematics
169
HDLSS Data Combo Mathematics
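170
HDLSS Data Combo Mathematics
Asymptotic Results (as $d \to \infty$):
For $\alpha > \frac{1}{2}$, DWD Consistent
$\mathrm{Angle}(\mathrm{DWD}, \mathrm{Truth}) \to 0$
For $\alpha < \frac{1}{2}$, DWD Strongly Inconsistent
$\mathrm{Angle}(\mathrm{DWD}, \mathrm{Truth}) \to 90^\circ$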
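171
HDLSS Data Combo Mathematics
Asymptotic Results (as $d \to \infty$):
For $\alpha > \frac{1}{2}$, PAM Inconsistent
$\mathrm{Angle}(\mathrm{PAM}, \mathrm{Truth}) \to C_r > 0$
For $\alpha < \frac{1}{2}$, PAM Strongly Inconsistent
$\mathrm{Angle}(\mathrm{PAM}, \mathrm{Truth}) \to 90^\circ$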
172
HDLSS Data Combo Mathematics
Value of $C_r$, for sample size ratio $r$:
$C_r$ = [expression in $\cos$ and $r$; garbled in extraction]
$C_r = 0$, only when $r = 1$
Otherwise for $r \neq 1$, PAM Inconsistent
Verifies intuitive idea in strong way
173
HDLSS Work in Progress, II
Canonical Correlations: Myung Hee Lee
Results similar to those for PCA
Singular values inconsistent
But directions converge under a much milder spike assumption.
174
HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
John Kent example: $X \sim \frac{1}{2} N_d(0, I_d) + \frac{1}{2} N_d(0, 100 * I_d)$
Can only say: $\|X\| = O_p(d^{1/2})$, with $\|X\| \approx d^{1/2}$ w.p. $\frac{1}{2}$ and $\|X\| \approx 10\, d^{1/2}$ w.p. $\frac{1}{2}$, not deterministic
Conclude: need some flavor of mixing
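The Kent example is simple to simulate. In the sketch below (names and sizes are illustrative assumptions) each case is drawn from the 50/50 mixture by choosing a standard deviation of 1 or 10, and the norm of each case is divided by √d.

```python
import numpy as np

def kent_norm_ratios(d, n, seed=0):
    """||X|| / sqrt(d) for n draws from 0.5*N_d(0, I_d) + 0.5*N_d(0, 100*I_d)."""
    rng = np.random.default_rng(seed)
    scale = np.where(rng.random(n) < 0.5, 1.0, 10.0)   # sd 1 or 10, each w.p. 1/2
    X = rng.standard_normal((n, d)) * scale[:, None]   # one scale per case (row)
    return np.linalg.norm(X, axis=1) / np.sqrt(d)

if __name__ == "__main__":
    print(np.round(kent_norm_ratios(d=10000, n=10), 2))
```

The ratios cluster near 1 and near 10 rather than at a single value, so the points do not sit at one common radius and the rigid simplex picture fails without an extra condition.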
175
HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
Conclude: need some flavor of mixing
Challenge: Classical mixing conditions require notion of time ordering
Not always clear, e.g. microarrays
176
HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
Sungkyu Jung Condition: $X_d \sim (0, \Sigma_d)$ where $\Sigma_d = U_d \Lambda_d U_d^t$
Define: $Z_d = \Lambda_d^{-1/2} U_d^t X_d$
Assume: ∃ a permutation, $\pi_d$
So that $\pi_d Z_d$ is ρ-mixing
177
HDLSS Deep Open Problem
In PCA Consistency:
Strong Inconsistency - spike $\alpha < 1$
Consistency - spike $\alpha > 1$
What happens at boundary ($\alpha = 1$)???
178
The Future of HDLSS Asymptotics?
1. Address your favorite statistical problem…
2. HDLSS versions of classical optimality results?
3. Contiguity Approach (~Random Matrices)
4. Rates of convergence?
5. Improved Discrimination Methods?
It is early days…
179
The Future of Geometrical Representation?
HDLSS version of “optimality” results?
“Contiguity” approach? Params depend on d?
Rates of Convergence?
Improvements of DWD?
(e.g. other functions of distance than inverse)
It is still early days …
180
Some Carry Away Lessons
Atoms of the Analysis: Object Oriented
Viewpoint: Object Space ↔ Feature Space
DWD is attractive for HDLSS classification
“Randomness” in HDLSS data is only in rotations
(Modulo rotation, have constant simplex shape)
How to put HDLSS asymptotics to work?
181
Object Oriented Data Analysis
Potential Future Opportunity:
OODA SAMSI Program 2010-2011
Interested in joining? Let’s talk