42
1 UNC, Stat & OR Hailuoto Workshop Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina June 23, 2022

1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

Embed Size (px)

Citation preview

Page 1: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

11

UNC, Stat & OR

Hailuoto WorkshopHailuoto Workshop

Object Oriented Data Analysis, II

J. S. Marron

Dept. of Statistics and Operations

Research, University of North Carolina

April 20, 2023

Page 2: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

22

UNC, Stat & OR

HDLSS Classification (i.e. Discrimination)

Background: Two Class (Binary) version:

Using “training data” from Class +1, and from Class -1

Develop a “rule” for assigning new data to a Class

Canonical Example: Disease Diagnosis New Patients are “Healthy” of “Ill” Determined bases on measurements

Page 3: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

33

UNC, Stat & OR

HDLSS Classification (Cont.)

Ineffective Methods: Fisher Linear Discrimination Gaussian Likelihood Ratio

Less Useful Methods: Nearest Neighbors Neural Nets

(“black boxes”, no “directions” or intuition)

Page 4: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

44

UNC, Stat & OR

HDLSS Classification (Cont.)

Currently Fashionable Methods: Support Vector Machines Trees Based Approaches

New High Tech Method Distance Weighted Discrimination

(DWD) Specially designed for HDLSS data Avoids “data piling” problem of SVM Solves more suitable optimization problem

Page 5: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

55

UNC, Stat & OR

HDLSS Classification (Cont.)

Currently Fashionable Methods:

Trees Based ApproachesSupport Vector Machines:

Page 6: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

66

UNC, Stat & OR

HDLSS Classification (Cont.)

Comparison of Linear Methods (toy data):

Optimal DirectionExcellent, but need dir’n in dim = 50

Maximal Data Piling (J. Y. Ahn, D. Peña) Great separation, but generalizability???

Support Vector Machine More separation, gen’ity, but some data

piling?Distance Weighted Discrimination

Avoids data piling, good gen’ity, Gaussians?

50,20,2.2,, 21,1 dnnINd

Page 7: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

77

UNC, Stat & OR

Distance Weighted Discrimination

Maximal Data Piling

Page 8: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

88

UNC, Stat & OR

Distance Weighted Discrimination

Based on Optimization Problem:

More precisely work in appropriate penalty for violations

Optimization Method (Michael Todd): Second Order Cone Programming Still Convex gen’tion of quadratic

prog’ing Fast greedy solution Can use existing software

n

i ibw r1,

1min

Page 9: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

99

UNC, Stat & OR

Simulation Comparison

E.G. Above

Gaussians:

Wide array of dim’s

SVM Subst’ly worse

MD – Bayes Optimal

DWD close to MD

Page 10: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1010

UNC, Stat & OR

Simulation Comparison

E.G. Outlier Mixture:

Disaster for MD

SVM & DWD much

more solid

Dir’ns are “robust”

SVM & DWD similar

Page 11: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1111

UNC, Stat & OR

Simulation Comparison

E.G. Wobble Mixture:

Disaster for MD

SVM less good

DWD slightly better

Note: All methods

come together for

larger d ???

Page 12: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1212

UNC, Stat & OR

DWD Bias Adjustment for Microarrays

Microarray data: Simult. Measur’ts of “gene

expression” Intrinsically HDLSS

Dimension d ~ 1,000s – 10,000s Sample Sizes n ~ 10s – 100s

My view: Each array is “point in cloud”

Page 13: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1313

UNC, Stat & OR

DWD Batch and Source AdjustmentDWD Batch and Source Adjustment

For Perou’s Stanford Breast Cancer Data Analysis in Benito, et al (2004)

Bioinformaticshttps://genome.unc.edu/pubsup/dwd/

Adjust for Source Effects Different sources of mRNA

Adjust for Batch Effects Arrays fabricated at different times

Page 14: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1414

UNC, Stat & OR

DWD Adj: Raw Breast Cancer dataDWD Adj: Raw Breast Cancer data

Page 15: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1515

UNC, Stat & OR

DWD Adj: Source ColorsDWD Adj: Source Colors

Page 16: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1616

UNC, Stat & OR

DWD Adj: Batch ColorsDWD Adj: Batch Colors

Page 17: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1717

UNC, Stat & OR

DWD Adj: Biological Class ColorsDWD Adj: Biological Class Colors

Page 18: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1818

UNC, Stat & OR

DWD Adj: Biological Class Colors & DWD Adj: Biological Class Colors & SymbolsSymbols

Page 19: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

1919

UNC, Stat & OR

DWD Adj: Biological Class SymbolsDWD Adj: Biological Class Symbols

Page 20: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2020

UNC, Stat & OR

DWD Adj: Source ColorsDWD Adj: Source Colors

Page 21: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2121

UNC, Stat & OR

DWD Adj: PC 1-2 & DWD directionDWD Adj: PC 1-2 & DWD direction

Page 22: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2222

UNC, Stat & OR

DWD Adj: DWD Source AdjustmentDWD Adj: DWD Source Adjustment

Page 23: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2323

UNC, Stat & OR

DWD Adj: Source Adj’d, PCA viewDWD Adj: Source Adj’d, PCA view

Page 24: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2424

UNC, Stat & OR

DWD Adj: Source Adj’d, Class ColoredDWD Adj: Source Adj’d, Class Colored

Page 25: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2525

UNC, Stat & OR

DWD Adj: Source Adj’d, Batch ColoredDWD Adj: Source Adj’d, Batch Colored

Page 26: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2626

UNC, Stat & OR

DWD Adj: Source Adj’d, 5 PCsDWD Adj: Source Adj’d, 5 PCs

Page 27: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2727

UNC, Stat & OR

DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWDDWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD

Page 28: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2828

UNC, Stat & OR

DWD Adj: S. & B1,2 vs. 3 AdjustedDWD Adj: S. & B1,2 vs. 3 Adjusted

Page 29: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

2929

UNC, Stat & OR

DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCsDWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs

Page 30: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3030

UNC, Stat & OR

DWD Adj: S. & B Adj’d, B1 vs. 2 DWDDWD Adj: S. & B Adj’d, B1 vs. 2 DWD

Page 31: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3131

UNC, Stat & OR

DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’dDWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d

Page 32: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3232

UNC, Stat & OR

DWD Adj: S. & B Adj’d, 5 PC viewDWD Adj: S. & B Adj’d, 5 PC view

Page 33: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3333

UNC, Stat & OR

DWD Adj: S. & B Adj’d, 4 PC viewDWD Adj: S. & B Adj’d, 4 PC view

Page 34: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3434

UNC, Stat & OR

DWD Adj: S. & B Adj’d, Class ColorsDWD Adj: S. & B Adj’d, Class Colors

Page 35: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3535

UNC, Stat & OR

DWD Adj: S. & B Adj’d, Adj’d PCADWD Adj: S. & B Adj’d, Adj’d PCA

Page 36: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3636

UNC, Stat & OR

DWD Bias Adjustment for Microarrays

Effective for Batch and Source Adj. Also works for cross-platform Adj.

E.g. cDNA & Affy Despite literature claiming contrary

“Gene by Gene” vs. “Multivariate” views

Funded as part of caBIG“Cancer BioInformatics Grid”

“Data Combination Effort” of NCI

Page 37: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3737

UNC, Stat & OR

Why not adjust by means?

DWD is complicated: value added?

Xuxin Liu example…

Key is sizes of biological subtypes

Differing ratio trips up mean

But DWD more robust

(although still not perfect)

Page 38: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3838

UNC, Stat & OR

Twiddle ratios of subtypes

Page 39: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

3939

UNC, Stat & OR

DWD in Face Recognition, I

Face Images as Data

(with M. Benito & D. Peña)

Registered using

landmarks

Male – Female Difference?

Discrimination Rule?

Page 40: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

4040

UNC, Stat & OR

DWD in Face Recognition, II

DWD Direction

Good separation

Images “make

sense”

Garbage at ends?

(extrapolation

effects?)

Page 41: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

4141

UNC, Stat & OR

DWD in Face Recognition, III

Interesting summary:

Jump between means

(in DWD direction)

Clear separation of

Maleness vs.

Femaleness

Page 42: 1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina

4242

UNC, Stat & OR

DWD in Face Recognition, IV

Current Work:

Focus on “drivers”:

(regions of interest)

Relation to Discr’n?

Which is “best”?

Lessons for human

perception?