Multivariate Analysis: Past, Present and Future

Harrison B. Prosper
Florida State University

PHYSTAT 2003, 10 September 2003


Outline

  • Introduction
  • Historical Note
  • Current Practice
  • Issues
  • Summary

Introduction

Data are invariably multivariate

  • Particle physics (η, φ, E, f)
  • Astrophysics (θ, φ, E, t)

Introduction – II: A Textbook Example

Objects        Variables
Jet 1 (b)      3
Jet 2          3
Jet 3          3
Jet 4 (b)      3
Positron       3
Neutrino       2
Total          17

Introduction – III

Astrophysics/Particle physics: Similarities

  • Events
  • Interesting events occur at random
  • Poisson processes
  • Backgrounds are important
  • Experimental response functions
  • Huge datasets

Introduction – IV

Differences

  • In particle physics we control when events occur and under what conditions
  • We have detailed predictions of the relative frequency of various outcomes

Introduction – V: All we do is Count!

  • Our experiments are ideal Bernoulli trials
  • At Fermilab, each collision, that is, trial, is conducted the same way every 400 ns
  • de Finetti’s analysis of exchangeable trials is an accurate model of what we do

$$P(e_1, \ldots, e_n) = \int_0^1 \mathrm{Binomial}(k, n, \theta)\, f(\theta)\, d\theta$$

$$\mathrm{Binomial}(k, n, p) \to \mathrm{Poisson}(k, np)$$

[Figure: trials along a time axis]
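The counting picture can be checked numerically. The sketch below is not part of the original slides: it compares the binomial probability of k successes in n Bernoulli trials with the Poisson probability of the same count, as n grows at fixed mean np. The function names and the mean of 3 are illustrative assumptions.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, mean: float) -> float:
    """Poisson probability of observing k counts when `mean` are expected."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def max_abs_difference(n: int, mean: float, k_max: int = 20) -> float:
    """Largest gap between Binomial(k, n, mean/n) and Poisson(k, mean), k <= k_max."""
    p = mean / n
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean))
        for k in range(k_max + 1)
    )


if __name__ == "__main__":
    # The gap shrinks as the number of trials grows at fixed mean (assumed: 3).
    for n in (10, 100, 10_000, 1_000_000):
        print(n, max_abs_difference(n, mean=3.0))
```

With a very large number of trials and a small success probability per trial, the two distributions become practically indistinguishable, which is why counting experiments are usually modeled as Poisson processes.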

Introduction – VI

Typical analysis tasks

  • Data Compression
  • Clustering and cluster characterization
  • Classification/Discrimination
  • Estimation
  • Model selection/Hypothesis testing
  • Optimization

Historical Note

Karl Pearson (1857 – 1936)

P.C. Mahalanobis (1893 – 1972)

R.A. Fisher (1890 – 1962)

Historical Note – Iris Data

Iris Setosa

Iris Versicolor

R.A. Fisher, The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, v. 7, p. 179-188 (1936)

Iris Data

Variables

  • X1 Sepal length
  • X2 Sepal width
  • X3 Petal length
  • X4 Petal width

“What linear function of the four measurements will maximize the ratio of the difference between the specific means to the standard deviations within species?” R.A. Fisher

Fisher Linear Discriminant (1936)

Solution:

$$y = (\mu_A - \mu_B)^T \Sigma^{-1} x$$

$$y = x_1 + 5.9037\,x_2 - 7.1299\,x_3 - 10.1036\,x_4$$

Which is the same, within a constant, as

$$\log \frac{\mathrm{Gaussian}(x \mid \mu_A, \Sigma)}{\mathrm{Gaussian}(x \mid \mu_B, \Sigma)} = w \cdot x + b$$
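As an illustration of the linear discriminant, here is a minimal two-variable sketch. It is not Fisher's four-variable iris calculation: the data points are hypothetical, and the helper names are assumptions made for this example. It computes the direction w as the inverse of the pooled within-class covariance applied to the difference of the class means.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mean(points: Sequence[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points: Sequence[Point], mu: Point) -> List[List[float]]:
    """Sum of outer products of the deviations from the class mean."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    return [[sxx, sxy], [sxy, syy]]


def fisher_direction(class_a: Sequence[Point], class_b: Sequence[Point]) -> Point:
    """Return w = W^{-1} (mean_A - mean_B), W = pooled within-class covariance."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    s_a, s_b = scatter(class_a, mu_a), scatter(class_b, mu_b)
    dof = len(class_a) + len(class_b) - 2
    w11 = (s_a[0][0] + s_b[0][0]) / dof
    w12 = (s_a[0][1] + s_b[0][1]) / dof
    w22 = (s_a[1][1] + s_b[1][1]) / dof
    det = w11 * w22 - w12 * w12
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # Explicit inverse of the symmetric 2x2 matrix W applied to the mean difference.
    return ((w22 * dx - w12 * dy) / det, (-w12 * dx + w11 * dy) / det)


def project(w: Point, x: Point) -> float:
    """Discriminant value y = w . x."""
    return w[0] * x[0] + w[1] * x[1]


# Hypothetical toy data, two classes in two variables.
CLASS_A = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 2.0)]
CLASS_B = [(4.0, 1.0), (5.0, 2.0), (6.0, 2.0), (5.0, 1.0)]

w = fisher_direction(CLASS_A, CLASS_B)
mu_a, mu_b = mean(CLASS_A), mean(CLASS_B)
# Threshold halfway between the projected class means.
threshold = project(w, ((mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2))
print("w =", w, "threshold =", threshold)
```

For these toy points, a single cut on y separates the two classes, even though neither variable alone does so as cleanly.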

Current Practice in Particle Physics

Reducing number of variables

  • Principal Component Analysis (PCA)

Discrimination/Classification

  • Fisher Linear Discriminant (FLD)
  • Random Grid Search (RGS)
  • Feedforward Neural Network (FNN)
  • Kernel Density Estimation (KDE)

Current Practice – II

Parameter Estimation

  • Maximum Likelihood (ML)
  • Bayesian (KDE and analytical methods), e.g., see talk by Florencia Canelli (12A)

Weighting

  • Usually 0, 1, referred to as “cuts”
  • Sometimes use the R. Barlow method

Cuts (0, 1 weights)

  • Points that lie below the cuts are “cut out”
  • We refer to (x0, y0) as a cut-point

[Figure: signal (S) and background (B) events in the (x, y) plane, with cuts x > x0 and y > y0 assigning weights 0 and 1]
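A cut can be written as a weight function that returns 1 for accepted events and 0 for the rest. The sketch below assumes two variables and strict inequalities; the event lists and the cut-point are hypothetical values chosen only to make the counts easy to follow.

```python
from typing import Sequence, Tuple

Event = Tuple[float, float]


def cut_weight(event: Event, cut_point: Event) -> int:
    """Weight 1 if the event lies above both cuts, 0 if it is cut out."""
    x, y = event
    x0, y0 = cut_point
    return 1 if (x > x0 and y > y0) else 0


def count_passing(events: Sequence[Event], cut_point: Event) -> int:
    """Sum of the 0/1 weights, i.e. the number of events surviving the cuts."""
    return sum(cut_weight(e, cut_point) for e in events)


# Hypothetical toy events.
signal = [(2.0, 3.0), (3.5, 2.5), (4.0, 4.0), (1.0, 3.5)]
background = [(0.5, 1.0), (1.5, 0.5), (2.5, 1.0), (3.0, 3.0)]
cut_point = (1.5, 2.0)

S = count_passing(signal, cut_point)
B = count_passing(background, cut_point)
print("S =", S, "B =", B)
```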

Grid Search

  • Apply cuts at each grid point
  • Compute some measure of their effectiveness and choose the most effective cuts
  • Curse of dimensionality: number of cut-points ~ Nbin^Ndim

[Figure: grid of cut-points (x_i, y_i) in the (x, y) plane, with cuts x > x_i and y > y_i and the resulting counts S and B]
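A grid search over cut-points might look like the following sketch. The slides do not say which measure of effectiveness is used, so S / sqrt(S + B) is an assumption here, as are the toy events and grid values. The number of cut-points visited is Nbin^Ndim, which is what makes the approach impractical in many dimensions.

```python
import itertools
import math
from typing import List, Sequence, Tuple


def passes(event: Sequence[float], cut_point: Sequence[float]) -> bool:
    """True if every variable of the event lies above its cut."""
    return all(value > cut for value, cut in zip(event, cut_point))


def effectiveness(s: int, b: int) -> float:
    """Assumed figure of merit: S / sqrt(S + B)."""
    return s / math.sqrt(s + b) if s + b > 0 else 0.0


def grid_search(signal, background, axes: List[List[float]]) -> Tuple:
    """Try every grid point as a cut-point; return (score, cut_point, S, B)."""
    best = None
    for cut_point in itertools.product(*axes):
        s = sum(passes(e, cut_point) for e in signal)
        b = sum(passes(e, cut_point) for e in background)
        score = effectiveness(s, b)
        if best is None or score > best[0]:
            best = (score, cut_point, s, b)
    return best


# Hypothetical toy events and a 4 x 4 grid (Nbin = 4, Ndim = 2).
signal = [(2.0, 3.0), (3.5, 2.5), (4.0, 4.0), (1.5, 3.5)]
background = [(0.5, 1.0), (1.5, 0.5), (2.5, 1.0), (0.5, 3.5)]
axes = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]

best = grid_search(signal, background, axes)
print("best cut-point:", best[1], "S =", best[2], "B =", best[3])
```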

Random Grid Search

  • Take each point of the signal class as a cut-point: x > x_i, y > y_i
  • n = # events in sample
  • k = # events after cuts
  • fraction = k/n

[Figure: signal fraction versus background fraction, both axes running from 0 to 1]

H.B.P. et al., Proceedings, CHEP 1995
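The random grid search replaces the regular grid by the signal events themselves, so the number of cut-points no longer grows with the number of dimensions. The sketch below uses hypothetical Gaussian toy samples (the seed, sample sizes and class centres are assumptions) and returns, for each cut-point, the signal and background fractions k/n that would be plotted against each other.

```python
import random
from typing import List, Sequence, Tuple

Event = Tuple[float, float]


def passes(event: Event, cut_point: Event) -> bool:
    return event[0] > cut_point[0] and event[1] > cut_point[1]


def fraction_passing(events: Sequence[Event], cut_point: Event) -> float:
    """k/n: events surviving the cuts divided by events in the sample."""
    k = sum(passes(e, cut_point) for e in events)
    return k / len(events)


def random_grid_search(signal, background) -> List[Tuple[Event, float, float]]:
    """Use every signal event as a cut-point; record both fractions."""
    return [
        (c, fraction_passing(signal, c), fraction_passing(background, c))
        for c in signal
    ]


def toy_sample(rng: random.Random, centre: float, n: int) -> List[Event]:
    """Hypothetical unit-width Gaussian blob centred at (centre, centre)."""
    return [(rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)) for _ in range(n)]


rng = random.Random(2003)
signal = toy_sample(rng, 2.0, 200)
background = toy_sample(rng, 0.0, 200)

results = random_grid_search(signal, background)
# One possible choice (an assumption): largest signal minus background fraction.
best = max(results, key=lambda r: r[1] - r[2])
print("cut-point:", best[0], "signal fraction:", best[1], "background fraction:", best[2])
```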

Example: DØ Top Discovery (1995)

Optimal Discrimination

r(x, y) = constant defines the optimal decision boundary

Bayes Discriminant:

$$r(x, y) = \frac{p(x, y \mid s)\, p(s)}{p(x, y \mid b)\, p(b)}$$

[Figure: decision boundary in the (x, y) plane]
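To make the discriminant concrete, the sketch below evaluates r(x, y) for a toy model in which signal and background are both isotropic Gaussians of equal width. The model, its parameter values and the function names are assumptions for illustration only. For equal widths the contours r = constant are straight lines, and with equal priors the contour r = 1 is the line equidistant from the two means.

```python
import math
from typing import Tuple

# Assumed toy model: unit-width Gaussians centred at these points.
SIGNAL_MEAN = (2.0, 2.0)
BACKGROUND_MEAN = (0.0, 0.0)
SIGMA = 1.0


def gaussian_2d(x: float, y: float, mean: Tuple[float, float], sigma: float) -> float:
    """Isotropic two-dimensional Gaussian density."""
    dx, dy = x - mean[0], y - mean[1]
    norm = 2.0 * math.pi * sigma * sigma
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)) / norm


def bayes_discriminant(x: float, y: float, prior_s: float = 0.5) -> float:
    """r(x, y) = p(x, y | s) p(s) / [p(x, y | b) p(b)]."""
    numerator = gaussian_2d(x, y, SIGNAL_MEAN, SIGMA) * prior_s
    denominator = gaussian_2d(x, y, BACKGROUND_MEAN, SIGMA) * (1.0 - prior_s)
    return numerator / denominator


def signal_probability(x: float, y: float, prior_s: float = 0.5) -> float:
    """Posterior p(s | x, y) = r / (1 + r)."""
    r = bayes_discriminant(x, y, prior_s)
    return r / (1.0 + r)


if __name__ == "__main__":
    for point in [(1.0, 1.0), (0.5, 1.5), (2.0, 2.0), (0.0, 0.0)]:
        print(point, bayes_discriminant(*point), signal_probability(*point))
```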

FeedForward Neural Networks

Applications

  • Discrimination
  • Parameter estimation
  • Function and density estimation

Basic Idea

  • Encode the mapping (Kolmogorov, 1950s) $f: U^N \to U^K$, $f(x) = [F_1(x), \ldots, F_K(x)]$, using a set of 1-D functions.
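One common way to build such a mapping from 1-D functions is a network with a single hidden layer. The sketch below shows only the forward pass for a single output. The tanh activation, the layer layout and all weight values are assumptions for illustration, not details taken from the slides.

```python
import math
from typing import Sequence


def feed_forward(
    x: Sequence[float],
    hidden_weights: Sequence[Sequence[float]],
    hidden_biases: Sequence[float],
    output_weights: Sequence[float],
    output_bias: float,
) -> float:
    """n(x) = sum_j v_j * tanh(sum_i w_ji * x_i + b_j) + c."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(v * h for v, h in zip(output_weights, hidden)) + output_bias


if __name__ == "__main__":
    # Hypothetical weights: two inputs, three hidden nodes, one output.
    w = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
    b = [0.0, -0.2, 0.1]
    v = [1.0, -0.5, 0.25]
    print(feed_forward([0.4, 1.2], w, b, v, output_bias=0.1))
```

Each hidden node applies the same 1-D function to a different linear combination of the inputs; the multivariate mapping arises only from how these 1-D pieces are combined.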

Example: DØ Search for LeptoQuarks

[Figure: Feynman diagram involving quarks (q), a gluon (g), a lepton (l) and leptoquarks (LQ)]

Issues

Method choice

  • Life is short and data finite; so how should one choose a method?

Model complexity

  • How to reduce dimensionality of data, while minimizing loss of “information”?
  • How many model parameters?
  • How should one avoid over-fitting?

Issues – II

Model robustness

  • Is a cut on a multivariate discriminant necessarily more sensitive to modeling errors than a cut on each of its input variables?
  • What is a practical, but useful, way to assess sensitivity to modeling errors and robustness with respect to assumptions?

Issues – III

Accuracy of predictions

  • How should one place “error bars” on multivariate-based results?
  • Is a Bayesian approach useful?

Goodness of fit

  • How can this be done in multiple dimensions?

Summary

  • After ~ 80 years of effort we have many powerful methods of analysis
  • A few of these are now used routinely in physics analyses
  • The most pressing need is to understand some issues better so that when the data tsunami strikes we can respond sensibly

FNN – Probabilistic Interpretation

Minimize the empirical risk function with respect to ω

$$R(\omega) = \frac{1}{N} \sum_i [t_i - n(x_i, \omega)]^2$$

Solution (for large N)

$$n(x, \omega) = \int t(x)\, p(t \mid x)\, dt$$

If t(x) = k δ[1 − I(x)], where I(x) = 1 if x is of class k, 0 otherwise

$$n(x, \omega) = p(k \mid x) = \frac{p(x \mid k)\, p(k)}{\sum_k p(x \mid k)\, p(k)}$$

D.W. Ruck et al., IEEE Trans. Neural Networks 1(4), 296-298 (1990)
E.A. Wan, IEEE Trans. Neural Networks 1(4), 303-305 (1990)
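The result can be illustrated without a real network. In the sketch below, a lookup table with one adjustable output per bin of x stands in for a fully flexible network (an assumption made to keep the example short). The targets are t = 1 for class k and t = 0 otherwise, and the event counts are hypothetical. Minimizing the squared-error risk by gradient descent then reproduces the posterior probability computed from Bayes' theorem.

```python
from typing import List, Tuple

Event = Tuple[int, int]  # (bin index of x, target t: 1 for class k, else 0)


def empirical_risk(events: List[Event], outputs: List[float]) -> float:
    """R = (1/N) sum_i [t_i - n(x_i)]^2 for a table of per-bin outputs."""
    return sum((t - outputs[x]) ** 2 for x, t in events) / len(events)


def minimize_risk(events: List[Event], n_bins: int,
                  steps: int = 200, rate: float = 0.5) -> List[float]:
    """Plain gradient descent on the per-bin outputs."""
    outputs = [0.5] * n_bins
    for _ in range(steps):
        gradient = [0.0] * n_bins
        for x, t in events:
            gradient[x] -= 2.0 * (t - outputs[x]) / len(events)
        outputs = [o - rate * g for o, g in zip(outputs, gradient)]
    return outputs


def posterior(events: List[Event], n_bins: int, k: int = 1) -> List[float]:
    """p(k | x) = p(x | k) p(k) / sum_c p(x | c) p(c), from event frequencies."""
    classes = sorted({t for _, t in events})
    result = []
    for x in range(n_bins):
        terms = {}
        for c in classes:
            n_c = sum(1 for _, t in events if t == c)
            n_xc = sum(1 for xb, t in events if t == c and xb == x)
            terms[c] = (n_xc / n_c) * (n_c / len(events))  # p(x|c) p(c)
        result.append(terms[k] / sum(terms.values()))
    return result


# Hypothetical sample: bin 0 holds 30 class-k and 10 other events, bin 1 the reverse.
events = [(0, 1)] * 30 + [(0, 0)] * 10 + [(1, 1)] * 10 + [(1, 0)] * 30

fitted = minimize_risk(events, n_bins=2)
print("fitted outputs:", fitted)
print("posterior     :", posterior(events, n_bins=2))
```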

Self Organizing Map

Basic Idea (Kohonen, 1988)

  • Map each of K feature vectors X = (x1,..,xN)^T into one of M regions of interest defined by the vector wm so that all X mapped to a given wm are closer to it than to all remaining wm.
  • Basically, perform a coarse-graining of the feature space.
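The mapping step can be sketched as a nearest-prototype assignment. The training loop below is a simplified winner-take-all update, without the neighbourhood function that a full Kohonen map applies to adjacent nodes; it is a deliberate simplification, and the feature vectors, starting prototypes and learning rate are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_prototype(x: Sequence[float], prototypes: Sequence[Sequence[float]]) -> int:
    """Index m of the vector w_m closest to the feature vector x."""
    return min(range(len(prototypes)), key=lambda m: squared_distance(x, prototypes[m]))


def train(features, prototypes, epochs: int = 10, rate: float = 0.5) -> List[List[float]]:
    """Move the winning prototype part of the way towards each feature vector."""
    prototypes = [list(w) for w in prototypes]
    for _ in range(epochs):
        for x in features:
            m = nearest_prototype(x, prototypes)
            prototypes[m] = [wi + rate * (xi - wi) for wi, xi in zip(prototypes[m], x)]
    return prototypes


# Hypothetical feature vectors (K = 6, N = 2) and starting prototypes (M = 2).
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 9.0), (9.0, 10.0)]
prototypes = train(features, [(2.0, 2.0), (8.0, 8.0)])
regions = [nearest_prototype(x, prototypes) for x in features]
print("prototypes:", prototypes)
print("regions   :", regions)
```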

Support Vector Machines

Basic Idea

  • Data that are non-separable in N-dimensions have a higher chance of being separable if mapped into a space of higher dimension
  • Use a linear discriminant to partition the high dimensional feature space.

$$\phi: \mathbb{R}^N \to \mathbb{R}^{\mathrm{Huge}}$$

$$D(x) = w \cdot \phi(x) + b$$
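The sketch below illustrates only the mapping idea, not SVM training: there is no margin maximization, and the weights w and b are chosen by hand. The one-dimensional data and the map φ(x) = (x, x²) are assumptions for illustration. No single threshold separates the two classes in one dimension, but a linear discriminant does so after the mapping.

```python
from typing import Sequence, Tuple


def feature_map(x: float) -> Tuple[float, float]:
    """Assumed map from 1 dimension to 2: phi(x) = (x, x^2)."""
    return (x, x * x)


def discriminant(x: float, w: Tuple[float, float] = (0.0, 1.0), b: float = -1.5) -> float:
    """D(x) = w . phi(x) + b, with hand-picked (not trained) w and b."""
    phi = feature_map(x)
    return w[0] * phi[0] + w[1] * phi[1] + b


def separable_by_threshold(xs: Sequence[float], labels: Sequence[int]) -> bool:
    """True if some threshold c and orientation separate the classes in 1-D."""
    points = sorted(xs)
    thresholds = (
        [points[0] - 1.0]
        + [(a + b) / 2.0 for a, b in zip(points, points[1:])]
        + [points[-1] + 1.0]
    )
    for c in thresholds:
        for sign in (1, -1):
            if all(sign * (x - c) * t > 0 for x, t in zip(xs, labels)):
                return True
    return False


# Hypothetical data: class +1 on the outside, class -1 in the middle.
xs = [-2.0, -1.8, -0.5, 0.0, 0.4, 1.7, 2.2]
labels = [1, 1, -1, -1, -1, 1, 1]

print("separable in 1-D:", separable_by_threshold(xs, labels))
print("separated after mapping:", all(discriminant(x) * t > 0 for x, t in zip(xs, labels)))
```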

Independent Component Analysis

Basic Idea

  • Assume X = (x1,..,xN)^T is a linear sum X = AS of independent sources S = (s1,..,sN)^T. Both A, the mixing matrix, and S are unknown.
  • Find a de-mixing matrix T such that the components of U = TX are statistically independent
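A two-source toy version of this idea can be written in a few lines. The sketch below is one simple approach among many, and is not claimed to be the method discussed in the talk. It whitens the mixed signals, then scans rotation angles for the one that makes the components as non-Gaussian as possible, using the sum of absolute excess kurtoses as an assumed contrast function. The uniform sources, mixing matrix and seed are hypothetical.

```python
import math
import random
from typing import List, Tuple


def center(values: List[float]) -> List[float]:
    m = sum(values) / len(values)
    return [v - m for v in values]


def excess_kurtosis(values: List[float]) -> float:
    """Excess kurtosis of a zero-mean sample (close to zero for a Gaussian)."""
    n = len(values)
    m2 = sum(v * v for v in values) / n
    m4 = sum(v ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def whiten(x1: List[float], x2: List[float]) -> Tuple[List[float], List[float]]:
    """Return two uncorrelated, unit-variance combinations of x1 and x2."""
    x1, x2 = center(x1), center(x2)
    n = len(x1)
    a = sum(u * u for u in x1) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    d = sum(v * v for v in x2) / n
    z1 = [u / math.sqrt(a) for u in x1]
    residual_sd = math.sqrt(d - b * b / a)
    z2 = [(v - (b / a) * u) / residual_sd for u, v in zip(x1, x2)]
    return z1, z2


def rotate(z1, z2, theta: float):
    c, s = math.cos(theta), math.sin(theta)
    u1 = [c * p + s * q for p, q in zip(z1, z2)]
    u2 = [-s * p + c * q for p, q in zip(z1, z2)]
    return u1, u2


def demix(x1, x2, n_angles: int = 180):
    """Whiten, then pick the rotation that maximizes total |excess kurtosis|."""
    z1, z2 = whiten(x1, x2)
    angles = (k * (math.pi / 2.0) / n_angles for k in range(n_angles))
    best = max(angles, key=lambda t: sum(abs(excess_kurtosis(u)) for u in rotate(z1, z2, t)))
    return rotate(z1, z2, best)


def correlation(u: List[float], v: List[float]) -> float:
    u, v = center(u), center(v)
    norm = math.sqrt(sum(p * p for p in u) * sum(q * q for q in v))
    return sum(p * q for p, q in zip(u, v)) / norm


# Hypothetical independent sources and mixing matrix A = [[1.0, 0.5], [0.3, 1.0]].
rng = random.Random(0)
s1 = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
s2 = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
x1 = [1.0 * a + 0.5 * b for a, b in zip(s1, s2)]
x2 = [0.3 * a + 1.0 * b for a, b in zip(s1, s2)]

u1, u2 = demix(x1, x2)
print("|corr(u1, s1)| =", abs(correlation(u1, s1)), "|corr(u1, s2)| =", abs(correlation(u1, s2)))
```

The recovered components match the sources only up to ordering, sign and scale, which is an inherent ambiguity of the problem because both A and S are unknown.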
