Phystat05: Trip Report

A. Kreshuk, L. Moneta

SFT group meeting


Phystat History

• Started in Jan. 2000 at CERN: Workshop on Confidence Limits, organized by F. James and L. Lyons. Only particle physicists.
• Fermilab (2000): still focused on limits
• Durham (2002): wider range of statistical topics in HEP (also parton distributions)
• SLAC (2003): participation from astronomers and many statisticians

Phystat05, Oxford, 12-15 September 2005

• Treated various topics related to statistics (including software)
• Contributions from people from the high energy physics, astronomy and statistics communities
• ~ 80 people

Conference Program

Plenary sessions: half physicists, half statisticians

Conference Program (2)

Parallel sessions (Monday + Wednesday afternoons):
• Software
• Event classification
• Limits

Conference Topics

• Frequentist vs Bayesian
• Confidence limits
• Nuisance parameters problem
• Multivariate analysis (event classification)
• Statistical software and tools
• Astrophysics
• Goodness of fit
• Unfolding, time series, ...

Frequentist vs Bayesian

Nice review from Sir David Cox (Oxford):
• Frequentist and Bayesian approaches to statistical inference

Problems working with Bayesian analysis, LeDiberder (BaBar):
• Analysis of B
• Problems with prior choice

[Figure panels: frequentist; Bayesian (II)]

Nuisance Parameters

Problem with the statistical treatment of uncertainties in nuisance parameters.

Typical problem: $N_{obs} = \sigma \cdot L \cdot A + b$

Uncertainty in the background $b$ and the acceptance $A$ affects the estimate of the physical parameter $\sigma$.
• Statistical uncertainties: number of events in side bands
• Systematic uncertainties: shape of background

Coverage of these parameters is required in a frequentist analysis.

Importance for LHC (see Kyle Cranmer's talk)

Kyle Cranmer: Statistical Challenges of the LHC

Why 5σ? (Gary Feldman, PHYSTAT 05, 15 September 2005)

• LHC searches: 500 searches, each of which has 100 resolution elements (mass, angle bins, etc.), so 5 × 10^4 chances to find something.
• One experiment: false positive rate at 5σ is (5 × 10^4)(3 × 10^-7) = 0.015. OK.
• Two experiments: allowable false positive rate is 10. 2 (5 × 10^4)(1 × 10^-4) = 10, so 3.7σ is required.
• Required verification by the other experiment: (1 × 10^-3)(10) = 0.01, so 3.1σ is required.
• Caveats: Is the significance real? Are there common systematic errors?
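The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it assumes one-sided Gaussian tail probabilities, and the function names `p_value` and `significance` are illustrative, not taken from the talk.

```python
import math

def p_value(z):
    """One-sided Gaussian tail probability for a significance of z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def significance(p):
    """Invert p_value by bisection: the z whose one-sided tail equals p."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if p_value(mid) > p:
            lo = mid   # tail still too large, need a higher z
        else:
            hi = mid
    return 0.5 * (lo + hi)

n_trials = 500 * 100  # searches x resolution elements = 5e4

# One experiment claiming 5 sigma: expected number of false positives.
false_pos_one = n_trials * p_value(5.0)

# Two experiments, 10 false positives tolerated in total.
p_first = 10.0 / (2 * n_trials)       # per-element p-value, 1e-4
z_first = significance(p_first)       # about 3.7 sigma

# Verification by the other experiment: 10 candidates reduced to 0.01.
p_confirm = 0.01 / 10.0               # 1e-3
z_confirm = significance(p_confirm)   # about 3.1 sigma
```

The slide rounds the 5σ tail probability to 3 × 10^-7, which is why it quotes 0.015 where the unrounded product is slightly smaller.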

Setting Limits with Nuisance Parameters

Various techniques were presented to set limits with nuisance parameters:
• Bayesian methods (used by CDF)
• Profile likelihood (Rolke); method used in MINUIT (MINOS)
• Full Neyman construction (Punzi, Cranmer)

It is important to check coverage whatever method is chosen:
• Important for claiming 5σ discoveries at LHC
• Comparison with the Cousins-Highland technique used at LEP

14th October SFT group meeting 12





Bayesian with Coverage

Joel Heinrich presented a decision by CDF to do Bayesian analyses with priors that cover. Advantage is Bayesian conditioning with frequentist coverage. Possibly the maximum amount of work for the experimenter.

Example of coverage with a single Poisson with normalization and background nuisance parameters:

[Figure: coverage with flat priors]
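What "checking coverage" of a Bayesian limit means can be sketched on a much simpler case than the one shown at the conference. The code below assumes a single Poisson count with an exactly known background `b` (no nuisance-parameter uncertainties) and a flat prior on the signal `s >= 0`; it is not the CDF analysis, and all names are illustrative.

```python
import math

def poisson_cdf(n, mu):
    """P(N <= n) for a Poisson distribution with mean mu."""
    term = total = math.exp(-mu)
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total

def upper_limit(n, b, cl=0.90):
    """Bayesian upper limit on s for n observed events, flat prior on s >= 0.

    With a flat prior the posterior tail above s_up equals
    poisson_cdf(n, s_up + b) / poisson_cdf(n, b); solve for tail = 1 - cl.
    """
    target = (1.0 - cl) * poisson_cdf(n, b)
    lo, hi = 0.0, 50.0 + 10.0 * n
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n, mid + b) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def coverage(s_true, b, cl=0.90, n_max=60):
    """Frequentist coverage: probability that the limit is >= s_true.

    n_max truncates the Poisson sum, so s_true + b should be well below it.
    """
    mu = s_true + b
    prob = math.exp(-mu)
    total = 0.0
    for n in range(n_max):
        if n > 0:
            prob *= mu / n
        if upper_limit(n, b, cl) >= s_true:
            total += prob
    return total
```

Scanning `coverage(s, b)` over a grid of true signal values shows where the limit over- or undercovers relative to the nominal 90%.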


Profile Likelihood Method

Rolke:
• Eliminates the nuisance parameters via the profile likelihood
• Neyman construction replaced by the -lnL hill-climbing approximation
• Same method present in MINUIT (MINOS)
• The coverage is good, with some minor undercoverage
• Also present in ROOT in the class TRolke

[Figure: axes are signal rate and Bkg rate]
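A minimal sketch of the profile likelihood idea, assuming the simplest model: `n` events in the signal region with mean `s + b`, and `m` events in a sideband with mean `tau * b`. It is not the TRolke or MINOS implementation; it assumes `n > 0`, `m > 0` and a positive best-fit signal, and all function names are illustrative.

```python
import math

def log_likelihood(s, b, n, m, tau):
    """ln L up to constants for n ~ Poisson(s + b), m ~ Poisson(tau * b)."""
    return n * math.log(s + b) - (s + b) + m * math.log(tau * b) - tau * b

def profiled_background(s, n, m, tau):
    """Background that maximises the likelihood at fixed s.

    Setting d(ln L)/db = 0 gives (1+tau) b^2 + ((1+tau) s - n - m) b - m s = 0.
    """
    a = 1.0 + tau
    lin = a * s - n - m
    return (-lin + math.sqrt(lin * lin + 4.0 * a * m * s)) / (2.0 * a)

def delta_nll(s, n, m, tau):
    """-2 ln(lambda(s)): twice the drop of the profile ln L from its maximum."""
    s_hat, b_hat = n - m / tau, m / tau
    best = log_likelihood(s_hat, b_hat, n, m, tau)
    prof = log_likelihood(s, profiled_background(s, n, m, tau), n, m, tau)
    return 2.0 * (best - prof)

def profile_interval(n, m, tau, threshold=1.0):
    """Range of s with -2 ln(lambda) <= threshold (1.0 is roughly 68% CL)."""
    s_hat = n - m / tau

    def bisect(inside, outside):
        for _ in range(100):
            mid = 0.5 * (inside + outside)
            if delta_nll(mid, n, m, tau) <= threshold:
                inside = mid
            else:
                outside = mid
        return inside

    lower = 0.0 if delta_nll(0.0, n, m, tau) <= threshold else bisect(s_hat, 0.0)
    upper = bisect(s_hat, s_hat + 10.0 * math.sqrt(n) + 10.0)
    return lower, upper
```

For example, `profile_interval(10, 10, 2.0)` brackets the best-fit signal of 5 events. Climbing the -lnL curve until it rises by a fixed amount is what replaces the full Neyman construction here.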



Full Neyman Constructions

Both Giovanni Punzi and Kyle Cranmer attempted full Neyman constructions for both signal and nuisance parameters.

I don’t recommend you try this at home for the following reasons:

The ordering principle is not unique. Both Punzi and Cranmer ran into some problems.

The technique is not feasible for more than a few nuisance parameters.

It is unnecessary since removing the nuisance parameters through profile likelihood works quite well.



Event Classification

The problem: Given a measurement of an event X = (x1, x2, …, xn), find the function F(X) which returns 1 if the event is signal (s) and 0 if the event is background (b) to optimize a figure of merit, say $s/\sqrt{b}$ for discovery or $s/\sqrt{s+b}$ for an established signal.
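The two figures of merit can prefer different selections. The sketch below uses made-up expected yields for three hypothetical cut values, purely for illustration.

```python
import math

def discovery_fom(s, b):
    """s / sqrt(b): figure of merit when searching for a new signal."""
    return s / math.sqrt(b)

def established_fom(s, b):
    """s / sqrt(s + b): figure of merit when measuring a known signal."""
    return s / math.sqrt(s + b)

# Hypothetical scan: (cut value, expected signal, expected background).
scan = [(0.2, 90.0, 400.0), (0.5, 60.0, 100.0), (0.8, 30.0, 16.0)]

best_discovery = max(scan, key=lambda row: discovery_fom(row[1], row[2]))
best_established = max(scan, key=lambda row: established_fom(row[1], row[2]))
```

With these numbers the discovery figure of merit favours the tightest cut (0.8), while the established-signal one favours the intermediate cut (0.5).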


Theoretical Solution

In principle the solution is straightforward: Use a Monte Carlo simulation to calculate the likelihood ratio Ls(X)/Lb(X) and derive F(X) from it. By the Neyman-Pearson Theorem, this is the optimum solution.

Unfortunately, this does not work due to the “curse of dimensionality.” In a high-dimension space, even the largest data set is sparse with the distance between neighboring events comparable to the radius of the space.


Practical Solution

Use brute force from computers. One gives the computer samples of signal and background events and lets the computer figure out what F(X) is.
• Artificial neural networks
• Decision trees

Interest sparked by J. Friedman's talk at Phystat03. Recent techniques increase decision power by effectively combining many trees, i.e. boosted decision trees.


Decision Tree

• Go through all PID variables and find best variable and value to split events.

• For each of the two subsets repeat the process

• Proceeding in this way a tree is built.

• Ending nodes are called leaves.
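The steps above can be sketched as follows. The Gini impurity as splitting criterion and the `max_depth` stopping rule are assumptions; the slide does not say which criterion or stopping rule was used. Labels are 1 for signal and 0 for background.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(events, labels):
    """Scan all variables and cut values; return (variable, cut, impurity)."""
    best = None
    n = len(events)
    for var in range(len(events[0])):
        for cut in sorted({e[var] for e in events}):
            left = [y for e, y in zip(events, labels) if e[var] < cut]
            right = [y for e, y in zip(events, labels) if e[var] >= cut]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (var, cut, score)
    return best

def build_tree(events, labels, depth=0, max_depth=3):
    """Split recursively; a leaf stores the signal fraction of its events."""
    mixed = 0 < sum(labels) < len(labels)
    split = best_split(events, labels) if mixed and depth < max_depth else None
    if split is None:
        return {"leaf": sum(labels) / len(labels)}
    var, cut, _ = split
    left = [(e, y) for e, y in zip(events, labels) if e[var] < cut]
    right = [(e, y) for e, y in zip(events, labels) if e[var] >= cut]
    return {
        "var": var,
        "cut": cut,
        "left": build_tree([e for e, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([e for e, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }

def classify(tree, event):
    """Follow the cuts down to a leaf and return its signal fraction."""
    while "leaf" not in tree:
        tree = tree["left"] if event[tree["var"]] < tree["cut"] else tree["right"]
    return tree["leaf"]
```

A leaf with signal fraction above 0.5 would be called a signal leaf, the others background leaves.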

Background/Signal


Rules and Bagging Trees

Jerry Friedman gave a talk on rules, which effectively combines a series of trees.

Harrison Prosper gave a talk (for Ilya Narsky) on bagging (Bootstrap AGGregatING) trees. In this technique, one builds a collection of trees by selecting a sample of the training data and, optionally, a subset of the variables.

Results on significance of B e at BaBar

Single decision tree 2.16 Boosted decision trees 2.62 (not optimized)

Bagging decision trees 2.99


Boosted Decision Trees

Use of boosted trees in MiniBooNE (B. Roe):
• Misclassified events in one tree are given a higher weight and a new tree is generated.
• Repeat to generate 1000 trees.
• The final classifier is a weighted sum of all of the trees.
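The reweighting loop can be illustrated with the textbook AdaBoost update. This sketch assumes single-cut "stumps" in place of full trees and labels of +1 (signal) and -1 (background); it is not the MiniBooNE code, and the exact weighting used there may differ.

```python
import math

def train_stump(events, labels, weights):
    """Best weighted single cut: (weighted error, variable, cut, sign)."""
    best = None
    for var in range(len(events[0])):
        for cut in sorted({e[var] for e in events}):
            for sign in (+1, -1):
                err = sum(w for e, y, w in zip(events, labels, weights)
                          if (sign if e[var] >= cut else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, var, cut, sign)
    return best

def stump_predict(stump, event):
    """Return +1 or -1 depending on which side of the cut the event falls."""
    _, var, cut, sign = stump
    return sign if event[var] >= cut else -sign

def adaboost(events, labels, n_trees=20):
    """Train up to n_trees stumps, reweighting the events after each one."""
    n = len(events)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_trees):
        stump = train_stump(events, labels, weights)
        err = max(stump[0], 1e-12)   # avoid log of infinity for a perfect stump
        if err >= 0.5:               # no better than guessing: stop
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified events get a higher weight, the others a lower one.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, e))
                   for e, y, w in zip(events, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_score(ensemble, event):
    """Weighted sum of all stump outputs; positive means signal-like."""
    return sum(alpha * stump_predict(stump, event) for alpha, stump in ensemble)
```

On a toy sample where background sits between two signal regions, no single cut separates the classes, but the weighted sum of three stumps does.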

Comparison with neural networks (ANN):
• Boosting better than ANN by 1.2-1.8
• More robust

[Figure: ANN / boosted trees bkg events versus % of signal retained, for 21 variables and 52 variables]

StatPatternRecognition: A C++ package for multivariate classification (slide from Harrison Prosper)

Implemented classifiers and algorithms:
• binary split
• linear and quadratic discriminant analysis
• decision trees
• bump hunting algorithm (PRIM, Friedman & Fisher)
• AdaBoost
• bagging and random forest algorithms
• AdaBoost and Bagger are capable of boosting/bagging any classifier implemented in the package

Described in: I. Narsky, physics/0507143 and physics/0507157


More on classification

Gray: how to do Bayesian optimal classification with massive datasets: nonparametric Bayesian classifiers

[Figure: star density and quasar density f(x), with the optimal decision boundary]


Trip report (Part 2)

PHYSTAT 05 - Oxford 12th - 15th September 2005

Statistical Problems in Particle Physics, Astrophysics and Cosmology


Outline

• Statistical software for physics
• Some new algorithms for physics
• Astronomy


Software for Statistics (for Physics) by Jim Linnemann (1)

“R is a language and environment for statistical computing and graphics”

R is the standard tool of professional research statisticians:
• Elegant data manipulation language
• Command prompt and macros, interpreted, no GUI yet
• Very broad package library, trivial download and extension

An interface between R and ROOT:
• ROOT TTrees can be read from the R prompt
• Vice versa doesn’t work yet


Software for Statistics (for Physics) by Jim Linnemann (2)

Web page of statistical resources: http://www.pa.msu.edu/people/linnemann/stat_resources.html

Contains links to:
• High Energy Physics analysis software
• Astrophysics analysis software
• General purpose statistical resources
• Multivariate analysis and statistical learning


Software for Statistics (for Physics) by Jim Linnemann (3)

Proposed to create a physics-oriented repository of statistical software. Now being discussed with the Fermilab Computing Division.

Hierarchy of purposes:
• Archive for software associated with papers
• Small packages: calculation of significance, limits, goodness-of-fit tests
• Packages
• Medium-sized packages: MC, TerraFerma, StatPatternRecognition
• Component library


R

“Easy data analysis using R” by Marc Paterno

• R is “an implementation of the S language”
• John Chambers, the author of S, received the 1998 ACM Software System Award for “the S system, which has forever altered the way people analyze, visualize, and manipulate data ... S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers.”
• Available as Free Software under the GNU General Public License


R – statistical plots

R provides a variety of useful plot types which are not widely known to the physics community, including:
• dot plot: replacement for pie charts and bar charts
• splom: scatter plot matrix, showing all pairwise correlations for a set of variables
• box-and-whisker plot: for summary comparison of a large number of 1d distributions
• quantile and QQ plot: for sensitive comparison of 2 distributions

There are more special-purpose plots, and many statistical tools come with dedicated plot styles.


R – a boxplot

Multiple boxplots are more informative than profile histograms in case of asymmetric distributions or outliers in data


R – a scatter plot matrix

The scatter plot matrix is an interesting tool for quickly identifying pairs of quantities with interesting relationships.
• Interesting correlations are easily visible
• Unbinned: no features lost

[Figure: toy jet resolution simulation]


R – QQ plots

Studies show that human perception is poor at evaluating similar histograms.

Quantile-quantile plots are simpler to analyze.

We clearly see even a small difference: the second jet’s NoCs distribution has a larger high-end tail.
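What a QQ plot displays can be sketched without any plotting library: the same quantiles of two samples paired up as (x, y) points. The talk used R; the Python below is only an illustration, and the linear-interpolation quantile rule is an assumption.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac

def qq_points(sample_a, sample_b, n_points=9):
    """Pairs (quantile of a, quantile of b) at evenly spaced probabilities."""
    a, b = sorted(sample_a), sorted(sample_b)
    probs = [(i + 1) / (n_points + 1) for i in range(n_points)]
    return [(quantile(a, p), quantile(b, p)) for p in probs]

# Toy samples: identical except that sample_b has a stretched high-end tail.
sample_a = [float(x) for x in range(100)]
sample_b = [x if x < 80 else 80.0 + 3.0 * (x - 80.0) for x in sample_a]
points = qq_points(sample_a, sample_b)
```

Points on the diagonal mean the two distributions agree at that quantile. Here only the last point moves above the diagonal, which is the signature of a larger high-end tail in the second sample.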


R

An R session can be saved to disk and the application state recovered at a later time.
• The saved session is platform neutral

R can read many data formats:
• Text files, common spreadsheet formats
• Oracle, MySQL, SQLite or any ODBC database
• DCOM and CORBA
• Formats of other statistical packages
• Even ROOT TTrees now: a local development at Fermilab allows reading “simple” trees


R

Additional functionality comes in packages:
• Users have all the tools to create and distribute their own packages
• Discovery and installation of new packages is easy
• A uniform documentation model is observed
• At this moment there are 590 add-on packages available in the main repository, CRAN
• Many of these packages present not just one tool, but a large family of tools


Goodness-of-Fit toolkit

Maria Grazia Pia presented an update on the Goodness-of-Fit toolkit.

Algorithms for binned distributions:
• Anderson-Darling test
• Chi-squared test
• Fisz-Cramer-von Mises test
• Tiku test (Cramer-von Mises test in chi-squared approximation)

Algorithms for unbinned distributions:
• Anderson-Darling test
• Cramer-von Mises test
• Goodman test (Kolmogorov-Smirnov test in chi-squared approximation)
• Kolmogorov-Smirnov test
• Kuiper test
• Tiku test (Cramer-von Mises test in chi-squared approximation)

Goal: provide all 2-sample GoF tests existing in the statistical literature
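As an illustration of one of the unbinned two-sample tests listed above, the Kolmogorov-Smirnov statistic is the largest distance between the two empirical cumulative distributions. The sketch below computes only the statistic, not the p-value, and does not reproduce the toolkit's interface.

```python
def ks_two_sample(sample_a, sample_b):
    """Maximum distance between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d_max = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d_max = max(d_max, abs(i / len(a) - j / len(b)))
    # Once one sample is exhausted the distance can only shrink.
    return d_max
```

Identical samples give 0, fully separated samples give 1.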


sPlot (1)

• A statistical tool to unfold data distributions
• To be added to ROOT soon
• Several publications from BaBar using sPlot
• physics/0402083, to be published in NIM


sPlot (2)


Data sifting

A new algorithm for outlier detection and fitting, presented by Martin Block. To be used in the case of a Gaussian signal with Gaussian errors, with outliers “far away” from the good points (no “swamping”).

• First, a Lorentzian minimization is performed for all data points.
• Then this Lorentzian fit is used as the initial estimate of the theoretical curve, and the chi-square of each point w.r.t. this curve is computed.
• A cut is applied to reject the points too far from the curve, and a chi-square fit of the remaining points is performed.
• Parameters and parameter errors are estimated, with a renormalization to take into account that the dataset has been truncated.
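The first three steps can be sketched for the simplest possible model, a constant fitted to points with Gaussian errors. Several things here are assumptions and not taken from the slide: the form of the Lorentzian objective, the constant `gamma = 0.179`, the cut value of 6 and the grid-scan minimizer. The final renormalization of the errors for the truncation is left out.

```python
import math

def lorentzian_objective(c, ys, sigmas, gamma=0.179):
    """Robust objective: sum of ln(1 + gamma * chi2_i) for a constant c."""
    return sum(math.log(1.0 + gamma * ((y - c) / s) ** 2)
               for y, s in zip(ys, sigmas))

def sift_constant(ys, sigmas, chi2_cut=6.0, n_grid=2001):
    """Robust fit, outlier cut, then chi-square fit of the kept points."""
    # Step 1: Lorentzian fit of a constant by a brute-force grid scan.
    lo, hi = min(ys), max(ys)
    grid = [lo + (hi - lo) * k / (n_grid - 1) for k in range(n_grid)]
    robust = min(grid, key=lambda c: lorentzian_objective(c, ys, sigmas))
    # Step 2: chi-square of each point relative to the robust fit, then cut.
    kept = [(y, s) for y, s in zip(ys, sigmas)
            if ((y - robust) / s) ** 2 <= chi2_cut]
    # Step 3: ordinary chi-square fit (weighted mean) of the surviving points.
    weights = [1.0 / s ** 2 for _, s in kept]
    mean = sum(w * y for w, (y, _) in zip(weights, kept)) / sum(weights)
    error = 1.0 / math.sqrt(sum(weights))  # not renormalized for truncation
    return mean, error, len(kept)

# Toy data: five consistent points and one far-away outlier.
ys = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]
sigmas = [0.2] * 6
mean, error, n_kept = sift_constant(ys, sigmas)
```

The logarithm makes the robust fit nearly insensitive to the outlier, so the cut in the second step removes it cleanly and the final fit uses only the five consistent points.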


Astrophysics


Similarity to HEP

• HEP: Van de Graaff, cyclotrons, national labs, international (CERN), SSC vs LHC
• Optical astronomy: 2.5m telescopes, 4m telescopes, 8m class telescopes, surveys/time domain, 30-100m telescopes

Similar trends with a 20 year delay:
• fewer and ever bigger projects…
• increasing fraction of cost is in software…
• more conservative engineering…

Can the exponential continue, or will it be logistic?
What can astronomy learn from High Energy Physics?

Alex Szalay, Johns Hopkins University


Why is astronomy different?

• Especially attractive for the wide public
• It has no commercial value
  • No privacy concerns, freely share results with others
  • Great for experimenting with algorithms
• Data has more dimensions
  • Spatial, temporal, cross-correlations
• Diverse and distributed
  • Many different instruments from many different places and many different times
• Many different interesting questions

Alex Szalay, Johns Hopkins University


Data in astronomy

Astronomers have a few hundred TB now

• Data doubles every year
• Data is public after 1 year
• Same access for everyone


Today’s questions

• Discoveries: need fast outlier detection
• Spatial statistics
  • Fast correlation and power spectrum codes (CMB + galaxies)
  • Cross-correlations among different surveys (sky pixelization + fast harmonic transforms on sphere)
• Time-domain:
  • Transients, supernovae, periodic variables
  • Moving objects, ‘killer’ asteroids, Kuiper-belt objects…


Other challenges

• Statistical noise is smaller and smaller
  • Error matrix larger and larger (Planck…)
• Systematic errors becoming dominant
  • De-sensitize against known systematic errors
  • Optimal subspace filtering (…SDSS stripes…)
• Comparisons of spectra to models
  • 10^6 spectra vs 10^8 models (Charlot…)
• Detection of faint sources in multi-spectral images
  • How to use all information optimally (QUEST…)
• Efficient visualization of ensembles of 100M+ data points


Virtual Observatory

International Virtual Observatory Alliance: formed in June 2002 with a mission to facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.

Aim: all astronomical data accessible from a desktop


Virtual observatory


Summary for astronomy

• Databases became an essential part of astronomy: most data access will soon be via digital archives
• Data at separate locations, distributed worldwide, evolving in time: move analysis, not data!
• Good scaling of statistical algorithms is essential
• Many outstanding problems in astronomy are statistical, current techniques are inadequate, we need help!
• The Virtual Observatory is a new paradigm for doing science: the science of Data Exploration!


Conclusions

• The conference gave a good picture of the general trends in statistics for HEP and astronomy
• There are more interesting algorithms that their authors would like to see in ROOT; discussions are going on
• There are things that people find useful in other systems and that we don’t have in ROOT yet and should add in the near future
• A very interesting conference!


References

• Conference Web site (talk slides are attached to the program): http://www.physics.ox.ac.uk/phystat05/programme.htm
• For more information on the subjects, look at the recommended readings page: http://www.physics.ox.ac.uk/phystat05/reading.htm
• The conference proceedings are expected to be available online soon
