SFT group meeting 1
Phystat05 : Trip Report
A. Kreshuk, L. Moneta
Phystat History
• Started in January 2000 at CERN with the Workshop on Confidence Limits, organized by F. James and L. Lyons; particle physicists only
• Fermilab (2000): still focused on limits
• Durham (2002): wider range of statistical topics in HEP (also parton distributions)
• SLAC (2003): participation from astronomers and many statisticians
Phystat05, Oxford, 12-15 September 2005
• Treated various topics related to statistics (including software)
• Contributions from people in the high energy physics, astronomy and statistics communities
• ~80 people
Conference Program
Plenary sessions
• half physicists
• half statisticians
Conference Program (2)
Parallel sessions (Monday + Wednesday afternoons)
• Software
• Event classification
• Limits
Conference Topics
• Frequentist vs Bayesian
• Confidence limits
• Nuisance parameters problem
• Multivariate analysis (event classification)
• Statistical software and tools
• Astrophysics
• Goodness of fit
• Unfolding, time series, ...
Frequentist vs Bayesian
• Nice review from Sir David Cox (Oxford): frequentist and Bayesian approaches to statistical inference
• Problems working with Bayesian analysis: Le Diberder (BaBar)
  • Analysis of B
  • Problems with prior choice
[Figure: panels labelled “frequentist” and “Bayesian (II)”]
Nuisance Parameters
• Problem: statistical treatment of uncertainties in nuisance parameters
• Typical problem: Nobs = σ * L * A + b
  Uncertainties in the background b and the acceptance A affect the estimate of the physical parameter σ.
• Statistical uncertainties: number of events in side bands
• Systematic uncertainties: shape of background
• Coverage of these parameters is required in a frequentist analysis
• Importance for the LHC (see Kyle Cranmer’s talk)
Kyle Cranmer : Statistical Challenges of the LHC
Gary Feldman PHYSTAT 05 15 September 2005 10
Why 5σ?
• LHC searches: 500 searches, each of which has 100 resolution elements (mass, angle bins, etc.): 500 x 100 = 5 x 10^4 chances to find something.
• One experiment: false positive rate at 5σ: (5 x 10^4)(3 x 10^-7) = 0.015. OK.
• Two experiments: allowable false positive rate: 10. 2 (5 x 10^4)(1 x 10^-4) = 10, so 3.7σ required.
• Required verification by the other experiment: (1 x 10^-3)(10) = 0.01, so 3.1σ required.
• Caveats: Is the significance real? Are there common systematic errors?
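The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it assumes one-sided Gaussian tail probabilities, and the helper name `one_sided_p` is ours, not from the talk.

```python
import math

def one_sided_p(z):
    """One-sided Gaussian tail probability P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# 500 searches x 100 resolution elements = 5e4 chances to find something.
n_chances = 500 * 100

# One experiment claiming a discovery at 5 sigma.
fp_single = n_chances * one_sided_p(5.0)

# Two experiments, each allowed to claim a signal at 3.7 sigma.
fp_two = 2 * n_chances * one_sided_p(3.7)

# Each such claim must then be confirmed by the other experiment at 3.1 sigma.
fp_confirmed = fp_two * one_sided_p(3.1)

print(f"expected false positives, one experiment at 5 sigma: {fp_single:.3f}")
print(f"expected false claims, two experiments at 3.7 sigma: {fp_two:.1f}")
print(f"surviving a 3.1 sigma confirmation:                  {fp_confirmed:.3f}")
```

The three numbers come out close to the rounded values quoted on the slide (0.015, 10 and 0.01).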
Setting Limits with Nuisance Parameters
• Various techniques were presented to set limits with nuisance parameters:
  • Bayesian methods (used by CDF)
  • Profile likelihood (Rolke); the method used in MINUIT (MINOS)
  • Full Neyman construction (Punzi, Cranmer)
• It is important to check coverage whatever method is chosen
  • Important for claiming 5σ discoveries at the LHC
• Comparison with the Cousins-Highland technique used at LEP
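As an illustration of what checking coverage means, the sketch below computes the exact coverage of an interval recipe for a single Poisson count with no nuisance parameters. The recipe tested here, n ± √n, is our own deliberately naive choice and is not one of the methods presented at the conference; the function names are hypothetical.

```python
import math

def poisson_pmf(n, mu):
    """Poisson probability P(n | mu) for mu > 0, computed in log space."""
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def naive_interval(n):
    """Illustrative recipe: n +/- sqrt(n), nominally 68.3% in the Gaussian limit."""
    half_width = math.sqrt(n)
    return n - half_width, n + half_width

def coverage(mu, interval=naive_interval, n_max=200):
    """Exact coverage: total probability of the counts n whose interval contains mu."""
    total = 0.0
    for n in range(n_max + 1):
        lo, hi = interval(n)
        if lo <= mu <= hi:
            total += poisson_pmf(n, mu)
    return total

for mu in (0.5, 1.0, 3.0, 10.0):
    print(f"mu = {mu:5.1f}   coverage = {coverage(mu):.3f}")
```

Any other recipe can be passed in through the `interval` argument and compared with its nominal confidence level; with nuisance parameters the same scan has to be repeated over their values as well.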
14th October SFT group meeting 12
Bayesian with Coverage
Joel Heinrich presented a decision by CDF to do Bayesian analyses with priors that cover. Advantage is Bayesian conditioning with frequentist coverage. Possibly the maximum amount of work for the experimenter.
Example of coverage with a single Poisson with normalization and background nuisance parameters:
[Figure: coverage plot, flat priors]
Profile Likelihood
• Method of Rolke: eliminate the nuisance parameters via the profile likelihood
• The Neyman construction is replaced by the -lnL hill-climbing approximation
• The same method is present in MINUIT (MINOS)
• The coverage is good, with some minor undercoverage
• Also present in ROOT in the class TRolke
[Figure: axes “Bkg rate” and “signal rate”]
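A minimal sketch of the profile likelihood idea, assuming a simple counting model that is not taken from the talk: n events in the signal region with mean s + b, and m events in a side band with mean tau * b. The nuisance parameter b is eliminated by maximizing the likelihood for each fixed s, and the interval is read off where -2 ln(lambda) rises by 1 (roughly 68% in the asymptotic limit). This is not the TRolke or MINOS code; all names are hypothetical.

```python
import math

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def log_like(s, b, n, m, tau):
    """Log-likelihood (constants dropped): n ~ Poisson(s + b), m ~ Poisson(tau * b)."""
    return xlogy(n, s + b) - (s + b) + xlogy(m, tau * b) - tau * b

def profiled_b(s, n, m, tau):
    """Background that maximizes the likelihood at fixed s (root of a quadratic)."""
    a = 1.0 + tau
    c = a * s - n - m
    return (-c + math.sqrt(c * c + 4.0 * a * m * s)) / (2.0 * a)

def q(s, n, m, tau):
    """-2 ln(lambda(s)): profile likelihood ratio relative to the best fit (s >= 0)."""
    s_hat = max(n - m / tau, 0.0)
    best = log_like(s_hat, profiled_b(s_hat, n, m, tau), n, m, tau)
    return -2.0 * (log_like(s, profiled_b(s, n, m, tau), n, m, tau) - best)

def crossing(lo, hi, n, m, tau, level=1.0, steps=60):
    """Bisection for q(s) = level, assuming q - level changes sign on [lo, hi]."""
    f_lo = q(lo, n, m, tau) - level
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        f_mid = q(mid, n, m, tau) - level
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy numbers: 10 events observed, 15 in a side band 5 times larger.
n, m, tau = 10, 15, 5.0
s_hat = n - m / tau
lower = crossing(0.0, s_hat, n, m, tau)
upper = crossing(s_hat, s_hat + 10.0 * math.sqrt(n), n, m, tau)
print(f"s = {s_hat:.2f}, interval [{lower:.2f}, {upper:.2f}]")
```

The lower crossing only exists when q(0) exceeds the chosen level, which holds for these toy numbers; near the physical boundary a real implementation needs extra care.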
Full Neyman Constructions
Both Giovanni Punzi and Kyle Cranmer attempted full Neyman constructions for both signal and nuisance parameters.
I don’t recommend you try this at home for the following reasons:
The ordering principle is not unique. Both Punzi and Cranmer ran into some problems.
The technique is not feasible for more than a few nuisance parameters.
It is unnecessary since removing the nuisance parameters through profile likelihood works quite well.
Event Classification
The problem: Given a measurement of an event X = (x1, x2, …, xn), find the function F(X) which returns 1 if the event is signal (s) and 0 if the event is background (b), to optimize a figure of merit, say s/√b for discovery or s/√(s+b) for an established signal.
Theoretical Solution
In principle the solution is straightforward: Use a Monte Carlo simulation to calculate the likelihood ratio Ls(X)/Lb(X) and derive F(X) from it. By the Neyman-Pearson Theorem, this is the optimum solution.
Unfortunately, this does not work due to the “curse of dimensionality.” In a high-dimension space, even the largest data set is sparse with the distance between neighboring events comparable to the radius of the space.
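In one dimension the likelihood-ratio recipe is easy to write down. The sketch below assumes the signal and background densities are known exactly (unit-width Gaussians centred at 2 and 0, our own toy choice), so no Monte Carlo estimate is needed; the difficulty described above only appears when the densities must be estimated from simulated events in many dimensions.

```python
import math

def gauss_pdf(x, mean, sigma=1.0):
    """Gaussian probability density."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood_ratio(x, mean_s=2.0, mean_b=0.0):
    """Ls(x) / Lb(x) for the assumed signal and background densities."""
    return gauss_pdf(x, mean_s) / gauss_pdf(x, mean_b)

def classify(x, threshold=1.0):
    """F(x): 1 for signal, 0 for background, from a cut on the likelihood ratio."""
    return 1 if likelihood_ratio(x) > threshold else 0

for x in (-1.0, 0.5, 1.5, 3.0):
    print(f"x = {x:4.1f}   Ls/Lb = {likelihood_ratio(x):8.3f}   F(x) = {classify(x)}")
```

The threshold would be tuned to the chosen figure of merit; 1.0 is just a placeholder.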
Practical Solution
• Use brute force from computers: one gives the computer samples of signal and background events and lets the computer figure out what F(X) is
  • Artificial neural networks
  • Decision trees
• Interest sparked by J. Friedman’s talk at Phystat03
• Recent techniques increase decision power by effectively combining many trees, e.g. boosted decision trees
Decision Tree
• Go through all PID variables and find best variable and value to split events.
• For each of the two subsets repeat the process
• Proceeding in this way a tree is built.
• Ending nodes are called leaves.
[Figure: decision tree; nodes labelled Background/Signal]
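The first step of the procedure (go through all variables and find the best variable and value to split on) can be sketched as follows. The slide does not say which split criterion was used; the Gini impurity is assumed here, and `gini`, `best_split` and the toy events are hypothetical.

```python
def gini(n_sig, n_bkg):
    """Gini impurity of a node: 2 p (1 - p), with p the signal fraction."""
    total = n_sig + n_bkg
    if total == 0:
        return 0.0
    p = n_sig / total
    return 2.0 * p * (1.0 - p)

def best_split(events, labels):
    """Scan every variable and every cut value.

    Returns (impurity, variable index, cut) for the split that minimizes the
    size-weighted Gini impurity of the two subsets (x < cut and x >= cut).
    """
    best = None
    for var in range(len(events[0])):
        for cut in sorted({e[var] for e in events}):
            left = [y for e, y in zip(events, labels) if e[var] < cut]
            right = [y for e, y in zip(events, labels) if e[var] >= cut]
            if not left or not right:
                continue
            score = (len(left) * gini(sum(left), len(left) - sum(left))
                     + len(right) * gini(sum(right), len(right) - sum(right))) / len(labels)
            if best is None or score < best[0]:
                best = (score, var, cut)
    return best

# Toy events with two variables; label 1 = signal, 0 = background.
events = [(0.1, 5.0), (0.4, 1.0), (0.35, 4.0), (0.8, 2.0), (0.9, 6.0), (0.7, 3.0)]
labels = [0, 0, 0, 1, 1, 1]
score, var, cut = best_split(events, labels)
print(f"split on variable {var} at {cut} (weighted impurity {score:.3f})")
```

Building the tree then amounts to calling `best_split` again on each of the two subsets until a stopping rule (minimum leaf size, for example) is reached.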
Rules and Bagging Trees
Jerry Friedman gave a talk on rules, which effectively combines a series of trees.
Harrison Prosper gave a talk (for Ilya Narsky) on bagging (Bootstrap AGGregatING) trees. In this technique, one builds a collection of trees by selecting a sample of the training data and, optionally, a subset of the variables.
Results on significance of B e at BaBar:
• Single decision tree: 2.16
• Boosted decision trees: 2.62 (not optimized)
• Bagging decision trees: 2.99
Boosted Decision Trees
• Use of boosted trees in MiniBooNE (B. Roe)
• Misclassified events in one tree are given a higher weight and a new tree is generated
• Repeat to generate 1000 trees
• The final classifier is a weighted sum of all of the trees
• Comparison with artificial neural networks (ANN):
  • Boosting better than ANN by a factor of 1.2-1.8
  • More robust
[Figure: ratio of ANN to boosted-tree background events vs. % of signal retained, for 52 and 21 variables]
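The reweighting loop described above can be sketched with the discrete AdaBoost algorithm. To keep the example short, each “tree” is a single split (a stump) and only three are built; this is an assumption for illustration and not the MiniBooNE implementation. All function names are hypothetical.

```python
import math

def stump_predict(x, var, cut, sign):
    """One-split tree: returns sign if x[var] >= cut, otherwise -sign."""
    return sign if x[var] >= cut else -sign

def train_stump(X, y, w):
    """Pick the split (variable, cut, orientation) with the smallest weighted error."""
    best = None
    for var in range(len(X[0])):
        for cut in sorted({x[var] for x in X}):
            for sign in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, var, cut, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, var, cut, sign)
    return best

def adaboost(X, y, n_trees):
    """Discrete AdaBoost; labels y are +1 (signal) or -1 (background)."""
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(n_trees):
        err, var, cut, sign = train_stump(X, y, w)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, var, cut, sign))
        # Misclassified events get a higher weight, the others a lower one.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, var, cut, sign))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def classify(ensemble, x):
    """Sign of the weighted sum of all trees."""
    score = sum(a * stump_predict(x, var, cut, sign) for a, var, cut, sign in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-d sample that no single split can separate: signal, background, signal.
X = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, n_trees=3)
print([classify(ensemble, x) for x in X])
```

On this toy sample the best single stump misclassifies a third of the events, while the weighted sum of three stumps classifies all six correctly.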
StatPatternRecognition: a C++ package for multivariate classification
(Harrison Prosper, Phystat 2005)
• Implemented classifiers and algorithms:
  • binary split
  • linear and quadratic discriminant analysis
  • decision trees
  • bump hunting algorithm (PRIM, Friedman & Fisher)
  • AdaBoost
  • bagging and random forest algorithms
• AdaBoost and Bagger are capable of boosting/bagging any classifier implemented in the package
• Described in: I. Narsky, physics/0507143 and physics/0507157
More on classification
• Gray: how to do Bayesian optimal classification with massive datasets: nonparametric Bayesian classifiers
[Figure: density f(x), showing the star density, the quasar density and the optimal decision boundary]
Trip report (Part 2)
PHYSTAT 05 - Oxford 12th - 15th September 2005
Statistical Problems in Particle Physics, Astrophysics and Cosmology
Outline
• Statistical software for physics
• Some new algorithms for physics
• Astronomy
Software for Statistics (for Physics), by Jim Linnemann (1)
• “R is a language and environment for statistical computing and graphics”
• R is the standard tool of professional research statisticians:
  • Elegant data manipulation language
  • Command prompt and macros, interpreted, no GUI yet
  • Very broad package library, trivial download and extension
• An interface between R and ROOT:
  • ROOT TTrees can be read from the R prompt
  • Vice versa doesn’t work yet
Software for Statistics (for Physics), by Jim Linnemann (2)
• Web page of statistical resources: http://www.pa.msu.edu/people/linnemann/stat_resources.html
• Contains links to:
  • High energy physics analysis software
  • Astrophysics analysis software
  • General purpose statistical resources
  • Multivariate analysis and statistical learning
Software for Statistics (for Physics), by Jim Linnemann (3)
• Proposed to create a physics-oriented repository of statistical software
• Now under discussion with the Fermilab Computing Division
• Hierarchy of purposes:
  • Archive for software associated with papers
  • Small packages: calculation of significance, limits, goodness-of-fit tests
  • Packages
    • Medium-sized packages: MC, TerraFerma, StatPatternRecognition
  • Component library
R
“Easy data analysis using R”, by Marc Paterno
• R is “an implementation of the S language”
• John Chambers, the author of S, received the 1998 ACM Software System Award for “the S system, which has forever altered the way people analyze, visualize, and manipulate data ... S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers.”
• Available as Free Software under the GNU General Public License
R – statistical plots
• R provides a variety of useful plot types which are not widely known to the physics community, including:
  • dot plot: replacement for pie charts and bar charts
  • splom: scatter plot matrix, showing all pairwise correlations for a set of variables
  • box-and-whisker plot: for summary comparison of a large number of 1d distributions
  • quantile and QQ plot: for sensitive comparison of 2 distributions
• There are more special-purpose plots, and many statistical tools come with dedicated plot styles
R – a boxplot
Multiple boxplots are more informative than profile histograms in case of asymmetric distributions or outliers in data
R – a scatter plot matrix
• The scatter plot matrix is a tool for quickly identifying pairs of quantities with interesting relationships
• Interesting correlations are easily visible
• Unbinned – no features lost
[Figure: toy jet resolution simulation]
R – QQ plots
• Studies show that human perception is poor at evaluating similar histograms
• Quantile-quantile plots are simpler to analyze
• We clearly see even a small difference: the second jet’s NoCs distribution has a larger high-end tail
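What a QQ plot shows can be sketched without any plotting: compute the same quantiles for both samples and pair them up. Points on the diagonal mean the two distributions agree; points off the diagonal at the high end indicate a larger tail. The example uses Python rather than R, and the two samples are invented for illustration (they are not the jet data of the talk).

```python
import statistics

def qq_points(sample_a, sample_b, n_quantiles=20):
    """Pairs (quantile of A, quantile of B) taken at the same probabilities."""
    qa = statistics.quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))

# Sample b equals sample a except that its top 10% is stretched upwards.
a = [float(i) for i in range(1, 101)]
b = [v if v <= 90 else 90 + 3 * (v - 90) for v in a]

points = qq_points(a, b)
for q_a, q_b in points:
    flag = "" if q_a == q_b else "  <- off the diagonal"
    print(f"{q_a:8.2f} {q_b:8.2f}{flag}")
```

Only the last two quantile pairs leave the diagonal, which is how the tail difference becomes visible even though histograms of the two samples would look almost identical.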
R
• An R session can be saved to disk and the application state recovered at a later time
• The saved session is platform neutral
• R can read many data formats:
  • Text files, common spreadsheet formats
  • Oracle, MySQL, SQLite or any ODBC database
  • DCOM and CORBA
  • Formats of other statistical packages
  • Even ROOT TTrees now: a local development at Fermilab allows reading “simple” trees
R
• Additional functionality comes in packages
• Users have all the tools to create and distribute their own packages
• Discovery and installation of new packages is easy
• A uniform documentation model is observed
• At this moment there are 590 add-on packages available in the main repository, CRAN
• Many of these packages present not just one tool, but a large family of tools
Goodness-of-Fit toolkit
• Maria Grazia Pia presented an update on the Goodness-of-Fit toolkit
• Algorithms for binned distributions:
  • Anderson-Darling test
  • Chi-squared test
  • Fisz-Cramer-von Mises test
  • Tiku test (Cramer-von Mises test in chi-squared approximation)
• Algorithms for unbinned distributions:
  • Anderson-Darling test
  • Cramer-von Mises test
  • Goodman test (Kolmogorov-Smirnov test in chi-squared approximation)
  • Kolmogorov-Smirnov test
  • Kuiper test
  • Tiku test (Cramer-von Mises test in chi-squared approximation)
• Goal: provide all 2-sample GoF tests existing in the statistical literature
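As an example of an unbinned 2-sample test, the Kolmogorov-Smirnov statistic is the largest distance between the two empirical cumulative distributions. The sketch below computes only the statistic (not the p-value) and is not taken from the toolkit; the function name is ours.

```python
def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)| over all x."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

print(ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6]))
```

Once one sample is exhausted its CDF is 1 and the other can only move closer to it, so the loop can stop there without missing the maximum.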
sPlot (1)
• A statistical tool to unfold data distributions
• To be added to ROOT soon
• Several publications from BaBar using sPlot
• physics/0402083, to be published in NIM
sPlot (2)
Data sifting
• A new algorithm for outlier detection and fitting, presented by Martin Block
• To be used in the case of a Gaussian signal with Gaussian errors, with outliers “far away” from the good points (no “swamping”)
• First, a Lorentzian minimization is performed for all data points
• Then this Lorentzian fit is used as the initial estimate of the theoretical curve, and the chi-square of each point w.r.t. this curve is computed
• A cut is applied to reject the points too far from the curve, and a chi-square fit of the remaining points is performed
• Parameters and parameter errors are estimated, with renormalization to take into account that the dataset has been truncated
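The three fitting steps can be sketched for a straight-line model. Several things here are assumptions and not taken from the slide: the Lorentzian objective is written as the sum of ln(1 + c r²) over the normalized residuals r, it is minimized by iteratively reweighted least squares, and the constant c = 0.179 and the cut at chi-square 6 are illustrative defaults. The final renormalization of the parameter errors is left out.

```python
def weighted_line_fit(x, y, w):
    """Weighted least-squares straight line y = a + b x; returns (a, b)."""
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    a = (sy - b * sx) / sw
    return a, b

def sift_and_fit(x, y, sigma, c=0.179, chi2_cut=6.0, n_iter=50):
    """Robust Lorentzian-type fit, chi-square cut, then ordinary chi-square refit."""
    # Step 1: minimize sum(ln(1 + c * r_i^2)) by iteratively reweighted least squares;
    # each point is down-weighted by 1 / (1 + c * r_i^2).
    w = [1.0 / s ** 2 for s in sigma]
    for _ in range(n_iter):
        a, b = weighted_line_fit(x, y, w)
        r2 = [((yi - a - b * xi) / s) ** 2 for xi, yi, s in zip(x, y, sigma)]
        w = [1.0 / (s ** 2 * (1.0 + c * r)) for s, r in zip(sigma, r2)]
    a, b = weighted_line_fit(x, y, w)
    # Step 2: chi-square of each point w.r.t. the robust curve; reject distant points.
    keep = [i for i, (xi, yi, s) in enumerate(zip(x, y, sigma))
            if ((yi - a - b * xi) / s) ** 2 <= chi2_cut]
    # Step 3: conventional chi-square fit to the surviving points.
    xs = [x[i] for i in keep]
    ys = [y[i] for i in keep]
    ws = [1.0 / sigma[i] ** 2 for i in keep]
    return weighted_line_fit(xs, ys, ws), keep

# Toy data: y = 1 + 2 x with small deterministic scatter and two far outliers.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi + (0.05 if i % 2 == 0 else -0.05) for i, xi in enumerate(x)]
y[3] += 5.0
y[7] -= 4.0
sigma = [0.1] * 10
(a_fit, b_fit), kept = sift_and_fit(x, y, sigma)
print(f"a = {a_fit:.3f}, b = {b_fit:.3f}, kept points: {kept}")
```

On the toy data the two displaced points are rejected and the refit recovers the line used to generate the remaining points.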
Astrophysics
Similarity to HEP
• HEP: Van de Graaff, cyclotrons, national labs, international (CERN), SSC vs LHC
• Optical astronomy: 2.5m telescopes, 4m telescopes, 8m class telescopes, surveys/time domain, 30-100m telescopes
• Similar trends with a 20 year delay:
  • fewer and ever bigger projects…
  • increasing fraction of cost is in software…
  • more conservative engineering…
• Can the exponential continue, or will it be logistic?
• What can astronomy learn from High Energy Physics?
Alex Szalay, Johns Hopkins University
Why is astronomy different?
• Especially attractive for the wide public
• It has no commercial value
  • No privacy concerns, freely share results with others
  • Great for experimenting with algorithms
• Data has more dimensions
  • Spatial, temporal, cross-correlations
• Diverse and distributed
  • Many different instruments from many different places and many different times
• Many different interesting questions
Alex Szalay, Johns Hopkins University
Data in astronomy
• Astronomers have a few hundred TB now
• Data doubles every year
• Data is public after 1 year
• Same access for everyone
Today’s questions
• Discoveries: need fast outlier detection
• Spatial statistics:
  • Fast correlation and power spectrum codes (CMB + galaxies)
  • Cross-correlations among different surveys (sky pixelization + fast harmonic transforms on sphere)
• Time-domain:
  • Transients, supernovae, periodic variables
  • Moving objects, ‘killer’ asteroids, Kuiper-belt objects…
Other challenges
• Statistical noise is smaller and smaller
  • Error matrix larger and larger (Planck…)
• Systematic errors becoming dominant
  • De-sensitize against known systematic errors
  • Optimal subspace filtering (…SDSS stripes…)
• Comparisons of spectra to models
  • 10^6 spectra vs 10^8 models (Charlot…)
• Detection of faint sources in multi-spectral images
  • How to use all information optimally (QUEST…)
• Efficient visualization of ensembles of 100M+ data points
Virtual Observatory
• International Virtual Observatory Alliance: formed in June 2002 with a mission to facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.
• Aim: all astronomical data accessible from a desktop
Virtual observatory
Summary for astronomy
• Databases became an essential part of astronomy: most data access will soon be via digital archives
• Data at separate locations, distributed worldwide, evolving in time: move analysis, not data!
• Good scaling of statistical algorithms is essential
• Many outstanding problems in astronomy are statistical and current techniques are inadequate: we need help!
• The Virtual Observatory is a new paradigm for doing science: the science of Data Exploration!
Conclusions
• The conference gave a good picture of the general trends in statistics for HEP and astronomy
• There are more interesting algorithms that their authors would like to see in ROOT; discussions are going on
• There are things that people find useful in other systems and that we don’t have in ROOT yet and should add in the near future
• A very interesting conference!
References
• Conference web site; talk slides are attached to the program: http://www.physics.ox.ac.uk/phystat05/programme.htm
• For more information on the subjects, see the recommended readings page: http://www.physics.ox.ac.uk/phystat05/reading.htm
• The conference proceedings are expected to be available online soon