Barbara Mascialino, INFN Genova An update on the Goodness of Fit Statistical Toolkit B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P. Viarengo

Barbara Mascialino, INFN Genova

An update on the An update on the Goodness of Fit Goodness of Fit

Statistical ToolkitStatistical Toolkit

B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P. Viarengo

http://www.ge.infn.it/geant4/analysis/HEPstatistics

http://www.ge.infn.it/statisticaltoolkit

4th Geant4 Space Users’ Workshop


Goodness of Fit testing

Regression testing– Throughout the software life-cycle

Online DAQ– Monitoring detector behaviour w.r.t. a reference

Simulation validation– Comparison with experimental data

Reconstruction– Comparison of reconstructed vs. expected distributions

Physics analysis– Comparison with theoretical distributions– Comparisons of experimental distributions

Goodness-of-fitGoodness-of-fit testing is the mathematical foundationmathematical foundation for the comparison of data distributionscomparison of data distributions

THEORETICALDISTRIBUTION

SAMPLE

ONE-SAMPLE PROBLEMONE-SAMPLE PROBLEM

SAMPLE 2SAMPLE 1

TWO-SAMPLE PROBLEMTWO-SAMPLE PROBLEM

Use cases in experimental physics


G.A.P Cirrone, S. Donadio, S. Guatelli, A. Mantero, B. Mascialino, S. Parlati, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo

“A Goodness-of-Fit Statistical Toolkit”IEEE- Transactions on Nuclear Science (2004), 51 (5): 2056-2063.

B. Mascialino, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo“New developments of the Goodness-of-Fit Statistical Toolkit”

IEEE- Transactions on Nuclear Science (2006), 53 (6), to be published

http://www.ge.infn.it/statisticaltoolkit/


Software process guidelines

Adopt a process– software quality

Unified ProcessUnified Process, specifically tailoredtailored to the project– practical guidance and tools from the RUPRUP– both rigorous and lightweight– mapping onto ISO 15504 (and CMM)

Incremental and iterative life-cycle

1st cycle: 2-sample GoF tests– 1-sample GoF in preparation


Architectural guidelinesThe project adopts a solid architectural architectural approach

– to offer the functionalityfunctionality and the qualityquality needed by the users– to be maintainablemaintainable over a large time scale– to be extensibleextensible, to accommodate future evolutions of the requirements

Component-based architectureComponent-based architecture– to facilitate re-use and integration in diverse frameworks– layer architecture pattern– core component for statistical computation– independent components for interface to user analysis environments

DependenciesDependencies– no dependence on any specific analysis tool– can be used by any analysis tools, or together with any analysis tools– offer a (HEP) standard (AIDA) for the user layer



The algorithms are specialised on the kind of distribution The algorithms are specialised on the kind of distribution (binned/unbinned)(binned/unbinned)


GoF algorithms in the Statistical Toolkit

Unbinned distributions – Anderson-Darling test– Anderson-Darling approximated test– Cramer-von Mises test– Generalised Girone test– Goodman test (Kolmogorov-Smirnov test in chi-squared approximation)– Kolmogorov-Smirnov test– Kuiper test– Tiku test (Cramer-von Mises test in chi-squared approximation)– Weighted Kolmogorov-Smirnov test (2 flavours)– Weighted Cramer-von Mises test

TWO-SAMPLE PROBLEM

Binned distributions – Anderson-Darling test– Anderson-Darling approximated test– Chi-squared test – Fisz-Cramer-von Mises test– Tiku test (Cramer-von Mises test in chi-squared approximation)

It is t

he most

complete so

ftware

for the co

mparison of tw

o distrib

utions,

even among

commerci

al/profess

ional statis

tics t

ools.

It provid

es all 2

-sample (e

df) GoF algorith

ms

existin

g in st

atistic

s lite

rature


Simple user layerSimple user layer

Shields the user from the complexity of the underlying algorithms and design

Only deal with the user’s analysis objectsanalysis objects and choice of comparison algorithmcomparison algorithmFirst release: user layer for AIDA analysis objectsFirst release: user layer for AIDA analysis objects

– LCG Architecture Blueprint, Geant4 requirementSecond release: added user layer for ROOT analysis objectsSecond release: added user layer for ROOT analysis objects

– in response to user requirements

User Layer


Which test to use?Which test to use?

Do weDo we really need really need such a wide such a wide collection of GoF tests? Why?collection of GoF tests? Why?

Which is the most appropriatemost appropriate test to compare two distributions?

How “goodgood” is a test at recognizing real equivalent distributions and rejecting fake ones?

The choice of the most suitable GoF test can be performed on the basis of two different criteria:

– Computational performance

– Statistical performance (power)

Which test

Which test

to use?to use?


AVERAGE CPU TIMEAVERAGE CPU TIME Binned Binned DistributionsDistributions

Unbinned Unbinned DistributionsDistributions

Anderson-Darling (0.69±0.01) ms (16.9±0.2) ms

Anderson-Darling (approximated) (0.60±0.01) ms (16.1±0.2) ms

Chi-squared (0.55±0.01) ms

Cramer-von Mises (0.44±0.01) ms (16.3±0.2) ms

Generalised Girone (15.9±0.2) ms

Goodman (11.9±0.1) ms

Kolmogorov-Smirnov (8.9±0.1) ms

Kuiper (12.1±0.1) ms

Tiku (0.69±0.01) ms (16.7±0.2) ms

Watson (14.2±0.1) ms

Weighted Kolmogorov-Smirnov (AD) (14.0±0.1) ms

Weighted Kolmogorov-Smirnov (Buning) (14.0±0.1) ms

Weighted Cramer-von Mises (14.0±0.1) ms

A) Performance of the GoF testsA) Performance of the GoF tests


B) Power of GoF testsB) Power of GoF tests

SystematicSystematic study of allall existing GoF tests in progress– made possible by the extensive collection of tests in the Statistical Toolkit– GoF tests power evaluated in a variety of alternative situations considered

No clear winner: clear winner: the statistical performance of a test depends on the features of the distributions to be compared (skewness and tailweight) and on the sample size

Practical recommendations1) first classify the type of the distributions in terms of skewness and

tailweight2) choose the most appropriate test given the type of distributions evaluating

the best test by means of the quantitative model proposed

Topic still subject to research activity in the domain of statistics

jkikjikjiiijk NTTSSPower )()( 2121

p<0.0001

General recipe

The power of a test is the probability of rejecting the null

hypothesis correctly


Examples of practical applications


Statistical Toolkit Usage

Geant4 physics validation– rigorous approach: quantitative evaluation of Geant4 physics models with

respect to established reference data– see for instance: K. Amako et al., Comparison of Geant4 electromagnetic

physics models against the NIST reference dataIEEE Trans. Nucl. Sci. 52- 4 (2005) 910-918

LCG Simulation Validation project– see for instance: A. Ribon, Testing Geant4 with a simplified calorimeter

setup, http://www.ge.infn.it/geant4/events/july2005

CMS– validation of “new” histograms w.r.t. “reference” ones in OSCAR

Validation Suite

Usage also in space science, medicine, statistics, etc.


Electron Stopping Power

H0 REJECTION AREA

p-value stability study

centre

Experimental set-up

Physics models under test:• Geant4 Standard• Geant4 Low Energy – Livermore• Geant4 Low Energy – Penelope

Reference data:• NIST ESTAR - ICRU 37

p-va

lue

Z

Geant4 LowE PenelopeGeant4 LowE Penelope Geant4 StandardGeant4 StandardGeant4 LowE EEDLGeant4 LowE EEDLNIST - XCOMNIST - XCOM

The three Geant4 models are equivalent

Geant4 LowE PenelopeGeant4 LowE Penelope Geant4 StandardGeant4 Standard

Geant4 LowE EEDLGeant4 LowE EEDL

Validation of Geant4 e.m. physics models vs. NIST reference data

2 test(to include data

uncertainties in the computation of the test statistics value)


Validation of Geant4 Atomic Relaxation vs NIST reference data

Shell-endShell-end Kolmogorov-Kolmogorov-Smirnov Smirnov DD

p-valuep-value

10 0.0192 1

11 0.0175 1

13 0.0250 1

14 0.0256 1

18 0.0294 1

19 0.0312 1

21 0.1429 0.997085

22 0.0588 1

Geant4

○ NIST

Fluorescence - Shell-start 3


Validation of Geant4 electromagnetic and hadronic

models against proton dataLow Energy EM – ICRU49: p, ions Low Energy EM – Livermore: , e-Standard EM : e+HadronElastic with BertiniElasticBertini InelasticBertini Inelastic

p-valuep-value

CvM KS AD

Left branch 0.977

Right branch 0.985

Whole curve 0.994

LowE EM – ICRU49

BertiniElastic

CvM Cramer-von Mises testKS Kolmogorov-Smirnov testAD Anderson-Darling test

0.5 M events

mm

Geant4Experimental data

Bertini Inelastic


2 not appropriate

(< 5 entries in some bins, physical

information would be lost if rebinned)

Anderson-Darling

Ac (95%) =0.752

Test beam at BessyTest beam at BessyBepi-Colombo missionBepi-Colombo mission

Energy (keV)

Cou

nts

X-ray fluorescence spectrum in Iceand basalt(EIN=6.5 keV)

Very complex distributions

Experimental measurements are comparable with Geant4 simulations


Ave

rage

ene

rgy

depo

sit (

MeV

)

Depth in the phantom (cm)

Average energy deposit of GCR p

GCR p

4 cm Al - Binary set

4 cm Al - Bertini10 cm water - Binary

10 cm water – EM4 cm Al – EM

10 cm water - Bertini

Comparison of alternative vehicle concepts in human missions to Mars

Reference: rigid structures as in the ISS (2 - 4 cm Al)

Kolmogorov-Smirnov test

Multi-layer + 10 cm water equivalent to 4 cm Al Multi-layer + 5 cm water equivalent to 2.15 cm Al

An inflatable habitat exhibits a shielding capability equivalent to a conventional rigid one

Shieldingmaterial

Energy deposited in phantom (MeV)

EM Bertini Binary

ML + 5 cm water 73.5 ± 0.3 130.2 ± 0.5 119.3 ± 0.4

ML + 10 cm water 71.9 ± 0.3 128.0 ± 0.5 117.3 ± 0.5

4 cm Al 72.9 ± 0.3 127.5 ± 0.5 117.0 ± 0.4

2.15 cm Al 73.9 ± 0.3 130.5 ± 0.5 119.3 ± 0.5Inflatable habitat vs

a conventional rigid habitat


ConclusionsA novel, complete software software toolkit for statistical analysis is being developed

– all the two-sample GoF tests available in statistical domain + chi-squared test– rigorous architectural design– rigorous software process

It is the most complete software for the comparison of two It is the most complete software for the comparison of two distributions, even among commercial/professional statistics tools.distributions, even among commercial/professional statistics tools.

A systematic study of the power of GoF tests is in progress– unexplored area of research

Application in various domains– Geant4, HEP, space science, medicine…

Feedback and suggestions are very much appreciated

Documents

Barbara Mascialino, INFN Genova An update on the Goodness of Fit Statistical Toolkit B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P. Viarengo