RooStatsCms: a tool for analyses modelling, combination and statistical studies

D. Piparo, G. Schott, G. Quast
Institut für Experimentelle Kernphysik, Universität Karlsruhe


Page 2: Outline


• The need for a tool

• RooStatsCms (RSC)

• A RooFit interlude

• The three parts
– Modelling
  • The datacard
  • Inspect your model
– Statistical studies and limits
  • Profile Likelihood
  • Hypothesis separation and “modified frequentist approach”
– Exclusion
– Plotting classes

19.11.08

Page 3: The need for a tool

• No pre-existing structured statistics software framework in CMS: G. Quast, G. Schott and D. Piparo developed RooStatsCms

NEEDS:

• Reliable implementation of multiple statistical methods

• Combine analyses:

– Stronger limits on quantities like Higgs production cross section, mass ...

• Do not replace existing analyses but complement their results

• Easy user interface

• Satisfactory documentation (no black boxes)

• Examples and tutorials

Page 4: RooStatsCms

• Originally conceived for the CMS Higgs Working Group, as a CMS (EKP) exclusive product

• Based on RooFit (part of the ROOT distribution)

• Three parts:

– Modelling and combination
– Statistical methods
– Advanced graphics routines

• It comes with CINT dictionaries (macros, interactive ROOT).

• Available to CMS and EKP at: www-ekp.physik.uni-karlsruhe.de/~RooStatsCms

– Visit our wiki for username and password
– Statistical methods and graphics routines are public: www-ekp.physik.uni-karlsruhe.de/~RooStatsKarlsruhe

• Big effort for documentation:

1. RSC website and Doxygen of every class, method and member

2. Wiki pages with links to RSC presentations (~15) and workshop
• https://twiki.cern.ch/twiki/bin/view/CMS/HiggsWGRooStatsCms
• http://www-ekp.physik.uni-karlsruhe.de/~twiki/bin/view/EkpCms/RooStatsCms

3. An internal CMS note in preparation

Page 5: RooStatsCms - structure 1/2

• Already 33 classes!

• All of them inherit from TObject: persistency and reflection

• Moreover:

– Programs to compile
– Macros for the interpreter
– Various utilities in the Rsc namespace (TH1F median, ...)

• Class design-wise structure

Page 6: RooStatsCms - structure 2/2

Directory-wise structure

Directory    Description
doc          Links to the documentation
bin          Executables after the make exe command (see progs dir)
interface    Header files
lib          The library after the make command: libRooStatsCms.so
macros       The macros for CINT
progs        C++ programs to be compiled and linked against the library
scripts      Utility scripts: python card maker, doxy, environment
src          The sources
test         …well, the directory for the tests!

• Structure “À la CMSSW”: ready to compile in the CMS framework with a newer RooFit

Page 7: RooFit interlude: overture

• Toolkit for data modeling
• Model the distribution of an observable x in terms of
– parameter of interest p
– other parameters q to describe detector effects (resolution, efficiency)
– probability density function (pdf) F(x; p, q)
– normalized over the range of the observable x w.r.t. the parameters p and q

• RooFit provides the functionality for
– building these probability density functions
  • scalable to complex models
– maximum likelihood fitting (binned and unbinned)
– visualization of the pdf
– toy MC generator

Page 8: RooFit interlude: functionality

• Package originally developed for BaBar analysis (by W. Verkerke and D. Kirkby)
– actively maintained by W. Verkerke in view of LHC analysis
– Web site: http://roofit.sourceforge.net
– Much of the material shown is taken from Wouter’s presentations
  • see the 200 slides presented at the French statistics school (http://sos.in2p3.fr)
• Users Manual on the ROOT site: ftp://root.cern.ch/root/doc/RooFit_Users_Manual_2.91-33.pdf

Page 9: RooFit interlude: design

• Mathematical entities are represented as C++ objects

Page 10: RooFit interlude: an example

• Gaussian Pdf

• MC data generation

• Maximum likelihood fit on data
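The three steps listed on this slide can be illustrated without ROOT. The sketch below is a standard-library stand-in, not RooFit code: in RooFit the same steps would use a `RooGaussian`, its `generate()` method and `fitTo()`. The names `GaussFit`, `generateToys` and `fitGaussianML` are hypothetical, and the fit uses the closed-form maximum likelihood estimators that exist for a Gaussian rather than a numerical minimiser.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper type: ML estimates of the Gaussian parameters.
struct GaussFit {
    double mean;
    double sigma;
};

// Step 1 and 2: define a Gaussian pdf and generate n toy MC events from it.
std::vector<double> generateToys(double mean, double sigma,
                                 std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(mean, sigma);
    std::vector<double> data(n);
    for (double& x : data) x = gauss(rng);
    return data;
}

// Step 3: unbinned maximum likelihood fit. For a Gaussian the likelihood
// is maximised by the sample mean and the 1/N sample standard deviation.
GaussFit fitGaussianML(const std::vector<double>& data) {
    double sum = 0.0;
    for (double x : data) sum += x;
    const double mean = sum / data.size();
    double sq = 0.0;
    for (double x : data) sq += (x - mean) * (x - mean);
    return {mean, std::sqrt(sq / data.size())};
}
```

Calling `fitGaussianML(generateToys(0.0, 3.0, 10000, 42))` should return estimates close to the generated mean and width, with the usual statistical scatter.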

Page 11: RSC: A solid tool

• RSC is in “production phase”:
– Around since the beginning of 2008
– Workshop at CERN in June
– Approved results: http://cms-physics.web.cern.ch/cms-physics/public/HIG-08-008-pas.pdf
– Results coming soon: HIG-008-06 HWW
– CMS statistics committee blessed the tool (internal note in preparation)
  • Grégory in permanent contact with them
– Interest of other working groups
– Negotiations for integration in the CMS software framework (CMSSW)
– Base of a common tool with ATLAS
  • Work in progress: first commits in ROOT are taking place
– New manpower: Mario Pelliccioni (formerly BaBar) from Università di Torino

• Made in EKP (Quast, Schott, Piparo):
– Personal assistance on the 8th floor!

Page 12: RSC: Is it hard to try?

Straightforward to get started on ekpcms3:

wget -O RooStatsCms.tar.gz http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/CMSSW/HiggsAnalysis/RooStatsCms.tar.gz?view=tar\&pathrev=V00-04-00

tar -zxf RooStatsCms.tar.gz
cd RooStatsCms
source /home/piparo/set_root_RSC_environment.sh
source scripts/RSCenv.sh
make
make exe
cd macros/examples/
root profilelikelihood_htt.cxx
root qqhtt_-2lnQ_distributions.cxx

See also: www-ekp.physik.uni-karlsruhe.de/~RooStatsCms for detailed instructions

Page 13: RSC in one slide

Statisticians

A priori, I frequently

believe I am in between ...

... RooStatsCms tries to put you somehow “in between”...

Page 14: The Three Parts

• Analyses modeling and combination

• Statistical Methods and limits

• Graphics routines

Page 15: Analyses modeling and combination

• Modeling based on the datacard concept

• Build a complete combined analysis model from ASCII datacards

– Background and signal components of each analysis

– Shapes from parametrisation or histos

– Constraints and their correlations

– Basic syntax: include, if ...

– Two lines of C++ to produce the RooFit Pdf

• Datacard advantages:

– Automatic bookkeeping of what is done

– Factorise model from C++ code

– Easy to share

RscCombinedModel mymodel("hzz4l");
RooAbsPdf* sb_pdf = mymodel.getPdf();

ASCII Card2 analyses

Page 16: RSC – Modelling 2/2

• Yields can be expressed as products of different terms. For example:
– Branching ratios
– Efficiencies
– Cross section
– Luminosity

• Each term: systematics can be included

• Relate terms from one analysis to the other with correlations

Yield = BR · ε · σH · Lumi
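As a small worked illustration of the yield product, the sketch below multiplies the four factors and, assuming the systematic uncertainties of the factors are uncorrelated, combines their relative uncertainties in quadrature. Both function names are hypothetical and this is not how RSC itself treats systematics (RSC attaches constraints to the terms in the datacard).

```cpp
#include <cmath>
#include <vector>

// Yield = BR * efficiency * cross section * luminosity.
// Units are the caller's responsibility (e.g. cross section in fb, lumi in fb^-1).
double expectedYield(double branchingRatio, double efficiency,
                     double crossSection, double luminosity) {
    return branchingRatio * efficiency * crossSection * luminosity;
}

// Relative uncertainty on a product of factors, assuming the relative
// uncertainties of the individual factors are uncorrelated.
double relativeUncertainty(const std::vector<double>& relUncertainties) {
    double sumSq = 0.0;
    for (double u : relUncertainties) sumSq += u * u;
    return std::sqrt(sumSq);
}
```

For example, relative uncertainties of 3% and 4% on two factors combine to 5% on the yield.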

Page 17: An example datacard: counting

#################################
# The combined model
#################################
// Here we specify the names of the models
// built down in the card that we want
// to be combined

include HZZ_4mu.rsc
include HZZ_4e.rsc
include HZZ_2mu2e.rsc

[hzz4l]
    model = combined
    components = hzz_4mu, hzz_4e, hzz_2mu2e

#################################
# H -> ZZ -> 4mu
#################################

[hzz_4mu]
    variables = x
    x = 0 L(0 - 1)

[hzz_4mu_sig]
    hzz_4mu_sig_yield = 62.78 L(0 - 200)

[hzz_4mu_sig_x]
    model = yieldonly

[hzz_4mu_bkg]
    yield_factors_number = 2

    yield_factor_1 = scale
    scale = 1 L (0 - 3)
    scale_constraint = Gaussian,1,0.041

    yield_factor_2 = bkg_4mu
    bkg_4mu = 19.93 C

[hzz_4mu_bkg_x]
    model = yieldonly

The combined model

The variable

Signal component description:

- Yield

- Model

Background component description: yield made of different terms.

See RscBaseModel and RscCombinedModel documentation for a complete description

Constraints syntax: <type>,par1,par2

Basic syntax

Comment

Page 18: An example datacard: shapes

// The combined model of HZZ and Hgg

include hzz_combined.rsc
include hgg_12_categories.rsc

[hgg_hzz_combined]
    model = combined
    components = hzz, hgg_cat0, hgg_cat1,..., hgg_cat11

[hgg_cat0]
    variables = mh
    mh = 115 L(90 - 180) // [GeV/c^{2}]

[hgg_cat0_sig]
    yield_factors_number = 3
    yield_factor_1 = lumi
    lumi = 1 C
    yield_factor_2 = n_events_hgg_115_cat0_sig
    n_events_hgg_cat0_sig = 3.9577
    yield_factor_3 = scale_sig
    scale_sig = 1 L (0 - 5)

[hgg_cat0_sig_mh]
    model = fourGaussians
    hgg_115_cat0_sig_mh_mean1 = 114.654 +/- 0.107106 C
    hgg_115_cat0_sig_mh_mean2 = 115.146 +/- 2.37687 C
    hgg_115_cat0_sig_mh_mean3 = 114.12 +/- 0.581539 C
    hgg_115_cat0_sig_mh_mean4 = 109.979 +/- 11.036 C
    hgg_115_cat0_sig_mh_sigma1 = 0.6075 +/- 0.0888951 C
    hgg_115_cat0_sig_mh_sigma2 = 0.601995 +/- 129.141 C
    hgg_115_cat0_sig_mh_sigma3 = 2.1119 +/- 0.526549 C
    hgg_115_cat0_sig_mh_sigma4 = 8.16619 +/- 7.75118 C
    hgg_115_cat0_sig_mh_frac1 = 0.999893 +/- 0.500053 C
    hgg_115_cat0_sig_mh_frac2 = 0.762761 +/- 0.0870296 C
    hgg_115_cat0_sig_mh_frac3 = 0.98815 +/- 0.0207781 C

1. Combination of combined models

2. Counting combined with shape analyses

[hgg_cat0_bkg]
    number_components = 2
    yield_factors_number = 3
    yield_factor_1 = lumi
    lumi = 1 C
    yield_factor_2 = n_events_hgg_115_cat0_bkg
    n_events_hgg_cat0_bkg = 988.389
    yield_factor_3 = scale_bkg
    scale_bkg = 1 L (0 - 5)

[hgg_cat0_bkg1]
    qqhtt_bkg1_yield = 1 C

[hgg_cat0_bkg2]
    qqhtt_bkg2_yield = 1.35 C

[hgg_cat0_bkg1_mh]
    model = doubleGaussian
    hgg_cat0_bkg_mh_mean1 = 52.3484 +/- 14.1593 C
    hgg_cat0_bkg_mh_mean2 = 158.962 +/- 3.21153 C
    hgg_cat0_bkg_mh_sigma1 = 27.1791 +/- 2.37455 C
    hgg_cat0_bkg_mh_sigma2 = 74.9328 +/- 70.6298 C
    hgg_cat0_bkg_mh_frac = 0.924937 +/- 0.0347411 C

[hgg_cat0_bkg2_mh]
    model = histo
    hgg_cat0_bkg2_mh_fileName = htt_inputs.root
    hgg_cat0_bkg2_mh name = background

Comment

Multiple components

Histogram and parametric models mixed

Page 19: A combination

• Combination of CMS H→γγ and H→ZZ (3 modes) at 30 fb⁻¹

• Perform a simultaneous analysis of Higgs channels:
- for each analysis, each data sample is fitted simultaneously with its own signal and background model
- combination of number-counting and distribution-based analyses

• Significance: sqrt(2lnQ)
• Various analyses
• Comparison between PTDR and RSC
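To make the estimator sqrt(2lnQ) concrete, here is a minimal sketch for the simplest case, a single Poisson counting experiment with expected signal s and background b, where Q = Lsb/Lb. It assumes no systematics and evaluates the estimator at the expected s+b outcome (n = s + b). The function names are hypothetical and not part of RSC.

```cpp
#include <cmath>

// ln Q = ln(L_sb / L_b) for n observed events in a Poisson counting
// experiment with expected signal s and background b (b > 0).
// The n! terms cancel in the ratio, leaving n*ln(1 + s/b) - s.
double lnQ(double n, double s, double b) {
    return n * std::log1p(s / b) - s;
}

// S = sqrt(2 ln Q), evaluated at the expected s+b outcome n = s + b.
double expectedSignificance(double s, double b) {
    const double twoLnQ = 2.0 * lnQ(s + b, s, b);
    return twoLnQ > 0.0 ? std::sqrt(twoLnQ) : 0.0;
}
```

For s much smaller than b this reduces to the familiar s/sqrt(b), and it stays slightly below that value as s grows.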

Page 20: More on constraints

[combined_120_constraints_block_1]
    correlation_variable1 = hww_mm_120_bkg_yield
    correlation_variable2 = hww_ee_120_bkg_yield
    correlation_variable3 = hww_em_120_bkg_yield

    correlation_value1 = 0.80 C
    correlation_value2 = 0.72 C
    correlation_value3 = 0.15 C

[combined_120_constraints_block_2]
............

• “Same name, same pointer” principle (100% correlation)
– Same name in the card → same object in the model
– Common luminosity, cross-sections

• Partial correlation among Gaussian constraints: constraints block

Correlated Variables

Correlation Coefficients

As many blocks as needed!
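A partially correlated pair of Gaussian constraints can be pictured as a penalty term added to the negative log-likelihood. The sketch below writes the textbook bivariate Gaussian penalty for two parameters with correlation coefficient rho. It is an illustration under that assumption only: the function name is hypothetical and the source does not show how the RSC constraint-block classes are implemented.

```cpp
#include <cassert>
#include <cmath>

// Negative log-likelihood penalty (up to an additive constant) for two
// Gaussian-constrained parameters x1, x2 with nominal values mu1, mu2,
// widths sigma1, sigma2 and correlation coefficient rho (|rho| < 1).
double correlatedGaussPenalty(double x1, double mu1, double sigma1,
                              double x2, double mu2, double sigma2,
                              double rho) {
    assert(std::fabs(rho) < 1.0);
    const double z1 = (x1 - mu1) / sigma1;  // pull of parameter 1
    const double z2 = (x2 - mu2) / sigma2;  // pull of parameter 2
    return 0.5 * (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2)
               / (1.0 - rho * rho);
}
```

With rho = 0 the penalty is the sum of two independent Gaussian terms. With positive rho, pulls in the same direction are penalised less than they would be if the parameters were independent.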

Page 21: Analyses model structure

RscBaseModel: basic distributions (Histo, Gauss, Poly, My model)

RscMultiModel: model for each discriminating variable (Variable 1, Variable 2)

RscCompModel: different components for signal(s) and background(s) (Signal, Bkg1, Bkg2, Bkg3)

RscTotModel: the full analysis (Analysis 1)

RscCombinedModel: analysis combination (Combination)

Statistical Methods

Page 22: Inspect your model

Two programs to use:

• Model Diagram: creates a simple graph of the combined model
– model_diagram.exe <cardname> <modelname>

• Model Html: creates a website to browse your combined model
– model_html.exe <cardname> <modelname>

Page 23: The Three Parts

• Analyses modeling and combination

• Statistical Methods and limits

• Graphics routines

Page 24: Profile Likelihood - 1/2

Page 25: Profile Likelihood – 2/2

• Intersection with horizontal lines gives upper limits / two-sided intervals

– W.J. Metzger “Statistical Methods in Data Analysis”, Katholieke Universiteit Nijmegen, 2002.

• Systematics taken into account with penalty terms in the Likelihoods (profiling)

Likelihood scan: l maximised for each point

Interpolated scan minimum

Horizontal cuts

See PLCalcuator, PLResults, PLPlot documentation

• Minuit uses this technique to obtain the errors on the fitted parameters

• Significance estimator: S=sqrt(2ln(Lsb/Lb))

→ if θ0 is the number of signal events, the scan value at 0 is directly related to S!

θ0 at minimum: 7.16 +8.1 −5.37

Page 26: Systematics - 1/2

Page 27: Systematics - 2/2

Page 28: A PL prototype study

• A prototype study: distribution of upper limits using PL and a coverage study

• Many pseudo experiments performed for each mass hypothesis

– Distribution of upper limits obtained

– Coverage: fraction of experiments in which the upper limit is indeed greater than the parameter nominal value

– Easy to do: store PLResults objects in a TTree and loop on it.

Overcoverage for low yields:

• Well known feature of the method (Cramér-Fréchet Bound)

• “Calibrate” the Likelihood
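The coverage definition given above translates directly into a counting loop. The sketch below assumes the upper limits from the pseudo-experiments have already been collected into a vector (in RSC they would come from the stored results objects). The function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Coverage: fraction of pseudo-experiments whose upper limit is at or
// above the nominal (true) value of the parameter.
double coverage(const std::vector<double>& upperLimits, double trueValue) {
    if (upperLimits.empty()) return 0.0;
    std::size_t covered = 0;
    for (double ul : upperLimits) {
        if (ul >= trueValue) ++covered;
    }
    return static_cast<double>(covered) / upperLimits.size();
}
```

A method that is set to 95% CL and returns a coverage well above 0.95 on the toys is overcovering, which is the behaviour reported here for low yields.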

Page 29: Separation of Hypotheses

• Analysis of search results can be formulated as separation of hypotheses:
– Identify the observable which comprises the result
– Specify a test statistic
– Define rules for discovery and exclusion

• Use the likelihood ratio, Q=Lsb/Lb, of the signal+background (“s+b”) and background-only (“b”) hypotheses as test statistic.

• Consider “P-values” (also called CLS+B, 1-CLB) of -2lnQ distributions obtained from s+b and b samples

Bayesian pseudo-integration of systematics:

For every toy MC experiment, before the generation of the toy dataset, parameters affected by systematics are properly fluctuated once.

Distributions built with toy MC experiments

(LimitCalculator-HybridCalculator Class)

CLsb

1-CLb

See:

• progs/m2lnq_creator.cpp

• qqhtt_-2lnQ_distributions.cxx in macros/examples/
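The toy procedure described on this slide can be sketched for a counting experiment: for each toy, the background expectation is fluctuated once according to its systematic uncertainty, then the toy dataset is generated and -2lnQ is computed with the nominal s and b. This is a minimal illustration with hypothetical names, assuming a Gaussian smearing of the background only. It is not the code in progs/m2lnq_creator.cpp.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// -2 ln Q for a counting experiment, with Q = L_sb / L_b.
double m2lnQ(double n, double s, double b) {
    return 2.0 * (s - n * std::log1p(s / b));
}

// Toy distribution of -2lnQ under the s+b (withSignal = true) or
// b-only hypothesis. relSysB is the relative systematic on b.
std::vector<double> toyM2lnQ(double s, double b, double relSysB,
                             bool withSignal, std::size_t nToys,
                             unsigned seed) {
    std::mt19937 rng(seed);
    // The distribution needs a positive width; it is unused if relSysB <= 0.
    std::normal_distribution<double> smear(1.0, relSysB > 0.0 ? relSysB : 1.0);
    std::vector<double> out;
    out.reserve(nToys);
    for (std::size_t i = 0; i < nToys; ++i) {
        // Fluctuate the nuisance parameter once, before generating the toy.
        const double factor = relSysB > 0.0 ? smear(rng) : 1.0;
        const double bToy = std::max(1e-9, b * factor);
        std::poisson_distribution<long> pois(bToy + (withSignal ? s : 0.0));
        // The test statistic itself uses the nominal s and b.
        out.push_back(m2lnQ(static_cast<double>(pois(rng)), s, b));
    }
    return out;
}
```

The s+b toys populate lower (more signal-like) values of -2lnQ than the b-only toys, and the tail fractions of the two distributions relative to the observed value give CLsb and 1-CLb.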

Page 30: Modified frequentist method – Significance

• CLB: background CL, measure of the compatibility of the experiment with the B-only hypothesis

• 1 – CLB: probability for a B-only experiment to give a more S+B-like likelihood ratio than the observed one

• Correspondence between CLB and the resulting significance (Gaussian approximation):
- number of standard deviations of an (assumed) Gaussian distribution of the background
- take CLB assuming the expected s+b yield (i.e. median -2lnQ of the s+b distribution)

• CLS+B: measure of the compatibility of the experiment with the S+B hypothesis. If CLS+B is small (< 5%), the S+B hypothesis can be excluded at more than 95% CL, but this does not mean that the signal hypothesis is excluded at that level

Modified frequentist approach: take CLS, the signal confidence level, to be CLS ≡ CLS+B / CLB (heavily used by LEP, HERA and TEVATRON experiments)

$n = \sqrt{2}\,\mathrm{Erf}^{-1}(2\,CL_B - 1)$
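The two quantities on this slide are easy to compute once CLsb and CLb are known. The sketch below assumes the one-sided Gaussian convention, in which CLb equals the Gaussian cumulative distribution at n standard deviations, i.e. n = sqrt(2) · Erf⁻¹(2·CLb − 1). Since `<cmath>` offers no inverse error function, the inversion is done by bisection. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Modified frequentist confidence level: CLs = CLsb / CLb.
double cls(double clsb, double clb) {
    assert(clb > 0.0);
    return clsb / clb;
}

// One-sided Gaussian cumulative distribution at n standard deviations.
double gaussCdf(double n) {
    return 0.5 * std::erfc(-n / std::sqrt(2.0));
}

// Number of standard deviations n such that gaussCdf(n) = CLb,
// solved by bisection on the monotonic gaussCdf.
double significanceFromCLb(double clb) {
    assert(clb > 0.0 && clb < 1.0);
    double lo = -40.0, hi = 40.0;
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (gaussCdf(mid) < clb) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

In this convention CLb = 0.5 corresponds to zero significance, and a CLsb of 0.02 with a CLb of 0.5 gives CLs = 0.04, which is below the 0.05 threshold used for a 95% exclusion.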

Page 31: The benchmark analysis: H→ττ

• Used as benchmark for the tool
• Results approved by the CMS collaboration
• Vector boson fusion H→ττ at 1 fb⁻¹
• Small signal on a significant background
• No discovery expected with this luminosity
• Four mass hypotheses:
– 115, 125, 135, 145 GeV

Mass [GeV]   N Sig (12% sys)   N Bkg (30% sys)
115          1.6               45.2
125          1.4               45.2
135          1.1               45.2
145          0.6               45.2

Page 32: H→ττ: Significance

• Significance calculated for the H→ττ analysis using CLb

• In this case significance does not tell us much.

• The question becomes:

“Which production cross section can I exclude with the data I have?”

Page 33: Modified Frequentist method – Exclusion

• Assume the expected background is observed (i.e. the median of the background distribution) and no signal
• Amplify the SM production cross section by the factor necessary to obtain CLs=0.05

→ “95% exclusion”

Bands:
• Assume Nb + n · sqrt(Nb) is observed, where n=2,1,-1,-2 for the -2,-1,1,2 sigma band borders respectively
• Systematics taken into account in the distributions of -2lnQ (marginalisation)

Obtained with real data

Less exclusion power than expected

More exclusion power than expected

~ 80 h on one CPU

ExclusionBandPlot Class

Page 34: How do I find the right ratio?

RSC provides help:
• RatioFinder
• RatioFinderResults
• RatioFinderPlot

Just compile and launch the job(s)!

CLs = 0.05

Page 35: Another representation of the information

• Use the distributions of the test statistic.
• At a glance, see how the hypotheses are separated.
• For each mH: projection of the -2lnQ distribution in the B-only hypothesis.

Page 36: Statistical Methods: class structures

• Organisation of the classes of statistical methods:

Statistical methods (mother: StatisticalMethod): LimitCalculator (aka HybridCalculator), PLScan, FCCalculator

Statistical methods results (mother: StatisticalResult): LimitResults (aka HybridResults), PLScanResults, FCResults

Statistical plots (mother: StatisticalPlot): LimitPlot (aka HybridPlot), PLScanPlot (add also FC curves)

Constraints (mother: NLLPenalty): Constraint, ConstrBlock2, ConstrBlock3, ConstrBlockArray

Further plot classes: LEPBandPlot, ExclusionBandPlot

+ : “Sum” the results: batch/GRID job submission easier

Page 37: The Three Parts

• Analyses modeling and combination

• Statistical Methods and limits

• Graphics routines

Page 38: Plots collection

Page 39: Troubleshooting

Q: I want to start now. Where do I find the examples?

A: In the macros directory you find the macros for the interpreter, while the progs directory holds the programs to compile with the make exe command.

Q: I think I do not know how to write a datacard. What can I do?

A: In the macros directory you find some datacards for inspiration. Also check the scripts in the scripts directory: create_card_skeleton.py queries for templated card components and TDR_HZZ_card_maker.py creates the CMS PTDR H→ZZ→4l cards.

Q: I compiled RSC but ROOT does not see the dynamic library libRooStatsCms.so. What do I do?

A: Add the /RooStatsCms/lib directory to your LD_LIBRARY_PATH environment variable. The scripts directory contains the RSCenv.sh script to set up your environment. Then, in the interpreter, use the command gSystem->Load("libRooStatsCms.so").

“Q”: Still... I cannot get it to work!

A: Come down to the eighth floor for support!

Page 40: Conclusions

• Intuitive “model factory”
– Build the analysis model from an ASCII configuration file, the datacard
– Datacard also describes nuisance parameters (and correlations)
– Building of a combined model for a combined analysis

• Implementation of nuisance parameters and correlations
– Can be marginalised or profiled

• Statistical methods
– LimitCalculator (CLb, CLsb, CLs): complete*
– PLScan (Profile Likelihood): complete*
– FCCalculator (fully frequentist approach): validation to complete
– Bayesian approach and Markov chains: being investigated

* Strong implementation, tested and used by CMS analyses

• Batch friendly: decomposition in sub-jobs; results stored in ROOT files– Results can be merged and exploited by results classes

• Plots in a “presentation ready” form easily obtainable
