Machine Learning with TMVA: A ROOT-based Tool for Multivariate Data Analysis
DESY Computing Seminar, Hamburg, 14.1.2008
The TMVA developer team: Andreas Höcker, Peter Speckmeyer, Jörg Stelzer, Helge Voss
General Event Classification Problem
Event described by k variables (found to be discriminating): x = (x_1,…,x_k) ∈ ℝ^k
Events can be classified into n categories: H_1 … H_n
General classifier: f: ℝ^k → ℕ, x ↦ {1,…,n}
TMVA: only n = 2, commonly the case in HEP (signal/background)
Most classification methods: f: ℝ^k → ℝ^d, x ↦ y
Further: ℝ^d → ℕ, y ↦ {1,…,n}
TMVA: d = 1; y ≥ y_sep: signal, y < y_sep: background
[Figure: scatter plot in the (x1, x2) plane with three categories H1, H2, H3; example with k = 2, n = 3]
General Event Classification Problem
Example: k=2 variables x1,2, n=3 categories H1, H2, H3
The problem: How to draw the boundaries between H1, H2, and H3 such that f(x) returns the true nature of x with maximum correctness
[Figure: three panels in the (x1, x2) plane sketching possible decision boundaries between H1, H2, H3: non-linear boundaries, linear boundaries, rectangular cuts]
A simple example: one can draw the boundaries by hand.
Large input variable space, complex correlations: manual optimization very difficult
Two general ways to build f(x):
- Supervised learning: the category of each event in the training sample is known. The machine adapts to give the smallest misclassification error on the training sample.
- Unsupervised learning: the correct category of each event is unknown. The machinery tries to discover structures in the dataset.
All classifiers in TMVA are supervised learning methods
General Event Classification Problem
1. What is the optimal boundary f(x) to separate the categories?
2. More pragmatic: which classifier is best at finding (or most closely estimating) this optimal boundary?
(This adaptation from data is what is meant here by "machine learning".)
Classification Problems in HEP
In HEP mostly two-class problems: signal (S) and background (B)
- Event level (Higgs searches, …)
- Cone level (tau-vs-jet reconstruction, …)
- Track level (particle identification, …)
- Lifetime and flavour tagging (b-tagging, …)
- …
Input information:
- Kinematic variables (masses, momenta, decay angles, …)
- Event properties (jet/lepton multiplicity, sum of charges, …)
- Event shape (sphericity, Fox-Wolfram moments, …)
- Detector response (silicon hits, dE/dx, Cherenkov angle, shower profiles, muon hits, …)
- …
Classifiers in TMVA
Rectangular Cut Optimization
Intuitive and simple: rectangular volumes in variable space
Technical challenge: cut optimization
- MINUIT fit (simplex): was found not to be reliable
- Monte Carlo sampling: random scanning of the parameter space; inefficient for large numbers of input variables
- Genetic algorithm: the preferred method. Samples of cut-sets (a population) are evaluated, and the fittest individuals are cross-bred (including mutation) to create a new generation (see the sketch below). The genetic algorithm can also be used as a standalone optimizer outside the TMVA framework.
- Simulated annealing: its performance still needs optimization. It simulates the slow cooling of metal; a temperature-dependent perturbation probability allows recovery from local minima.
Cuts usually benefit from prior decorrelation of cut variables
The cut classifier returns a binary answer:
$$y_\mathrm{cut}(i) \in \{0,1\}, \qquad y_\mathrm{cut}(i) = 1 \iff x_v(i) \in [x_{v,\mathrm{min}},\, x_{v,\mathrm{max}}] \ \ \forall\, v \in \{\mathrm{variables}\}$$
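As a rough illustration of the genetic-algorithm idea (this is not TMVA's implementation): a population of cut windows on one toy variable is scored, the fittest half survives, and new individuals are cross-bred with mutation. The fitness choice S/sqrt(S+B) and all names here are illustrative assumptions.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Individual { double lo, hi, fitness; };

// fitness of a cut window [lo, hi]: here S/sqrt(S+B) (illustrative choice)
double fitness(double lo, double hi, const std::vector<double>& sig,
               const std::vector<double>& bkg) {
  double s = 0, b = 0;
  for (double x : sig) if (x > lo && x < hi) ++s;
  for (double x : bkg) if (x > lo && x < hi) ++b;
  return (s + b > 0) ? s / std::sqrt(s + b) : 0.0;
}

double rnd(double a, double b) { return a + (b - a) * std::rand() / RAND_MAX; }

int main() {
  std::vector<double> sig, bkg;                       // toy samples
  for (int i = 0; i < 1000; ++i) { sig.push_back(rnd(0.5, 1.5)); bkg.push_back(rnd(-3, 3)); }
  std::vector<Individual> pop(50);                    // population of cut-sets
  for (auto& p : pop) { p.lo = rnd(-3, 3); p.hi = rnd(p.lo, 3); }
  for (int gen = 0; gen < 100; ++gen) {
    for (auto& p : pop) p.fitness = fitness(p.lo, p.hi, sig, bkg);
    std::sort(pop.begin(), pop.end(),
              [](const Individual& a, const Individual& b) { return a.fitness > b.fitness; });
    // cross-breed the fittest half (here: average two parents) with mutation
    for (size_t i = pop.size() / 2; i < pop.size(); ++i) {
      const Individual& ma = pop[std::rand() % (pop.size() / 2)];
      const Individual& pa = pop[std::rand() % (pop.size() / 2)];
      pop[i].lo = 0.5 * (ma.lo + pa.lo) + rnd(-0.1, 0.1);
      pop[i].hi = 0.5 * (ma.hi + pa.hi) + rnd(-0.1, 0.1);
    }
  }
  std::printf("best cut: %.2f < x < %.2f\n", pop[0].lo, pop[0].hi);
  return 0;
}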
Projective Likelihood Estimator (PDE)
Probability density estimators for each input variable are combined into a likelihood estimator.
Optimal MVA approach if the variables are uncorrelated; in practice this is rarely the case. Solution: de-correlate the input or use a different method.
Reference PDFs are automatically generated from the training data: histograms (counting), splines (order 2, 3, 5), or an unbinned kernel estimator.
The output of the likelihood estimator is often strongly peaked at 0 and 1. To ease output parameterization, TMVA applies an inverse Fermi transformation.
$$y_\mathcal{L}(i) = \frac{\mathcal{L}_S(i)}{\mathcal{L}_S(i) + \mathcal{L}_B(i)}, \qquad \mathcal{L}_{S/B}(i) = \prod_{v\,\in\,\{\mathrm{variables}\}} p_{S/B,v}\big(x_v(i)\big)$$
where the $p_{S/B,v}$ are the reference PDFs.
Inverse Fermi transformation:
$$y'_\mathcal{L}(i) = -\tau^{-1} \ln\big(y_\mathcal{L}^{-1}(i) - 1\big)$$
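To make the formulas above concrete, here is a minimal C++ sketch (not TMVA's code): the per-variable reference PDFs are assumed to be pre-filled, uniformly binned histograms, and the steepness parameter tau of the Fermi transform is an illustrative choice.

#include <cmath>
#include <vector>

// evaluate a binned PDF (uniform binning on [xmin, xmax])
double pdfEval(const std::vector<double>& bins, double x, double xmin, double xmax) {
  int n = (int)bins.size();
  int i = (int)((x - xmin) / (xmax - xmin) * n);
  if (i < 0) i = 0;
  if (i >= n) i = n - 1;
  return bins[i];
}

// y_L = L_S / (L_S + L_B), with L = product of per-variable PDF values
double likelihoodRatio(const std::vector<double>& x,
                       const std::vector<std::vector<double>>& pdfS,
                       const std::vector<std::vector<double>>& pdfB,
                       double xmin, double xmax) {
  double lS = 1.0, lB = 1.0;
  for (size_t v = 0; v < x.size(); ++v) {
    lS *= pdfEval(pdfS[v], x[v], xmin, xmax);
    lB *= pdfEval(pdfB[v], x[v], xmin, xmax);
  }
  return lS / (lS + lB);
}

// inverse Fermi transform to flatten the peaks at 0 and 1
double fermiTransform(double y, double tau = 15.0) {  // tau value illustrative
  return -1.0 / tau * std::log(1.0 / y - 1.0);
}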
Estimating PDF Kernels
Technical challenge: how to estimate the PDF shapes
Three ways: parametric fitting (of a function), nonparametric fitting, and event counting (illustrated in the slide for an original distribution that is Gaussian):
- Parametric fitting: difficult to automate for arbitrary PDFs
- Nonparametric fitting: easy to automate, but can create artefacts or suppress information
- Event counting: automatic and unbiased, but suboptimal
We have chosen to implement nonparametric fitting in TMVA:
- Binned shape interpolation using spline functions (orders: 1, 2, 3, 5)
- Unbinned kernel density estimation (KDE) with Gaussian smearing
TMVA performs automatic validation of goodness-of-fit.
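A minimal sketch of unbinned Gaussian KDE, the second option above: each training event is smeared with a Gaussian kernel. The fixed bandwidth h is an illustrative assumption; TMVA's KDE offers more refined (e.g. adaptive) settings.

#include <cmath>
#include <cstdio>
#include <vector>

// Gaussian KDE: sum of Gaussians of width h centered on the training events
double kde(double x, const std::vector<double>& sample, double h) {
  double sum = 0.0;
  for (double xi : sample)
    sum += std::exp(-0.5 * (x - xi) * (x - xi) / (h * h));
  return sum / (sample.size() * h * std::sqrt(2.0 * M_PI));
}

int main() {
  std::vector<double> sample = {-0.3, 0.1, 0.2, 0.4, 1.1};  // toy training data
  for (double x = -2; x <= 2; x += 0.5)
    std::printf("p(%.1f) = %.3f\n", x, kde(x, sample, 0.5));
  return 0;
}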
Multidimensional PDE (PDERS)
Extension of the one-dimensional PDE approach to n dimensions
Counts signal and background reference events (training sample) in the vicinity V of the test event
Volume V definition:
- Size: fixed (defined by the data: % of the max-min range or of the RMS) or adaptive (defined by the number of events in the search volume)
- Shape: box or ellipsoid
Improve the y_PDERS estimate within V by using various n-D kernel estimators (functions of the (normalized) distance between test and reference events)
Practical challenges:
- Needs a very large training sample (curse of dimensionality of kernel-based methods)
- No training, but slow evaluation
Search speed improvement with kd-tree event sorting
$$y_\mathrm{PDERS}(i) = \frac{n_S(i)/N_S}{n_S(i)/N_S + n_B(i)/N_B}, \qquad n_{S/B}(i) = \sum_{e\,\in\,\mathrm{S/B\ reference\ events\ in\ }V(i)} w_e$$
[Figure: signal (H1) and background (H0) reference events in the (x1, x2) plane, with the search volume V around a test event]
Carli-Koblitz, NIM A501, 576 (2003)
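As an illustration of the box-counting idea above, a minimal C++ sketch with a fixed box volume and a linear scan instead of TMVA's kd-tree; all names are illustrative. One would call yPDERS() once per test event.

#include <cmath>
#include <vector>

struct Event { std::vector<double> x; double w; };

// weighted count of reference events inside a box of half-side length around test
double countInBox(const std::vector<Event>& ref,
                  const std::vector<double>& test, double halfSide) {
  double n = 0.0;
  for (const Event& e : ref) {
    bool inside = true;
    for (size_t d = 0; d < test.size(); ++d)
      if (std::fabs(e.x[d] - test[d]) > halfSide) { inside = false; break; }
    if (inside) n += e.w;
  }
  return n;
}

// normalized counts as in the formula above; nS, nB are total sample sizes
double yPDERS(const std::vector<Event>& sig, const std::vector<Event>& bkg,
              const std::vector<double>& test, double halfSide,
              double nS, double nB) {
  double s = countInBox(sig, test, halfSide) / nS;
  double b = countInBox(bkg, test, halfSide) / nB;
  return (s + b > 0) ? s / (s + b) : 0.5;   // 0.5 = "no information" default
}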
Fisher's Linear Discriminant Analysis
Well-known, simple and elegant MVA method.
The Fisher analysis determines an axis (F_1,…,F_n) in the input-variable hyperspace such that a projection of events onto this axis separates signal and background as much as possible.
Optimal for linearly correlated Gaussian variables with different S and B means. A variable v with the same S and B sample means gets F_v = 0.
Projection:
$$y_\mathrm{Fi}(i) = F_0 + \sum_{v\,\in\,\{\mathrm{variables}\}} F_v\, x_v(i)$$
Coefficients:
$$F_v = \frac{\sqrt{N_S N_B}}{N_S + N_B} \sum_{l\,\in\,\{\mathrm{variables}\}} W_{vl}^{-1}\,\big(\bar{x}_{S,l} - \bar{x}_{B,l}\big), \qquad W = C_S + C_B$$
where W is the sum of the S and B covariance matrices.
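A minimal sketch of the coefficient formula for the two-variable case, with the 2x2 inversion done by hand; all names are illustrative. For more variables one would use a proper linear-algebra package (e.g. ROOT's TMatrixD) for the inversion.

#include <cmath>
#include <vector>

struct Mat2 { double a, b, c, d; };  // [[a, b], [c, d]]

Mat2 inverse(const Mat2& m) {
  double det = m.a * m.d - m.b * m.c;
  return { m.d / det, -m.b / det, -m.c / det, m.a / det };
}

// F_v = sqrt(NS*NB)/(NS+NB) * sum_l Winv[v][l] * (meanS[l] - meanB[l])
std::vector<double> fisherCoeffs(const Mat2& covS, const Mat2& covB,
                                 const std::vector<double>& meanS,
                                 const std::vector<double>& meanB,
                                 double nS, double nB) {
  Mat2 W  = { covS.a + covB.a, covS.b + covB.b,
              covS.c + covB.c, covS.d + covB.d };   // W = C_S + C_B
  Mat2 Wi = inverse(W);
  double norm = std::sqrt(nS * nB) / (nS + nB);
  double d0 = meanS[0] - meanB[0], d1 = meanS[1] - meanB[1];
  return { norm * (Wi.a * d0 + Wi.b * d1),
           norm * (Wi.c * d0 + Wi.d * d1) };
}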
Related classifier: Function Discriminant Analysis (FDA)
Fit any user-defined function of the input variables, requiring that signal events return 1 and background events 0.
- Parameter fitting: genetic algorithm, MINUIT, MC, and combinations
- Easy reproduction of the Fisher result, but can add nonlinearities
- Very transparent discriminator
Artificial Neural Network (ANN)
Multilayer perceptron (MLP): fully connected, feed-forward, k hidden layers
ANNs are non-linear discriminants; the non-linearity comes from the activation function. (Fisher is an ANN with a linear activation function.)
Training: back-propagation method
- Randomly feed signal and background events to the MLP and compare the desired output d ∈ {0,1} with the received output r ∈ (0,1): ε = d − r
- Correct the weights, depending on ε and the learning rate η
[Figure: MLP layout with N_var discriminating input variables feeding 1 input layer, k hidden layers with M_1 … M_k nodes, and 1 output layer with 1 output variable; weights w_ij connect consecutive layers; a typical activation function A is sketched]
$$x_j^{(k)} = A\!\left( w_{0j}^{(k)} + \sum_{i=1}^{M_{k-1}} w_{ij}^{(k)}\, x_i^{(k-1)} \right), \qquad x_i^{(0)},\ i = 1,\dots,N_\mathrm{var}: \ \text{input variables}$$
Weierstrass theorem: an MLP can approximate every continuous function to arbitrary precision with just one hidden layer and an infinite number of nodes.
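The layer formula above translates directly into a few lines of code. A minimal sketch with a sigmoid activation (the activation choice and all names are illustrative, not TMVA's internals):

#include <cmath>
#include <vector>

double sigmoid(double t) { return 1.0 / (1.0 + std::exp(-t)); }

// evaluate one layer: w[j][0] is the bias w_0j; w[j][i+1] multiplies input i
std::vector<double> evalLayer(const std::vector<double>& in,
                              const std::vector<std::vector<double>>& w) {
  std::vector<double> out(w.size());
  for (size_t j = 0; j < w.size(); ++j) {
    double t = w[j][0];                       // bias term
    for (size_t i = 0; i < in.size(); ++i)
      t += w[j][i + 1] * in[i];
    out[j] = sigmoid(t);                      // activation function A
  }
  return out;
}
// a full forward pass chains evalLayer() once per hidden layer plus output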
Boosted Decision Trees (BDT)
A DT is a series of cuts that splits the sample into ever smaller sets; leaves are assigned either S or B status.
Classifies events by following a sequence of cuts, depending on the event's variable content, until an S or B leaf is reached.
Growing: each split tries to maximize the gain in separation (Gini index).
A single DT is dimensionally robust and easy to understand, but not powerful.
1. Pruning: bottom-up pruning of a decision tree protects against overtraining by removing statistically insignificant nodes.
[Diagram: a node (S, B) is split into two daughter nodes (S1, B1) and (S2, B2)]
$$\mathrm{Gini} = \frac{S_1 B_1}{S_1 + B_1} + \frac{S_2 B_2}{S_2 + B_2}$$
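A direct transcription of the split criterion above as a minimal sketch (names illustrative): the best cut is the one that maximizes the reduction of the summed Gini index of the daughter nodes.

#include <cstdio>

// node Gini index weighted by node size: p(1-p)(S+B) = S*B/(S+B)
double gini(double s, double b) { return (s + b > 0) ? s * b / (s + b) : 0.0; }

// separation gain of splitting a node (S,B) into (S1,B1) and (S2,B2)
double separationGain(double s, double b, double s1, double b1) {
  return gini(s, b) - gini(s1, b1) - gini(s - s1, b - b1);
}

int main() {
  // a split that isolates most of the signal in one daughter has a large gain
  std::printf("gain = %.2f\n", separationGain(100, 100, 80, 10));
  return 0;
}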
2. Boosting (AdaBoost): increase the weight of incorrectly classified events and build a new DT.
- Final classifier: a "forest" of DTs, linearly combined
- Large coefficient for a DT with small misclassification
- Improved performance and stability
BDT requires only little tuning to achieve good performance
Predictive Learning via Rule Ensembles (RuleFit)
Follows the RuleFit approach by Friedman and Popescu. The model is a linear combination of rules, where a rule is a sequence of cuts defining a region in the input parameter space.
The problem to solve:
- Create the rule ensemble: use a forest of decision trees, either from a BDT or from a random-forest generator (TMVA)
- Fit the coefficients a_m, b_k, minimizing the risk of misclassification (Friedman et al.)
Pruning removes topologically equal rules (same variables in the cut sequence).
Friedman and Popescu, Tech. Rep., Statistics Dept., Stanford U., 2003
RuleFit classifier:
$$y_\mathrm{RF}(\vec{x}) = a_0 + \sum_{m=1}^{M_R} a_m\, r_m(\vec{x}) + \sum_{k=1}^{n} b_k\, \hat{x}_k$$
The first sum runs over the rules (a rule gives r_m = 1 if all of its cuts are satisfied, 0 otherwise); the second is a linear Fisher term over the normalised discriminating event variables $\hat{x}_k$.
Support Vector Machine
Find the hyperplane that separates linearly separable signal and background (idea from 1962)
Best separation: maximum distance (margin) between the hyperplane and the closest events (support vectors). Wrongly classified events add an extra term to the cost function, which is minimized.
Non-linear cases: transform the variables into a higher-dimensional space where a linear boundary (hyperplane) can again separate the data (an idea only from the mid-1990s).
The explicit form of the transformation is not required; the cost function depends only on scalar products between events: use kernel functions to represent the scalar products between the transformed vectors in the higher-dimensional space.
Choose a kernel and fit the hyperplane using the linear techniques developed above.
Available Kernels: Gaussian, Polynomial, Sigmoid
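A minimal sketch of the first of these, the Gaussian kernel, which stands in for the scalar product of the implicitly transformed vectors; the width parameter gamma and the names are illustrative.

#include <cmath>
#include <vector>

// Gaussian kernel: K(x, y) = <phi(x), phi(y)> = exp(-gamma * |x - y|^2)
double gaussKernel(const std::vector<double>& x, const std::vector<double>& y,
                   double gamma) {
  double d2 = 0.0;
  for (size_t i = 0; i < x.size(); ++i)
    d2 += (x[i] - y[i]) * (x[i] - y[i]);
  return std::exp(-gamma * d2);
}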
[Figure: left, separable data in the (x1, x2) plane with the optimal hyperplane, margin, and support vectors; right, non-separable data mapped by φ(x1, x2) into a higher-dimensional space (x1, x2, x3) where a separating hyperplane exists]
Data Preprocessing: Decorrelation
Various classifiers perform sub-optimally in the presence of correlations between input variables (Cuts, projective LH); others become slower (BDT, RuleFit).
Removal of linear correlations by rotating the input variables:
- Determine the square root C′ of the covariance matrix C, i.e., C = C′C′
- Transform the original (x_i) into the decorrelated variable space (x_i′) by x′ = C′⁻¹x
Also implemented Principal Component Analysis (PCA)
Note that decorrelation is only complete if the correlations are linear and the input variables are Gaussian distributed; in general this is not a very accurate assumption. A sketch of the square-root transformation follows below.
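A minimal sketch of the square-root decorrelation using ROOT's linear-algebra classes; it follows the formula above via an eigen decomposition (C′⁻¹ = V diag(1/√λ) Vᵀ), though TMVA's internal implementation may differ in details.

#include "TMatrixD.h"
#include "TMatrixDSym.h"
#include "TMatrixDSymEigen.h"
#include "TVectorD.h"
#include <cmath>

// Decorrelate an input vector x with the symmetric square root of the
// covariance matrix C: x' = C'^{-1} x, where C = C'C'.
TVectorD decorrelate(const TMatrixDSym& cov, const TVectorD& x) {
  TMatrixDSymEigen eigen(cov);
  TMatrixD V      = eigen.GetEigenVectors();   // eigenvectors of C
  TVectorD lambda = eigen.GetEigenValues();    // eigenvalues of C
  TMatrixD D(V.GetNrows(), V.GetNcols());      // diag(1/sqrt(lambda))
  for (int i = 0; i < D.GetNrows(); ++i) D(i, i) = 1.0 / std::sqrt(lambda(i));
  TMatrixD Vt(TMatrixD::kTransposed, V);
  TMatrixD CprimeInv = V * D * Vt;             // C'^{-1}
  return CprimeInv * x;
}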
[Figure: scatter plots of the original, square-root decorrelated, and PCA-decorrelated variables]
Is There a Best Classifier?
- Performance: in the presence/absence of linear/nonlinear correlations
- Speed: training/evaluation time
- Robustness, stability: sensitivity to overtraining and to weak input variables; size of the training sample
- Dimensional scalability: do performance, speed, and robustness deteriorate with many dimensions?
- Clarity: can the learning procedure/result be easily understood/visualized?
No Single Best
[Table: classifiers (Cuts, Likelihood, PDERS/k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit, SVM) rated against the criteria above: performance (for no/linear correlations and for nonlinear correlations), speed (training, response), robustness (overtraining, weak input variables), curse of dimensionality, clarity]
What is TMVA
Motivation:
Classifiers perform very differently depending on the data; all should be tested on a given problem.
The situation for many years: usually only a small number of classifiers were investigated by analysts. A tool was needed that enables the analyst to simultaneously evaluate the performance of a large number of classifiers on his/her dataset.
Design Criteria: Performance and Convenience (A good tool does not have to be difficult to use)
- Training, testing, and evaluation of many classifiers in parallel
- Preprocessing of input data: decorrelation (PCA, Gaussianization)
- Illustrative tools to compare the performance of all classifiers (ranking of classifiers, ranking of input variables, choice of working point)
- Active protection against overtraining
- Straightforward application to test data
Special needs of high-energy physics are addressed: two classes, event weights, familiar terminology.
A typical TMVA analysis consists of two main steps:
1. Training phase: training, testing, and evaluation of classifiers using data samples with known signal and background composition
2. Application phase: using selected trained classifiers to classify unknown data samples
Using TMVA
Technical Aspects
TMVA is open source, written in C++, and based on (and part of) ROOT
Development on SourceForge, where all the information is available: http://sf.tmva.net. Bundled with ROOT since 5.11-03.
Training requires a ROOT environment; the resulting classifiers are also available as standalone C++ code.
- Six core developers, many contributors
- > 1400 downloads since March 2006 (not counting ROOT users)
- Mailing list for reporting problems
Users Guide at http://sf.tmva.net: 97 pages, classifier descriptions, code examples
arXiv physics/0703039
Training with TMVA
The user usually starts with the template TMVAnalysis.C:
- Choose the training variables
- Choose the input data
- Select classifiers (and adjust their training options, described in the manual; specifying the option 'H' prints help)
Template TMVAnalysis.C (also .py) available at $TMVA/macros/ and $ROOTSYS/tmva/test/
TMVA GUI
Evaluation results ranked by best signal efficiency and purity (area)
------------------------------------------------------------------------------
MVA           Signal efficiency at bkg eff. (error):        | Sepa-   Signifi-
Methods:      @B=0.01    @B=0.10    @B=0.30    Area         | ration: cance:
------------------------------------------------------------------------------
Fisher      : 0.268(03)  0.653(03)  0.873(02)  0.882        | 0.444   1.189
MLP         : 0.266(03)  0.656(03)  0.873(02)  0.882        | 0.444   1.260
LikelihoodD : 0.259(03)  0.649(03)  0.871(02)  0.880        | 0.441   1.251
PDERS       : 0.223(03)  0.628(03)  0.861(02)  0.870        | 0.417   1.192
RuleFit     : 0.196(03)  0.607(03)  0.845(02)  0.859        | 0.390   1.092
HMatrix     : 0.058(01)  0.622(03)  0.868(02)  0.855        | 0.410   1.093
BDT         : 0.154(02)  0.594(04)  0.838(03)  0.852        | 0.380   1.099
CutsGA      : 0.109(02)  1.000(00)  0.717(03)  0.784        | 0.000   0.000
Likelihood  : 0.086(02)  0.387(03)  0.677(03)  0.757        | 0.199   0.682
------------------------------------------------------------------------------
Testing efficiency compared to training efficiency (overtraining check)
------------------------------------------------------------------------------
MVA           Signal efficiency: from test sample (from training sample)
Methods:      @B=0.01            @B=0.10            @B=0.30
------------------------------------------------------------------------------
Fisher      : 0.268 (0.275)      0.653 (0.658)      0.873 (0.873)
MLP         : 0.266 (0.278)      0.656 (0.658)      0.873 (0.873)
LikelihoodD : 0.259 (0.273)      0.649 (0.657)      0.871 (0.872)
PDERS       : 0.223 (0.389)      0.628 (0.691)      0.861 (0.881)
RuleFit     : 0.196 (0.198)      0.607 (0.616)      0.845 (0.848)
HMatrix     : 0.058 (0.060)      0.622 (0.623)      0.868 (0.868)
BDT         : 0.154 (0.268)      0.594 (0.736)      0.838 (0.911)
CutsGA      : 0.109 (0.123)      1.000 (0.424)      0.717 (0.715)
Likelihood  : 0.086 (0.092)      0.387 (0.379)      0.677 (0.677)
------------------------------------------------------------------------------
Evaluation Output (better classifiers rank higher in the tables above)

Remark on overtraining:
- Occurs when the classifier training becomes sensitive to the events of the particular training sample, rather than just to its generic features
- Sensitivity to overtraining depends on the classifier: e.g., Fisher is insensitive, BDTs are very sensitive
- Detect overtraining: compare performance between training and test samples
- Counteract overtraining: e.g., smooth likelihood PDFs, prune decision trees, …
More Evaluation Output
--- Fisher : Ranking result (top variable is best ranked)
--- Fisher : ---------------------------------------------
--- Fisher : Rank : Variable : Discr. power
--- Fisher : ---------------------------------------------
--- Fisher : 1    : var4     : 2.175e-01
--- Fisher : 2    : var3     : 1.718e-01
--- Fisher : 3    : var1     : 9.549e-02
--- Fisher : 4    : var2     : 2.841e-02
--- Fisher : ---------------------------------------------
(better variables rank higher)
--- Factory : Inter-MVA overlap matrix (signal):
--- Factory : ------------------------------
--- Factory :             Likelihood  Fisher
--- Factory : Likelihood:     +1.000  +0.667
--- Factory : Fisher:         +0.667  +1.000
--- Factory : ------------------------------
Input Variable Ranking
• How useful is a variable?

Classifier correlation and overlap
• Do the classifiers perform the same separation into signal and background? If two classifiers have similar performance but significantly non-overlapping classifications, check whether you can combine them!
Graphical Evaluation
Classifier output distributions for the independent test sample:
Graphical Evaluation
There is no unique way to express the performance of a classifier, so TMVA computes several benchmark quantities:
Signal efficiency at various background efficiencies (= 1 − rejection) when cutting on the classifier output
The separation:
$$\langle S^2 \rangle = \frac{1}{2} \int \frac{\big(\hat{y}_S(y) - \hat{y}_B(y)\big)^2}{\hat{y}_S(y) + \hat{y}_B(y)}\, dy$$
The "Rarity" is implemented (background flat): comparison of signal shapes between different classifiers; quick check: the background on data should be flat.
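The separation integral above is easy to compute numerically from binned classifier-output distributions. A minimal sketch, assuming yS and yB hold the PDF values (densities, unit-normalized) in bins of common uniform width:

#include <vector>

// <S^2> from binned, unit-normalized output distributions yS and yB
double separation(const std::vector<double>& yS, const std::vector<double>& yB,
                  double binWidth) {
  double sep = 0.0;
  for (size_t i = 0; i < yS.size(); ++i) {
    double sum = yS[i] + yB[i];
    if (sum > 0) sep += (yS[i] - yB[i]) * (yS[i] - yB[i]) / sum;
  }
  return 0.5 * sep * binWidth;   // the 1/2 and the dy of the integral
}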
Visualization Using the GUI
Projective likelihood PDFs, MLP training, BDTs, …
average no. of nodes before/after pruning: 4193 / 968
Choosing a Working Point
Depending on the problem, the user might want to:
- Achieve a certain signal purity, signal efficiency, or background reduction, or
- Find the selection that results in the highest signal significance (depending on the expected signal and background statistics)
Using the TMVA graphical output, one can determine at which classifier output value to cut to separate signal from background; a minimal scan is sketched below.
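A minimal sketch of such a working-point scan (not a TMVA function): given expected yields and the efficiency curves epsS(cut), epsB(cut) read off from TMVA's output, pick the cut maximizing S/sqrt(S+B). The significance choice and all names are illustrative.

#include <cmath>
#include <vector>

// return the cut value that maximizes S/sqrt(S+B)
double bestCut(const std::vector<double>& cuts,
               const std::vector<double>& epsS, const std::vector<double>& epsB,
               double nSigExp, double nBkgExp) {
  double best = cuts.front(), bestSig = -1.0;
  for (size_t i = 0; i < cuts.size(); ++i) {
    double s = nSigExp * epsS[i], b = nBkgExp * epsB[i];
    double sig = (s + b > 0) ? s / std::sqrt(s + b) : 0.0;
    if (sig > bestSig) { bestSig = sig; best = cuts[i]; }
  }
  return best;
}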
Applying the Trained Classifier
Use the TMVA::Reader class; an example is in TMVApplication.C:
- Set the input variables
- Book the classifier with the weight file (contains all information)
- Compute the classifier response inside the event loop and use it
Also standalone C++ class without ROOT dependence
Templates TMVApplication.C and ClassApplication.C available at $TMVA/macros/ and $ROOTSYS/tmva/test/
std::vector<std::string> inputVars;
…
classifier = new ReadMLP( inputVars );

for (int i = 0; i < nEv; i++) {
  std::vector<double> inputVec = …;   // fill with this event's variable values
  double retval = classifier->GetMvaValue( inputVec );
}
from ClassApplication.C
Extending TMVA
A user might have their own implementation of a multivariate classifier, or might want to use an external one.
With ROOT 5.18.00 (16 Jan 2008), the user can seamlessly evaluate and compare their own classifier within TMVA:
Requirement: the user's class must derive from TMVA::MethodBase and implement the TMVA::IMethod interface
The class must be added to the factory via ROOT’s plugin mechanism
Training, testing, evaluation, and comparison can then be done as usual.
Example in TMVAnalysis.C
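A hedged sketch of what that plugin registration might look like with ROOT's plugin manager; the handler string, class name, library name, and constructor signature below are illustrative placeholders, not verified against TMVAnalysis.C.

#include "TROOT.h"
#include "TPluginManager.h"

void registerMyMethod() {
  gROOT->GetPluginManager()->AddHandler(
      "TMVA@@MethodBase",        // base-class tag used by the TMVA factory (assumed)
      ".*_MyMethod.*",           // regexp matched against the method name (assumed)
      "TMVA::MethodMyMethod",    // user class deriving from TMVA::MethodBase
      "MyMethodLib",             // shared library containing the class (illustrative)
      "MethodMyMethod(TString,TString,DataSet&,TString)");  // ctor signature (illustrative)
}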
A Word on Treatment of Systematics?
Some things could be done. Example: var4 may in reality have a shifted central value and hence a worse discrimination power.
One can ignore the systematic in the training:
- var4 appears stronger in training than it might be
- suboptimal performance (bad training, not wrong)
- the classifier response will strongly depend on var4, and hence will have a larger systematic uncertainty
Better: Train with shifted (weakened) var4
Then evaluate systematic error on classifier output
There is no difference in principle between systematics evaluation for single discriminating variables and for MV classifiers.
Use a control sample to estimate the uncertainty on the classifier output (not necessarily for each input variable).
• Advantage: correlations automatically taken into account
Concluding Remarks
Multivariate classifiers are no black boxes; we just need to understand them.
- Cuts and Likelihood are transparent: if they perform well, use them
- In the presence of correlations, other classifiers are better, though harder to understand at any rate
Enormous growth of acceptance in HEP over the recent decade
TMVA provides means to train, evaluate, compare, and apply different classifiers
TMVA also tries, through visualization, to improve the understanding of the internals of each classifier.
Acknowledgments: The fast development of TMVA would not have been possible without the contribution and feedback from many developers and users to whom we are indebted. We thank in particular the CERN Summer students Matt Jachowski (Stanford) for the implementation of TMVA's new MLP neural network, Yair Mahalalel (Tel Aviv) for a significant improvement of PDERS, and Or Cohen for the development of the general classifier boosting, the Krakow student Andrzej Zemla and his supervisor Marcin Wolter for programming a powerful Support Vector Machine, as well as Rustem Ospanov for the development of a fast k-NN algorithm. We are grateful to Doug Applegate, Kregg Arms, René Brun and the ROOT team, Tancredi Carli, Zhiyi Liu, Elzbieta Richter-Was, Vincent Tisserand and Alexei Volk for helpful conversations.
Outlook
Primary development since this summer: generalized classifiers
1. Be able to boost or bag any classifier
2. Combine any classifier with any other classifier, using any combination of input variables in any phase-space region
Item 1 is ready and now in testing mode; to be deployed after the upcoming ROOT release.
A Few Toy Examples
Checker Board Example
- Performance achieved without parameter tuning: PDERS and BDT are the best "out of the box" classifiers
- After specific tuning, SVM and MLP also perform well
Theoretical maximum
Linear, Cross, and Circular Correlations
Illustrate the behavior of linear and nonlinear classifiers:
Linear correlations(same for signal and background)
Linear correlations(opposite for signal and background)
Circular correlations(same for signal and background)
Linear, Cross, and Circular Correlations
Plot of test events weighted by classifier output (red: signal-like, blue: background-like):
Linear correlations(same for signal and background)
Cross-linear correlations(opposite for signal and background)
Circular correlations(same for signal and background)
[Panel labels: Likelihood, Likelihood-D, PDERS, Fisher, MLP, BDT]
Final Performance
Background rejection versus signal efficiency curves:
[Panels: Linear Example, Cross Example, Circular Example]
Additional Information
Stability with Respect to Irrelevant Variables
Toy example with 2 discriminating and 4 non-discriminating variables:
- use only the two discriminating variables in the classifiers
- use all six variables in the classifiers
TMVAnalysis.C Script for Training

void TMVAnalysis( ) {
  TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );

  // create Factory
  TMVA::Factory *factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

  // give training/test trees
  TFile *input = TFile::Open("tmva_example.root");
  factory->AddSignalTree    ( (TTree*)input->Get("TreeS"), 1.0 );
  factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );

  // register input variables
  factory->AddVariable("var1+var2", 'F');
  factory->AddVariable("var1-var2", 'F');
  factory->AddVariable("var3", 'F');
  factory->AddVariable("var4", 'F');

  factory->PrepareTrainingAndTestTree("", "NSigTrain=3000:NBkgTrain=3000:SplitMode=Random:!V");

  // select MVA methods
  factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
                       "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
  factory->BookMethod( TMVA::Types::kMLP, "MLP",
                       "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

  // train, test and evaluate
  factory->TrainAllMethods();
  factory->TestAllMethods();
  factory->EvaluateAllMethods();

  outputFile->Close();
  delete factory;
}
TMVApplication.C Script for Application

void TMVApplication( ) {
  // create Reader
  TMVA::Reader *reader = new TMVA::Reader("!Color");

  // register the variables
  Float_t var1, var2, var3, var4;
  reader->AddVariable( "var1+var2", &var1 );
  reader->AddVariable( "var1-var2", &var2 );
  reader->AddVariable( "var3", &var3 );
  reader->AddVariable( "var4", &var4 );

  // book classifier(s)
  reader->BookMVA( "MLP classifier", "weights/MVAnalysis_MLP.weights.txt" );

  // prepare event loop
  TFile *input = TFile::Open("tmva_example.root");
  TTree* theTree = (TTree*)input->Get("TreeS");
  // … set branch addresses for user TTree
  for (Long64_t ievt = 3000; ievt < theTree->GetEntries(); ievt++) {
    theTree->GetEntry(ievt);

    // compute input variables
    var1 = userVar1 + userVar2;
    var2 = userVar1 - userVar2;
    var3 = userVar3;
    var4 = userVar4;

    // calculate classifier output
    Double_t out = reader->EvaluateMVA( "MLP classifier" );
    // do something with it …
  }
  delete reader;
}