Machine Learning with TMVA: A ROOT-based Tool for Multivariate Data Analysis
DESY Computing Seminar, Hamburg, 14.1.2008
The TMVA developer team: Andreas Höcker, Peter Speckmeyer, Jörg Stelzer, Helge Voss
General Event Classification Problem
Event described by k variables (found to be discriminating): x = (x_1,…,x_k) ∈ ℝ^k
Events can be classified into n categories: H_1 … H_n
General classifier: f: ℝ^k → ℕ, x ↦ {1,…,n}
TMVA: only n = 2, commonly the case in HEP (signal/background)
Most classification methods: f: ℝ^k → ℝ^d, x ↦ y
Further: ℝ^d → ℕ, y ↦ {1,…,n}
TMVA: d = 1; y ≥ y_sep: signal, y < y_sep: background
[Figure: scatter plot in the (x1, x2) plane with three categories H1, H2, H3; example with k = 2, n = 3]
General Event Classification Problem
Example: k=2 variables x1,2, n=3 categories H1, H2, H3
The problem: How to draw the boundaries between H1, H2, and H3 such that f(x) returns the true nature of x with maximum correctness
[Figure: three panels in the (x1, x2) plane sketching possible decision boundaries between H1, H2, H3: non-linear boundaries, linear boundaries, rectangular cuts]
A simple example: one can draw the boundaries by hand.
Large input variable space, complex correlations: manual optimization very difficult
Two general ways to build f(x):
- Supervised learning: the category of each event in the training sample is known. The machine adapts to give the smallest misclassification error on the training sample.
- Unsupervised learning: the correct category of each event is unknown. The machinery tries to discover structures in the dataset.
All classifiers in TMVA are supervised learning methods
General Event Classification Problem
1. What is the optimal boundary f(x) to separate the categories?
2. More pragmatic: which classifier is best at finding (or most closely estimating) this optimal boundary?
(This adaptation from data is what is meant here by "machine learning".)
Classification Problems in HEP
In HEP mostly two-class problems: signal (S) and background (B)
- Event level (Higgs searches, …)
- Cone level (tau-vs-jet reconstruction, …)
- Track level (particle identification, …)
- Lifetime and flavour tagging (b-tagging, …)
- …
Input information:
- Kinematic variables (masses, momenta, decay angles, …)
- Event properties (jet/lepton multiplicity, sum of charges, …)
- Event shape (sphericity, Fox-Wolfram moments, …)
- Detector response (silicon hits, dE/dx, Cherenkov angle, shower profiles, muon hits, …)
- …
Classifiers in TMVA
Rectangular Cut Optimization
Intuitive and simple: rectangular volumes in variable space
Technical challenge: cut optimization
- MINUIT fit (simplex): was found not to be reliable
- Monte Carlo sampling: random scanning of the parameter space; inefficient for large numbers of input variables
- Genetic algorithm: the preferred method. Samples of cut-sets (a population) are evaluated, and the fittest individuals are cross-bred (including mutation) to create a new generation (see the sketch below). The genetic algorithm can also be used as a standalone optimizer outside the TMVA framework.
- Simulated annealing: its performance still needs optimization. It simulates the slow cooling of metal; a temperature-dependent perturbation probability allows recovery from local minima.
Cuts usually benefit from prior decorrelation of cut variables
The cut classifier returns a binary answer:
$$y_\mathrm{cut}(i) \in \{0,1\}, \qquad y_\mathrm{cut}(i) = 1 \iff x_v(i) \in [x_{v,\mathrm{min}},\, x_{v,\mathrm{max}}] \ \ \forall\, v \in \{\mathrm{variables}\}$$
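As a rough illustration of the genetic-algorithm idea (this is not TMVA's implementation): a population of cut windows on one toy variable is scored, the fittest half survives, and new individuals are cross-bred with mutation. The fitness choice S/sqrt(S+B) and all names here are illustrative assumptions.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Individual { double lo, hi, fitness; };

// fitness of a cut window [lo, hi]: here S/sqrt(S+B) (illustrative choice)
double fitness(double lo, double hi, const std::vector<double>& sig,
               const std::vector<double>& bkg) {
  double s = 0, b = 0;
  for (double x : sig) if (x > lo && x < hi) ++s;
  for (double x : bkg) if (x > lo && x < hi) ++b;
  return (s + b > 0) ? s / std::sqrt(s + b) : 0.0;
}

double rnd(double a, double b) { return a + (b - a) * std::rand() / RAND_MAX; }

int main() {
  std::vector<double> sig, bkg;                       // toy samples
  for (int i = 0; i < 1000; ++i) { sig.push_back(rnd(0.5, 1.5)); bkg.push_back(rnd(-3, 3)); }
  std::vector<Individual> pop(50);                    // population of cut-sets
  for (auto& p : pop) { p.lo = rnd(-3, 3); p.hi = rnd(p.lo, 3); }
  for (int gen = 0; gen < 100; ++gen) {
    for (auto& p : pop) p.fitness = fitness(p.lo, p.hi, sig, bkg);
    std::sort(pop.begin(), pop.end(),
              [](const Individual& a, const Individual& b) { return a.fitness > b.fitness; });
    // cross-breed the fittest half (here: average two parents) with mutation
    for (size_t i = pop.size() / 2; i < pop.size(); ++i) {
      const Individual& ma = pop[std::rand() % (pop.size() / 2)];
      const Individual& pa = pop[std::rand() % (pop.size() / 2)];
      pop[i].lo = 0.5 * (ma.lo + pa.lo) + rnd(-0.1, 0.1);
      pop[i].hi = 0.5 * (ma.hi + pa.hi) + rnd(-0.1, 0.1);
    }
  }
  std::printf("best cut: %.2f < x < %.2f\n", pop[0].lo, pop[0].hi);
  return 0;
}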
Projective Likelihood Estimator (PDE)
Probability density estimators for each input variable are combined into a likelihood estimator.
Optimal MVA approach if the variables are uncorrelated; in practice this is rarely the case. Solution: de-correlate the input or use a different method.
Reference PDFs are automatically generated from the training data: histograms (counting), splines (order 2, 3, 5), or an unbinned kernel estimator.
The output of the likelihood estimator is often strongly peaked at 0 and 1. To ease output parameterization, TMVA applies an inverse Fermi transformation.
$$y_\mathcal{L}(i) = \frac{\mathcal{L}_S(i)}{\mathcal{L}_S(i) + \mathcal{L}_B(i)}, \qquad \mathcal{L}_{S/B}(i) = \prod_{v\,\in\,\{\mathrm{variables}\}} p_{S/B,v}\big(x_v(i)\big)$$
where the $p_{S/B,v}$ are the reference PDFs.
Inverse Fermi transformation:
$$y'_\mathcal{L}(i) = -\tau^{-1} \ln\big(y_\mathcal{L}^{-1}(i) - 1\big)$$
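To make the formulas above concrete, here is a minimal C++ sketch (not TMVA's code): the per-variable reference PDFs are assumed to be pre-filled, uniformly binned histograms, and the steepness parameter tau of the Fermi transform is an illustrative choice.

#include <cmath>
#include <vector>

// evaluate a binned PDF (uniform binning on [xmin, xmax])
double pdfEval(const std::vector<double>& bins, double x, double xmin, double xmax) {
  int n = (int)bins.size();
  int i = (int)((x - xmin) / (xmax - xmin) * n);
  if (i < 0) i = 0;
  if (i >= n) i = n - 1;
  return bins[i];
}

// y_L = L_S / (L_S + L_B), with L = product of per-variable PDF values
double likelihoodRatio(const std::vector<double>& x,
                       const std::vector<std::vector<double>>& pdfS,
                       const std::vector<std::vector<double>>& pdfB,
                       double xmin, double xmax) {
  double lS = 1.0, lB = 1.0;
  for (size_t v = 0; v < x.size(); ++v) {
    lS *= pdfEval(pdfS[v], x[v], xmin, xmax);
    lB *= pdfEval(pdfB[v], x[v], xmin, xmax);
  }
  return lS / (lS + lB);
}

// inverse Fermi transform to flatten the peaks at 0 and 1
double fermiTransform(double y, double tau = 15.0) {  // tau value illustrative
  return -1.0 / tau * std::log(1.0 / y - 1.0);
}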
Estimating PDF Kernels
Technical challenge: how to estimate the PDF shapes
Three ways: parametric fitting (of a function), nonparametric fitting, and event counting (illustrated in the slide for an original distribution that is Gaussian):
- Parametric fitting: difficult to automate for arbitrary PDFs
- Nonparametric fitting: easy to automate, but can create artefacts or suppress information
- Event counting: automatic and unbiased, but suboptimal
We have chosen to implement nonparametric fitting in TMVA:
- Binned shape interpolation using spline functions (orders: 1, 2, 3, 5)
- Unbinned kernel density estimation (KDE) with Gaussian smearing
TMVA performs automatic validation of goodness-of-fit.
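A minimal sketch of unbinned Gaussian KDE, the second option above: each training event is smeared with a Gaussian kernel. The fixed bandwidth h is an illustrative assumption; TMVA's KDE offers more refined (e.g. adaptive) settings.

#include <cmath>
#include <cstdio>
#include <vector>

// Gaussian KDE: sum of Gaussians of width h centered on the training events
double kde(double x, const std::vector<double>& sample, double h) {
  double sum = 0.0;
  for (double xi : sample)
    sum += std::exp(-0.5 * (x - xi) * (x - xi) / (h * h));
  return sum / (sample.size() * h * std::sqrt(2.0 * M_PI));
}

int main() {
  std::vector<double> sample = {-0.3, 0.1, 0.2, 0.4, 1.1};  // toy training data
  for (double x = -2; x <= 2; x += 0.5)
    std::printf("p(%.1f) = %.3f\n", x, kde(x, sample, 0.5));
  return 0;
}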
Multidimensional PDE (PDERS)
Extension of the one-dimensional PDE approach to n dimensions
Counts signal and background reference events (training sample) in the vicinity V of the test event
Volume V definition:
- Size: fixed (defined by the data: % of the max-min range or of the RMS) or adaptive (defined by the number of events in the search volume)
- Shape: box or ellipsoid
Improve the y_PDERS estimate within V by using various n-D kernel estimators (functions of the (normalized) distance between test and reference events)
Practical challenges:
- Needs a very large training sample (curse of dimensionality of kernel-based methods)
- No training, but slow evaluation
Search speed improvement with kd-tree event sorting
$$y_\mathrm{PDERS}(i) = \frac{n_S(i)/N_S}{n_S(i)/N_S + n_B(i)/N_B}, \qquad n_{S/B}(i) = \sum_{e\,\in\,\mathrm{S/B\ reference\ events\ in\ }V(i)} w_e$$
[Figure: signal (H1) and background (H0) reference events in the (x1, x2) plane, with the search volume V around a test event]
Carli-Koblitz, NIM A501, 576 (2003)
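As an illustration of the box-counting idea above, a minimal C++ sketch with a fixed box volume and a linear scan instead of TMVA's kd-tree; all names are illustrative. One would call yPDERS() once per test event.

#include <cmath>
#include <vector>

struct Event { std::vector<double> x; double w; };

// weighted count of reference events inside a box of half-side length around test
double countInBox(const std::vector<Event>& ref,
                  const std::vector<double>& test, double halfSide) {
  double n = 0.0;
  for (const Event& e : ref) {
    bool inside = true;
    for (size_t d = 0; d < test.size(); ++d)
      if (std::fabs(e.x[d] - test[d]) > halfSide) { inside = false; break; }
    if (inside) n += e.w;
  }
  return n;
}

// normalized counts as in the formula above; nS, nB are total sample sizes
double yPDERS(const std::vector<Event>& sig, const std::vector<Event>& bkg,
              const std::vector<double>& test, double halfSide,
              double nS, double nB) {
  double s = countInBox(sig, test, halfSide) / nS;
  double b = countInBox(bkg, test, halfSide) / nB;
  return (s + b > 0) ? s / (s + b) : 0.5;   // 0.5 = "no information" default
}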
Fisher's Linear Discriminant Analysis
Well-known, simple and elegant MVA method.
The Fisher analysis determines an axis (F_1,…,F_n) in the input-variable hyperspace such that a projection of events onto this axis separates signal and background as much as possible.
Optimal for linearly correlated Gaussian variables with different S and B means. A variable v with the same S and B sample means gets F_v = 0.
Projection:
$$y_\mathrm{Fi}(i) = F_0 + \sum_{v\,\in\,\{\mathrm{variables}\}} F_v\, x_v(i)$$
Coefficients:
$$F_v = \frac{\sqrt{N_S N_B}}{N_S + N_B} \sum_{l\,\in\,\{\mathrm{variables}\}} W_{vl}^{-1}\,\big(\bar{x}_{S,l} - \bar{x}_{B,l}\big), \qquad W = C_S + C_B$$
where W is the sum of the S and B covariance matrices.
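A minimal sketch of the coefficient formula for the two-variable case, with the 2x2 inversion done by hand; all names are illustrative. For more variables one would use a proper linear-algebra package (e.g. ROOT's TMatrixD) for the inversion.

#include <cmath>
#include <vector>

struct Mat2 { double a, b, c, d; };  // [[a, b], [c, d]]

Mat2 inverse(const Mat2& m) {
  double det = m.a * m.d - m.b * m.c;
  return { m.d / det, -m.b / det, -m.c / det, m.a / det };
}

// F_v = sqrt(NS*NB)/(NS+NB) * sum_l Winv[v][l] * (meanS[l] - meanB[l])
std::vector<double> fisherCoeffs(const Mat2& covS, const Mat2& covB,
                                 const std::vector<double>& meanS,
                                 const std::vector<double>& meanB,
                                 double nS, double nB) {
  Mat2 W  = { covS.a + covB.a, covS.b + covB.b,
              covS.c + covB.c, covS.d + covB.d };   // W = C_S + C_B
  Mat2 Wi = inverse(W);
  double norm = std::sqrt(nS * nB) / (nS + nB);
  double d0 = meanS[0] - meanB[0], d1 = meanS[1] - meanB[1];
  return { norm * (Wi.a * d0 + Wi.b * d1),
           norm * (Wi.c * d0 + Wi.d * d1) };
}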
Related classifier: Function Discriminant Analysis (FDA)
Fit any user-defined function of the input variables, requiring that signal events return 1 and background events 0.
- Parameter fitting: genetic algorithm, MINUIT, MC, and combinations
- Easy reproduction of the Fisher result, but can add nonlinearities
- Very transparent discriminator
Artificial Neural Network (ANN)
Multilayer perceptron (MLP): fully connected, feed-forward, k hidden layers
ANNs are non-linear discriminants; the non-linearity comes from the activation function. (Fisher is an ANN with a linear activation function.)
Training: back-propagation method
- Randomly feed signal and background events to the MLP and compare the desired output d ∈ {0,1} with the received output r ∈ (0,1): ε = d − r
- Correct the weights, depending on ε and the learning rate η
[Figure: MLP layout with N_var discriminating input variables feeding 1 input layer, k hidden layers with M_1 … M_k nodes, and 1 output layer with 1 output variable; weights w_ij connect consecutive layers; a typical activation function A is sketched]
$$x_j^{(k)} = A\!\left( w_{0j}^{(k)} + \sum_{i=1}^{M_{k-1}} w_{ij}^{(k)}\, x_i^{(k-1)} \right), \qquad x_i^{(0)},\ i = 1,\dots,N_\mathrm{var}: \ \text{input variables}$$
Weierstrass theorem: an MLP can approximate every continuous function to arbitrary precision with just one hidden layer and an infinite number of nodes.
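The layer formula above translates directly into a few lines of code. A minimal sketch with a sigmoid activation (the activation choice and all names are illustrative, not TMVA's internals):

#include <cmath>
#include <vector>

double sigmoid(double t) { return 1.0 / (1.0 + std::exp(-t)); }

// evaluate one layer: w[j][0] is the bias w_0j; w[j][i+1] multiplies input i
std::vector<double> evalLayer(const std::vector<double>& in,
                              const std::vector<std::vector<double>>& w) {
  std::vector<double> out(w.size());
  for (size_t j = 0; j < w.size(); ++j) {
    double t = w[j][0];                       // bias term
    for (size_t i = 0; i < in.size(); ++i)
      t += w[j][i + 1] * in[i];
    out[j] = sigmoid(t);                      // activation function A
  }
  return out;
}
// a full forward pass chains evalLayer() once per hidden layer plus output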
Boosted Decision Trees (BDT)
A DT is a series of cuts that splits the sample into ever smaller sets; leaves are assigned either S or B status.
Classifies events by following a sequence of cuts, depending on the event's variable content, until an S or B leaf is reached.
Growing: each split tries to maximize the gain in separation (Gini index).
A single DT is dimensionally robust and easy to understand, but not powerful.
1. Pruning: bottom-up pruning of a decision tree protects against overtraining by removing statistically insignificant nodes.
[Diagram: a node (S, B) is split into two daughter nodes (S1, B1) and (S2, B2)]
$$\mathrm{Gini} = \frac{S_1 B_1}{S_1 + B_1} + \frac{S_2 B_2}{S_2 + B_2}$$
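A direct transcription of the split criterion above as a minimal sketch (names illustrative): the best cut is the one that maximizes the reduction of the summed Gini index of the daughter nodes.

#include <cstdio>

// node Gini index weighted by node size: p(1-p)(S+B) = S*B/(S+B)
double gini(double s, double b) { return (s + b > 0) ? s * b / (s + b) : 0.0; }

// separation gain of splitting a node (S,B) into (S1,B1) and (S2,B2)
double separationGain(double s, double b, double s1, double b1) {
  return gini(s, b) - gini(s1, b1) - gini(s - s1, b - b1);
}

int main() {
  // a split that isolates most of the signal in one daughter has a large gain
  std::printf("gain = %.2f\n", separationGain(100, 100, 80, 10));
  return 0;
}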
2. Boosting (AdaBoost): increase the weight of incorrectly classified events and build a new DT.
- Final classifier: a "forest" of DTs, linearly combined
- Large coefficient for a DT with small misclassification
- Improved performance and stability
BDT requires only little tuning to achieve good performance
Predictive Learning via Rule Ensembles (RuleFit)
Follows the RuleFit approach by Friedman and Popescu. The model is a linear combination of rules, where a rule is a sequence of cuts defining a region in the input parameter space.
The problem to solve:
- Create the rule ensemble: use a forest of decision trees, either from a BDT or from a random-forest generator (TMVA)
- Fit the coefficients a_m, b_k, minimizing the risk of misclassification (Friedman et al.)
Pruning removes topologically equal rules (same variables in the cut sequence).
Friedman and Popescu, Tech. Rep., Statistics Dept., Stanford U., 2003
RuleFit classifier:
$$y_\mathrm{RF}(\vec{x}) = a_0 + \sum_{m=1}^{M_R} a_m\, r_m(\vec{x}) + \sum_{k=1}^{n} b_k\, \hat{x}_k$$
The first sum runs over the rules (a rule gives r_m = 1 if all of its cuts are satisfied, 0 otherwise); the second is a linear Fisher term over the normalised discriminating event variables $\hat{x}_k$.
Support Vector Machine
Find the hyperplane that separates linearly separable signal and background (idea from 1962)
Best separation: maximum distance (margin) between the hyperplane and the closest events (support vectors). Wrongly classified events add an extra term to the cost function, which is minimized.
Non-linear cases: transform the variables into a higher-dimensional space where a linear boundary (hyperplane) can again separate the data (an idea only from the mid-1990s).
The explicit form of the transformation is not required; the cost function depends only on scalar products between events: use kernel functions to represent the scalar products between the transformed vectors in the higher-dimensional space.
Choose a kernel and fit the hyperplane using the linear techniques developed above.
Available Kernels: Gaussian, Polynomial, Sigmoid
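A minimal sketch of the first of these, the Gaussian kernel, which stands in for the scalar product of the implicitly transformed vectors; the width parameter gamma and the names are illustrative.

#include <cmath>
#include <vector>

// Gaussian kernel: K(x, y) = <phi(x), phi(y)> = exp(-gamma * |x - y|^2)
double gaussKernel(const std::vector<double>& x, const std::vector<double>& y,
                   double gamma) {
  double d2 = 0.0;
  for (size_t i = 0; i < x.size(); ++i)
    d2 += (x[i] - y[i]) * (x[i] - y[i]);
  return std::exp(-gamma * d2);
}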
[Figure: left, separable data in the (x1, x2) plane with the optimal hyperplane, margin, and support vectors; right, non-separable data mapped by φ(x1, x2) into a higher-dimensional space (x1, x2, x3) where a separating hyperplane exists]
Data Preprocessing: Decorrelation
Various classifiers perform sub-optimally in the presence of correlations between input variables (Cuts, projective LH); others become slower (BDT, RuleFit).
Removal of linear correlations by rotating the input variables:
- Determine the square root C′ of the covariance matrix C, i.e., C = C′C′
- Transform the original (x_i) into the decorrelated variable space (x_i′) by x′ = C′⁻¹x
Also implemented Principal Component Analysis (PCA)
Note that decorrelation is only complete if the correlations are linear and the input variables are Gaussian distributed; in general this is not a very accurate assumption. A sketch of the square-root transformation follows below.
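A minimal sketch of the square-root decorrelation using ROOT's linear-algebra classes; it follows the formula above via an eigen decomposition (C′⁻¹ = V diag(1/√λ) Vᵀ), though TMVA's internal implementation may differ in details.

#include "TMatrixD.h"
#include "TMatrixDSym.h"
#include "TMatrixDSymEigen.h"
#include "TVectorD.h"
#include <cmath>

// Decorrelate an input vector x with the symmetric square root of the
// covariance matrix C: x' = C'^{-1} x, where C = C'C'.
TVectorD decorrelate(const TMatrixDSym& cov, const TVectorD& x) {
  TMatrixDSymEigen eigen(cov);
  TMatrixD V      = eigen.GetEigenVectors();   // eigenvectors of C
  TVectorD lambda = eigen.GetEigenValues();    // eigenvalues of C
  TMatrixD D(V.GetNrows(), V.GetNcols());      // diag(1/sqrt(lambda))
  for (int i = 0; i < D.GetNrows(); ++i) D(i, i) = 1.0 / std::sqrt(lambda(i));
  TMatrixD Vt(TMatrixD::kTransposed, V);
  TMatrixD CprimeInv = V * D * Vt;             // C'^{-1}
  return CprimeInv * x;
}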
[Figure: scatter plots of the original, square-root decorrelated, and PCA-decorrelated variables]
Is There a Best Classifier?
- Performance: in the presence/absence of linear/nonlinear correlations
- Speed: training/evaluation time
- Robustness, stability: sensitivity to overtraining and to weak input variables; size of the training sample
- Dimensional scalability: do performance, speed, and robustness deteriorate with many dimensions?
- Clarity: can the learning procedure/result be easily understood/visualized?
No Single Best
[Table: classifiers (Cuts, Likelihood, PDERS/k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit, SVM) rated against the criteria above: performance (for no/linear correlations and for nonlinear correlations), speed (training, response), robustness (overtraining, weak input variables), curse of dimensionality, clarity]
What is TMVA
Motivation:
Classifiers perform very differently depending on the data; all should be tested on a given problem.
The situation for many years: usually only a small number of classifiers were investigated by analysts. A tool was needed that enables the analyst to simultaneously evaluate the performance of a large number of classifiers on his/her dataset.
Design Criteria: Performance and Convenience (A good tool does not have to be difficult to use)
- Training, testing, and evaluation of many classifiers in parallel
- Preprocessing of input data: decorrelation (PCA, Gaussianization)
- Illustrative tools to compare the performance of all classifiers (ranking of classifiers, ranking of input variables, choice of working point)
- Active protection against overtraining
- Straightforward application to test data
Special needs of high-energy physics are addressed: two classes, event weights, familiar terminology.
A typical TMVA analysis consists of two main steps:
1. Training phase: training, testing, and evaluation of classifiers using data samples with known signal and background composition
2. Application phase: using selected trained classifiers to classify unknown data samples
Using TMVA
Technical Aspects
TMVA is open source, written in C++, and based on (and part of) ROOT
Development on SourceForge, where all the information is available: http://sf.tmva.net. Bundled with ROOT since 5.11-03.
Training requires a ROOT environment; the resulting classifiers are also available as standalone C++ code.
- Six core developers, many contributors
- > 1400 downloads since March 2006 (not counting ROOT users)
- Mailing list for reporting problems
Users Guide at http://sf.tmva.net: 97 pages, classifier descriptions, code examples
arXiv physics/0703039
Training with TMVA
The user usually starts with the template TMVAnalysis.C:
- Choose the training variables
- Choose the input data
- Select classifiers (and adjust their training options, described in the manual; specifying the option 'H' prints help)
Template TMVAnalysis.C (also .py) available at $TMVA/macros/ and $ROOTSYS/tmva/test/
TMVA GUI
Evaluation results ranked by best signal efficiency and purity (area)
------------------------------------------------------------------------------
MVA           Signal efficiency at bkg eff. (error):        | Sepa-   Signifi-
Methods:      @B=0.01    @B=0.10    @B=0.30    Area         | ration: cance:
------------------------------------------------------------------------------
Fisher      : 0.268(03)  0.653(03)  0.873(02)  0.882        | 0.444   1.189
MLP         : 0.266(03)  0.656(03)  0.873(02)  0.882        | 0.444   1.260
LikelihoodD : 0.259(03)  0.649(03)  0.871(02)  0.880        | 0.441   1.251
PDERS       : 0.223(03)  0.628(03)  0.861(02)  0.870        | 0.417   1.192
RuleFit     : 0.196(03)  0.607(03)  0.845(02)  0.859        | 0.390   1.092
HMatrix     : 0.058(01)  0.622(03)  0.868(02)  0.855        | 0.410   1.093
BDT         : 0.154(02)  0.594(04)  0.838(03)  0.852        | 0.380   1.099
CutsGA      : 0.109(02)  1.000(00)  0.717(03)  0.784        | 0.000   0.000
Likelihood  : 0.086(02)  0.387(03)  0.677(03)  0.757        | 0.199   0.682
------------------------------------------------------------------------------
Testing efficiency compared to training efficiency (overtraining check)
------------------------------------------------------------------------------
MVA           Signal efficiency: from test sample (from training sample)
Methods:      @B=0.01            @B=0.10            @B=0.30
------------------------------------------------------------------------------
Fisher      : 0.268 (0.275)      0.653 (0.658)      0.873 (0.873)
MLP         : 0.266 (0.278)      0.656 (0.658)      0.873 (0.873)
LikelihoodD : 0.259 (0.273)      0.649 (0.657)      0.871 (0.872)
PDERS       : 0.223 (0.389)      0.628 (0.691)      0.861 (0.881)
RuleFit     : 0.196 (0.198)      0.607 (0.616)      0.845 (0.848)
HMatrix     : 0.058 (0.060)      0.622 (0.623)      0.868 (0.868)
BDT         : 0.154 (0.268)      0.594 (0.736)      0.838 (0.911)
CutsGA      : 0.109 (0.123)      1.000 (0.424)      0.717 (0.715)
Likelihood  : 0.086 (0.092)      0.387 (0.379)      0.677 (0.677)
------------------------------------------------------------------------------
Evaluation Output (better classifiers rank higher in the tables above)

Remark on overtraining:
- Occurs when the classifier training becomes sensitive to the events of the particular training sample, rather than just to its generic features
- Sensitivity to overtraining depends on the classifier: e.g., Fisher is insensitive, BDTs are very sensitive
- Detect overtraining: compare performance between training and test samples
- Counteract overtraining: e.g., smooth likelihood PDFs, prune decision trees, …
More Evaluation Output
--- Fisher : Ranking result (top variable is best ranked)
--- Fisher : ---------------------------------------------
--- Fisher : Rank : Variable : Discr. power
--- Fisher : ---------------------------------------------
--- Fisher : 1    : var4     : 2.175e-01
--- Fisher : 2    : var3     : 1.718e-01
--- Fisher : 3    : var1     : 9.549e-02
--- Fisher : 4    : var2     : 2.841e-02
--- Fisher : ---------------------------------------------
(better variables rank higher)
--- Factory : Inter-MVA overlap matrix (signal):
--- Factory : ------------------------------
--- Factory :             Likelihood  Fisher
--- Factory : Likelihood:     +1.000  +0.667
--- Factory : Fisher:         +0.667  +1.000
--- Factory : ------------------------------
Input Variable Ranking
• How useful is a variable?

Classifier correlation and overlap
• Do the classifiers perform the same separation into signal and background? If two classifiers have similar performance but significantly non-overlapping classifications, check whether you can combine them!
Graphical Evaluation
Classifier output distributions for the independent test sample:
Graphical Evaluation
There is no unique way to express the performance of a classifier, so TMVA computes several benchmark quantities:
Signal efficiency at various background efficiencies (= 1 − rejection) when cutting on the classifier output
The separation:
$$\langle S^2 \rangle = \frac{1}{2} \int \frac{\big(\hat{y}_S(y) - \hat{y}_B(y)\big)^2}{\hat{y}_S(y) + \hat{y}_B(y)}\, dy$$
The "Rarity" is implemented (background flat): comparison of signal shapes between different classifiers; quick check: the background on data should be flat.
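The separation integral above is easy to compute numerically from binned classifier-output distributions. A minimal sketch, assuming yS and yB hold the PDF values (densities, unit-normalized) in bins of common uniform width:

#include <vector>

// <S^2> from binned, unit-normalized output distributions yS and yB
double separation(const std::vector<double>& yS, const std::vector<double>& yB,
                  double binWidth) {
  double sep = 0.0;
  for (size_t i = 0; i < yS.size(); ++i) {
    double sum = yS[i] + yB[i];
    if (sum > 0) sep += (yS[i] - yB[i]) * (yS[i] - yB[i]) / sum;
  }
  return 0.5 * sep * binWidth;   // the 1/2 and the dy of the integral
}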
Visualization Using the GUI
Projective likelihood PDFs, MLP training, BDTs, …
average no. of nodes before/after pruning: 4193 / 968
Choosing a Working Point
Depending on the problem, the user might want to:
- Achieve a certain signal purity, signal efficiency, or background reduction, or
- Find the selection that results in the highest signal significance (depending on the expected signal and background statistics)
Using the TMVA graphical output, one can determine at which classifier output value to cut to separate signal from background; a minimal scan is sketched below.
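A minimal sketch of such a working-point scan (not a TMVA function): given expected yields and the efficiency curves epsS(cut), epsB(cut) read off from TMVA's output, pick the cut maximizing S/sqrt(S+B). The significance choice and all names are illustrative.

#include <cmath>
#include <vector>

// return the cut value that maximizes S/sqrt(S+B)
double bestCut(const std::vector<double>& cuts,
               const std::vector<double>& epsS, const std::vector<double>& epsB,
               double nSigExp, double nBkgExp) {
  double best = cuts.front(), bestSig = -1.0;
  for (size_t i = 0; i < cuts.size(); ++i) {
    double s = nSigExp * epsS[i], b = nBkgExp * epsB[i];
    double sig = (s + b > 0) ? s / std::sqrt(s + b) : 0.0;
    if (sig > bestSig) { bestSig = sig; best = cuts[i]; }
  }
  return best;
}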
Applying the Trained Classifier
Use the TMVA::Reader class; an example is in TMVApplication.C:
- Set the input variables
- Book the classifier with the weight file (contains all information)
- Compute the classifier response inside the event loop and use it
Also standalone C++ class without ROOT dependence
Templates TMVApplication.C and ClassApplication.C available at $TMVA/macros/ and $ROOTSYS/tmva/test/
std::vector<std::string> inputVars;
…
classifier = new ReadMLP( inputVars );

for (int i = 0; i < nEv; i++) {
  std::vector<double> inputVec = …;   // fill with this event's variable values
  double retval = classifier->GetMvaValue( inputVec );
}
from ClassApplication.C
Extending TMVA
A user might have their own implementation of a multivariate classifier, or might want to use an external one.
With ROOT 5.18.00 (16 Jan 2008), the user can seamlessly evaluate and compare their own classifier within TMVA:
Requirement: the user's class must derive from TMVA::MethodBase and implement the TMVA::IMethod interface
The class must be added to the factory via ROOT’s plugin mechanism
Training, testing, evaluation, and comparison can then be done as usual.
Example in TMVAnalysis.C
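A hedged sketch of what that plugin registration might look like with ROOT's plugin manager; the handler string, class name, library name, and constructor signature below are illustrative placeholders, not verified against TMVAnalysis.C.

#include "TROOT.h"
#include "TPluginManager.h"

void registerMyMethod() {
  gROOT->GetPluginManager()->AddHandler(
      "TMVA@@MethodBase",        // base-class tag used by the TMVA factory (assumed)
      ".*_MyMethod.*",           // regexp matched against the method name (assumed)
      "TMVA::MethodMyMethod",    // user class deriving from TMVA::MethodBase
      "MyMethodLib",             // shared library containing the class (illustrative)
      "MethodMyMethod(TString,TString,DataSet&,TString)");  // ctor signature (illustrative)
}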
A Word on Treatment of Systematics?
Some things could be done. Example: var4 may in reality have a shifted central value and hence a worse discrimination power.
One can ignore the systematic in the training:
- var4 appears stronger in training than it might be
- suboptimal performance (bad training, not wrong)
- the classifier response will strongly depend on var4, and hence will have a larger systematic uncertainty
Better: Train with shifted (weakened) var4
Then evaluate systematic error on classifier output
There is no difference in principle between systematics evaluation for single discriminating variables and for MV classifiers.
Use a control sample to estimate the uncertainty on the classifier output (not necessarily for each input variable).
• Advantage: correlations automatically taken into account
Concluding Remarks
Multivariate classifiers are no black boxes; we just need to understand them.
- Cuts and Likelihood are transparent: if they perform well, use them
- In the presence of correlations, other classifiers are better, though harder to understand at any rate
Enormous growth of acceptance in HEP over the recent decade
TMVA provides means to train, evaluate, compare, and apply different classifiers
TMVA also tries, through visualization, to improve the understanding of the internals of each classifier.
Acknowledgments: The fast development of TMVA would not have been possible without the contribution and feedback from many developers and users to whom we are indebted. We thank in particular the CERN Summer students Matt Jachowski (Stanford) for the implementation of TMVA's new MLP neural network, Yair Mahalalel (Tel Aviv) for a significant improvement of PDERS, and Or Cohen for the development of the general classifier boosting, the Krakow student Andrzej Zemla and his supervisor Marcin Wolter for programming a powerful Support Vector Machine, as well as Rustem Ospanov for the development of a fast k-NN algorithm. We are grateful to Doug Applegate, Kregg Arms, René Brun and the ROOT team, Tancredi Carli, Zhiyi Liu, Elzbieta Richter-Was, Vincent Tisserand and Alexei Volk for helpful conversations.
Outlook
Primary development since this summer: generalized classifiers
1. Be able to boost or bag any classifier
2. Combine any classifier with any other classifier, using any combination of input variables in any phase-space region
Item 1 is ready and now in testing mode; to be deployed after the upcoming ROOT release.
A Few Toy Examples
Checker Board Example
- Performance achieved without parameter tuning: PDERS and BDT are the best "out of the box" classifiers
- After specific tuning, SVM and MLP also perform well
Theoretical maximum
Linear, Cross, and Circular Correlations
Illustrate the behavior of linear and nonlinear classifiers:
Linear correlations(same for signal and background)
Linear correlations(opposite for signal and background)
Circular correlations(same for signal and background)
Linear, Cross, and Circular Correlations
Plot of test events weighted by classifier output (red: signal-like, blue: background-like):
Linear correlations(same for signal and background)
Cross-linear correlations(opposite for signal and background)
Circular correlations(same for signal and background)
[Panel labels: Likelihood, Likelihood-D, PDERS, Fisher, MLP, BDT]
Final Performance
Background rejection versus signal efficiency curves:
[Panels: Linear Example, Cross Example, Circular Example]
Additional Information
Stability with Respect to Irrelevant Variables
Toy example with 2 discriminating and 4 non-discriminating variables:
- use only the two discriminating variables in the classifiers
- use all six variables in the classifiers
TMVAnalysis.C Script for Training

void TMVAnalysis( ) {
  TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );

  // create Factory
  TMVA::Factory *factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

  // give training/test trees
  TFile *input = TFile::Open("tmva_example.root");
  factory->AddSignalTree    ( (TTree*)input->Get("TreeS"), 1.0 );
  factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );

  // register input variables
  factory->AddVariable("var1+var2", 'F');
  factory->AddVariable("var1-var2", 'F');
  factory->AddVariable("var3", 'F');
  factory->AddVariable("var4", 'F');

  factory->PrepareTrainingAndTestTree("", "NSigTrain=3000:NBkgTrain=3000:SplitMode=Random:!V");

  // select MVA methods
  factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
                       "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
  factory->BookMethod( TMVA::Types::kMLP, "MLP",
                       "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

  // train, test and evaluate
  factory->TrainAllMethods();
  factory->TestAllMethods();
  factory->EvaluateAllMethods();

  outputFile->Close();
  delete factory;
}
TMVApplication.C Script for Application

void TMVApplication( ) {
  // create Reader
  TMVA::Reader *reader = new TMVA::Reader("!Color");

  // register the variables
  Float_t var1, var2, var3, var4;
  reader->AddVariable( "var1+var2", &var1 );
  reader->AddVariable( "var1-var2", &var2 );
  reader->AddVariable( "var3", &var3 );
  reader->AddVariable( "var4", &var4 );

  // book classifier(s)
  reader->BookMVA( "MLP classifier", "weights/MVAnalysis_MLP.weights.txt" );

  // prepare event loop
  TFile *input = TFile::Open("tmva_example.root");
  TTree* theTree = (TTree*)input->Get("TreeS");
  // … set branch addresses for user TTree
  for (Long64_t ievt = 3000; ievt < theTree->GetEntries(); ievt++) {
    theTree->GetEntry(ievt);

    // compute input variables
    var1 = userVar1 + userVar2;
    var2 = userVar1 - userVar2;
    var3 = userVar3;
    var4 = userVar4;

    // calculate classifier output
    Double_t out = reader->EvaluateMVA( "MLP classifier" );
    // do something with it …
  }
  delete reader;
}