
Summer 2016 Internship – TopFitter Parallel Scan

    Thomas Fletcher

    [email protected] [email protected]

Abstract: The TopFitter program calculates constraints on higher-dimensional operators modelling deviations from the Standard Model, specifically with regard to top quarks. The aim of the summer project was to create a version of TopFitter which could be massively parallelised on GPUs. The goal was accomplished, and a 3.5× speedup was achieved compared with the original version running on the data used in previously published papers. At the end of the internship there remains an unidentified bug when running on the largest data set, while a considerably more efficient version is formally complete but not yet working. As a separate task, an analysis of events generated with models increasingly divergent from the SM yielded the result that interference with SM interactions prevents the detection of non-SM effects.

    1. Introduction

1.1. Beyond the Standard Model

In the search for Beyond-Standard-Model (BSM) implementations of electroweak symmetry breaking, all data produced by the Large Hadron Collider (LHC) is usually parametrised with model-independent parameters representing its deviation from Standard Model predictions[1][2].

So far the data has consistently matched these predictions (although without definitively excluding new degrees of freedom at those energies), leading to the conclusion that, if present, larger deviations will have to occur at higher energies.

In trying to parametrise all BSM interactions, the SM Lagrangian becomes just the first term in an infinite series of Lagrangian terms constructed from SM operators, constituting an Effective Lagrangian ℒeff. Note that terms of higher mass dimension (not space-time dimension) are suppressed at high energy scales, represented by Λ in Equation 1 below:

\mathcal{L}_{\text{eff}} = \mathcal{L}_{\text{SM}} + \frac{1}{\Lambda}\,\mathcal{L}_1 + \frac{1}{\Lambda^2}\,\mathcal{L}_2 + \frac{1}{\Lambda^3}\,\mathcal{L}_3 + \cdots

Equation 1: Effective Field Lagrangian [1][2]

Modelling the new physics with an infinite series of higher-dimensional effective operators is an approach which, among others (such as anomalous couplings)[1][2], has the advantage of being completely general, allowing the exploration of new physical effects without depending on specific models covering wider spectra than required (because of the suppression of the higher-energy terms)[2], and also that of preserving the SM SU(3) × SU(2) × U(1) gauge symmetry (because the terms are combinations of SM operators)[1].

    Furthermore, the infinite series collapses to a manageable finite number of terms by choosing a dimension to model, making the simple assumptions of minimal flavour violation and baryon number conservation and focusing on a specific set of observables[1][2].

The number of operators for dimension six (where the relevant leading ℒeff contributions appear), with the above (and further)[2] assumptions taken into account and focusing on top quarks, is just 14, giving the Effective Lagrangian the form in Equation 2 below:

\mathcal{L}_{\text{eff}} = \mathcal{L}_{\text{SM}} + \frac{1}{\Lambda^2}\sum_i C_i\,O_i + \mathcal{O}\!\left(\Lambda^{-4}\right)

Equation 2: Specific Effective Field Lagrangian, where the C_i are arbitrary Wilson coefficients and the O_i the 14 relevant operators [1][2]

1.2. The TopFitter Collaboration

Given the great abundance of top quark data from the LHC and the Tevatron, and the important role top quarks play in most Standard Model deviations, the TopFitter Collaboration was set up to compute constraints on the operators which contribute to top quark events.

The Collaboration's previous work constrained dimension-six operators contributing to single and pair top quark production; the number of relevant operators (14), although greatly narrowed down by the aforementioned (and further) assumptions and choices, was still not manageable by the original TopFitter, which had to set at least half of them to 0 in order to be able to run.[1][2]

Moving forward with this research, we cannot afford to ignore further dimensions, but the computational scaling represents a significant obstacle; this is why the original code needed to be optimised and then either run on supercomputers or heavily parallelised and run on GPUs.

1.3. Data flow from source to TopFitter

The data TopFitter works on comes from a multi-step data flow through software packages performing Monte Carlo event generation and analyses, as shown in Diagram 1 below:

    Diagram 1: Data Flow into TopFitter

    2. Side Project: Colour Flow Analysis

2.1. Non-Standard Colour Flow

A separate task in the project was that of analysing event files in order to confirm that they indeed contained the intended non-standard colour flows, which were supposed to be generated by Monte Carlo engines using models increasingly divergent from the Standard Model.

Obviously, colour has to be conserved between the inputs and outputs of Feynman diagram vertices, in the same way baryon number and similar quantities are; non-standard colour flow occurs when the total colour before and after an event is not conserved, or when it is conserved but the pairings and allocations of colours among the outputs do not reflect the standard event reconstruction.

A comparison of an SM and a non-SM event resulting in non-SM colour flow can be seen in Diagram 2 below. The first Feynman diagram has a vertex producing a bottom quark and a (colourless) W+ boson, which in turn produces a colour-matched quark pair; the second diagram has a black-box vertex in its place, directly outputting the bottom quark and a non-colour-matched quark pair (note that it is instead the bottom quark which matches one of the pair quarks' colours; regardless of the match, total colour is not conserved).

    Diagram 2: Comparison of Standard and Non-Standard events with an emphasis on colour flow

(See the colour-coded legend at the top of the diagram for the colour indices)

2.2. Analysis Results

A C++ program making use of the LHEF library[3] was written in order to isolate only the relevant pairs of quarks from each event and then analyse their colours.
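Purely as an illustration of the kind of check such a tool performs (this is not the project's actual code), the following Python sketch verifies colour-line conservation from the (status, colour, anticolour) tags that LHE event files attach to each particle; all names are hypothetical:

from collections import Counter

def colour_flow_is_conserved(particles):
    """particles: list of (status, colour, anticolour) tuples, where
    status is -1 for incoming and +1 for outgoing particles, and the
    colour entries are integer colour-line labels (0 = uncoloured)."""
    lines = Counter()
    for status, col, acol in particles:
        # crossing an incoming particle into the final state swaps its tags
        c, a = (acol, col) if status == -1 else (col, acol)
        if c:
            lines[c] += 1  # this particle opens colour line c
        if a:
            lines[a] -= 1  # this particle closes colour line a
    # conserved flow: every colour line is opened exactly as often as closed
    return all(n == 0 for n in lines.values())

# example: incoming gluon -> outgoing quark-antiquark pair (conserved)
print(colour_flow_is_conserved([(-1, 501, 502), (1, 501, 0), (1, 0, 502)]))  # True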

Two kinds of event files were fed to the program: some generated from non-Standard Models only, and others generated from both Standard and non-Standard Models at the same time.

Many instances of the expected non-Standard colour flows were found in the former but, surprisingly, none were found in the latter.

    A brief discussion with the Theory Group confirmed that this empirically found absence of non-SM colour flows is backed by theory: the Standard Model interactions interfere heavily with the non-SM ones, effectively preventing their effects from manifesting.

    This implies that the flows in question will never be observed, meaning that a different approach will have to be used to test these non-SM models.

    3. Project Steps

3.1. TopFitter Structure

The general structure of the main TopFitter script (both before and after parallelisation) is the following:

1. Extract data from the input files and package it into useful objects
2. Choose a pair of dimensions to slice through the given n-dimensional space
3. Generate a 2D slice grid in the chosen dimensions for each pixel of the intended output
4. Pre-scan each grid point with a marginalisation function in order to find local minima
5. Find the global minimum starting from the smallest local minimum

Step 4 involves picking evenly spaced points in the resulting (n-2)-dimensional space and computing the chi squared function on each of them, making it the most computationally expensive step and therefore the parallelisation target.
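As a minimal sketch (with hypothetical names, and a generic chi2 callable standing in for the real marginalisation machinery), the pre-scan of step 4 amounts to the following:

import itertools
import numpy as np

def prescan(chi2, mins, maxs, i, j, xi, xj, S=5):
    """Fix dimensions i and j of the slice at (xi, xj), lay an evenly
    spaced lattice of S points per axis over the remaining n-2
    dimensions and evaluate chi2 at every point, keeping the smallest."""
    n = len(mins)
    free = [d for d in range(n) if d not in (i, j)]
    axes = [np.linspace(mins[d], maxs[d], S) for d in free]
    best_val, best_pt = np.inf, None
    for combo in itertools.product(*axes):  # S**(n-2) points in total
        pt = np.empty(n)
        pt[i], pt[j] = xi, xj
        pt[free] = combo
        val = chi2(pt)
        if val < best_val:
            best_val, best_pt = val, pt
    return best_val, best_pt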

3.2. Complexity and Scaling

Table 1 below shows the step-by-step algorithmic complexity of TopFitter's scans, i.e. how quickly the computation grows with increasing input size.

Definition                                      Expression
Dimensions                                      D
Pixels = Number of Slice Grids                  P
Scanned Points per Dimension (= 5 by default)   S
Number of Slice Grid Points                     N = S^(D-2)
Operations per Point                            O = f((D-2)!)
Total Operations                                T = P · N · O = P · S^(D-2) · f((D-2)!)

Table 1: Algorithmic complexity of TopFitter's scanning step

    For reference, using P = 121 and S = 5:

D = 7:  N = 3125 and O = f(120), making T = 378125 · f(120)
D = 12: N = 9765625 and O = f(3628800), making T = 1181640625 · f(3628800)
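These figures follow directly from the expressions in Table 1 and can be checked in a few lines:

P, S = 121, 5            # pixels and scanned points per dimension
for D in (7, 12):
    N = S ** (D - 2)     # slice grid points
    print(D, N, P * N)   # 7 -> 3125, 378125; 12 -> 9765625, 1181640625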

It is obvious that an algorithmic complexity of an exponential times a function of a factorial leads to impossibly long computations extremely quickly, and even the GPU used for this project (6 GB of RAM, 2816 cores, 1.19 GHz maximum clock rate and 1024 maximum threads per block) will struggle at D = 12.

3.3. Used Frameworks

TopFitter is written in Python and makes use of the Professor2 package[4] (which was developed alongside TopFitter and shares some code with it): a tuning tool for Monte Carlo event generators written in Python, Cython and C++, used here to extract data from the statistical-analysis input files and present it as a variety of easy-to-query histogram objects.

The function marginalisation for each data point is carried out with the iMinuit package, which needs very specific inputs (such as variable names known in advance), restricting the flexibility of the whole codebase.

The parallelisation over GPU cores is made possible by the PyOpenCL package[5], an interface to OpenCL drivers for multi-core hardware; the useful features of PyOpenCL are:

- It interfaces very well with the Numpy package (which TopFitter already makes use of), allowing easy transfer of Python data onto the parallel device
- It gets rid of most of the boilerplate code of the equivalent OpenCL C code, handling all the device setup and environment details
- It provides a few common parallel-algorithm-building tools which leave only the innermost kernel of computation to be written by the user (see the sketch below)
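For illustration, here is a minimal, self-contained PyOpenCL example of those three features (independent of TopFitter, with a trivial stand-in kernel):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context()    # device setup handled by PyOpenCL
queue = cl.CommandQueue(ctx)

# Numpy data moves onto the device in a single call
x = cl_array.to_device(queue, np.linspace(0.0, 1.0, 10**6))
y = cl_array.empty_like(x)

# only the innermost C expression has to be written by the user
square = ElementwiseKernel(
    ctx, "double *x, double *y", "y[i] = x[i] * x[i]", "square")
square(x, y)

result = y.get()                  # back to a Numpy array on the host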

OpenCL itself imposes many restrictions on the C code that can run on the device, the most important ones being:

- Pointers to pointers (most importantly multidimensional arrays) are not allowed, meaning that the user has to flatten data structures (PyOpenCL takes care of that automatically for Numpy arrays) and then index into them with modulo arithmetic.
- Variable-length declarations are not allowed, meaning that, if necessary, the user has to use the precompiler or string substitution from the Python side in order to get around this (both workarounds are sketched below).
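The following is a simplified stand-in for what kernelCode.py does (identifiers hypothetical), with a fixed array length injected from the Python side and a flattened 2D array indexed by stride arithmetic:

poly_dim = 7  # known at kernel-build time, so it can be baked into the source

kernel_src = """
__kernel void read_coords(__global const double *aGrid,
                          __global double *aOut)
{
    int gid = get_global_id(0);
    double coords[POLY_DIM];                    /* fixed length: no VLAs */
    for (int d = 0; d < POLY_DIM; ++d)
        coords[d] = aGrid[gid * POLY_DIM + d];  /* flattened 2D indexing */
    aOut[gid] = coords[0];
}
""".replace("POLY_DIM", str(poly_dim))  # the Python-side 'precompiler'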

3.4. Generalisation to N dimensions

The original TopFitter had hardcoded blocks dealing with each specific number of input dimensions, because of the aforementioned iMinuit requirement that inputs be known in advance; the first task was therefore to generalise the code to N dimensions.

This consisted in procedurally generating variable names and argument counts, with the interface to iMinuit becoming a locally generated and executed code string containing a function declaration with all the required behaviours, while in fact delegating to more generic functions; a sketch of the technique follows below.
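A minimal sketch (all names hypothetical): generate the source of a wrapper whose signature spells out the N variable names iMinuit expects, then exec it so that its body simply forwards to a generic function:

def make_chi2_wrapper(chi2_generic, n_dims):
    """Build a function with n_dims named arguments (c0, c1, ...) so
    that iMinuit's introspection sees explicit variable names."""
    names = ["c%d" % i for i in range(n_dims)]
    src = ("def chi2_wrapper({args}):\n"
           "    return chi2_generic([{args}])\n").format(args=", ".join(names))
    namespace = {"chi2_generic": chi2_generic}
    exec(src, namespace)  # compile and run the generated declaration
    return namespace["chi2_wrapper"]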

3.5. Data Extraction, Caching and Transfer to GPU

The marginalisation step uses data coming from two different Professor2 histogram objects: DataHisto and IpolHisto. The former contains static data for each bin, while the latter contains all the interpolation information required to calculate a value for each n-dimensional coordinate tuple; calling, for example, the value method on an IpolBin with the coordinates as arguments therefore triggers a series of computations going through Python, Cython, C++ and back again.

While the original TopFitter could afford to extract or calculate each data item from the Professor2 histogram objects in the same cycle as the marginalisation, the parallel version cannot, because all the data has to be cached in GPU memory to be used by each core independently.

Since OpenCL does not allow pointers to pointers, the only ways to store the required data are arrays (flattened where necessary) or some form of OpenCL-compliant C structs (which are allowed).

The former is simpler and was therefore chosen as the preferred method. In the (common) case of multiple histograms in the input files, all the data is concatenated into a single array per item type, simplifying the parallelisation process; the length of these arrays is therefore just the total number of bins irrespective of histograms, and its value is referred to as binsLen in the code.

If the --parallel flag is detected, TopFitter needs to extract all the required data and cache it on the parallel device. For the DataBin objects this is straightforward method calling; for the IpolBin objects, in order to be computationally efficient, instead of using their value methods, some internal IpolBin data structures not originally exposed by the API were made accessible in a newer version of Professor2, specifically for the benefit of the parallel TopFitter implementation. This allows caching all the constant data items required to compute the equivalent of the IpolBin values, with the coordinates as the only remaining variables.

All the extracted data is either stored as Numpy arrays first and then transferred onto the device, or generated directly on the device if its size is known in advance; the only data types used are Numpy's own intc and float64, as they are guaranteed to be the equivalents of C's int and double. A sketch of this staging follows below.
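In PyOpenCL terms this amounts to something like the following (assuming an existing context ctx, extracted Python lists db_vals and errs_nums, and a precomputed grid_len; the array names echo the list that follows, everything else is hypothetical):

import numpy as np
import pyopencl as cl

mf = cl.mem_flags

# host-side staging with C-compatible dtypes
aDBVals = np.asarray(db_vals, dtype=np.float64)    # C double
aErrsNums = np.asarray(errs_nums, dtype=np.intc)   # C int

# copy the inputs onto the device as read-only buffers
dbvals_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=aDBVals)

# the result array is allocated directly on the device: its size is known
chi2s_buf = cl.Buffer(ctx, mf.WRITE_ONLY,
                      size=grid_len * np.dtype(np.float64).itemsize)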

The final arrays transferred to the device are the following (a leading "a" indicates that the variable is an array; from TopFitter/tf/kernelCode.py):

# Array lengths:
# aChi2s: gridLen
# aGrid: 2D Array (gridLen x polyDim)
# aCoorMins, aCoorMaxs: polyDim
# aDBVals, aDBErrs, aMaxErrs, aIpolRelErrs: binsLen
# aPolyCoeffss: 2D Array (binsLen x polyLen)
# aPolyStruct: 2D Array (polyLen x polyDim)
# aErrsNums: binsLen
# aErr0Coeffss: 2D Array (binsLen x err0Len)
# aErr0Struct: 2D Array (err0Len x polyDim)
# aErr1Coeffss: 2D Array (binsLen x err1Len)
# aErr1Struct: 2D Array (err1Len x polyDim)

- aChi2s is the result array, containing the chi squared value for each slice grid point (gridLen items in total)
- aGrid is the array of N-tuples of coordinates, polyDim being the dimension of the interpolation polynomial (= N)
- aCoorMins, aCoorMaxs are the IpolHisto minimum and maximum coordinate values
- aDBVals, aDBErrs, aMaxErrs and aIpolRelErrs are the readily available DataBin values
- aPolyCoeffss is the list of each of the polyLen polynomial term coefficients for each of the binsLen IpolBin objects' interpolation polynomials
- aPolyStruct is a list of lists of 0s and 1s representing whether each of the polyDim coordinates is a factor of each of the polyLen interpolation polynomial coefficients; this structure is shared by all bins
- aErrsNums is a list of 0s, 1s or 2s representing the number of error interpolation polynomials for each of the binsLen IpolBins
- aErr0Coeffss, aErr0Struct, aErr1Coeffss and aErr1Struct are the same structures as aPolyCoeffss and aPolyStruct, but for the error interpolation polynomials, of which there may be 0, 1 or 2.

All the scalar values used in the computations, including the lengths of the above arrays, are not passed to the kernel as arguments: in order to reduce I/O, they are instead procedurally hardcoded into the C code strings, in the same way the precompiler would use #define directives (as in the string-substitution sketch above).

3.6. Parallel Kernel and Preamble

Having cached all the internal Professor2 IpolBin data structures as arrays, the Python, Cython and C++ computations making use of them also had to be replicated in OpenCL-restricted C and implemented in the parallel kernel.

The main structural differences between the original code and the kernel's C stem from the use of modulo arithmetic in order to reach specific elements of flattened multidimensional arrays.

In the end, the parallel kernel replicates the whole normal TopFitter scan, including all the Professor2 background calculations of IpolBin values.

The current version of the program implements a map-chi2-then-find-minimum algorithm, meaning that the chi squared results are computed in parallel and stored on the device, and once they are all done a second pass finds their minimum, as sketched below.
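Schematically, with the kernel and buffer names hypothetical (and the minimum taken on the host for simplicity), the two passes look like this:

import numpy as np
import pyopencl as cl

# map step: one work-item per slice grid point writes its chi squared
program.computeChi2s(queue, (grid_len,), None,
                     grid_buf, dbvals_buf, chi2s_buf)

# second pass: retrieve the results and find their minimum
aChi2s = np.empty(grid_len, dtype=np.float64)
cl.enqueue_copy(queue, aChi2s, chi2s_buf)
best_index = int(aChi2s.argmin())
best_chi2 = aChi2s[best_index]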

A considerably more efficient version is formally complete but not yet working (and is therefore commented out throughout the codebase); it is discussed in section 5.

    4. Project Results

4.1. Final Project File Structure

Main program: tf-scan2d-chi2 (Python, 304 lines)

- Imports: chi2.py (Python, 55 lines) [Professor2 chi squared functions]
- Imports: dataExtraction.py (Python, 103 lines) [Professor2 data extraction and histogram object building]
- Imports: debugEffects.py (Python, 61 lines) [generic debugging prints and graphs]
- Imports: parallelScanning.py (Python, 305 lines) [Professor2 histogram object data extraction and transformation for PyOpenCL, PyOpenCL context setup, data transfer to GPU, computation & result retrieval]
  - Imports: kernelCode.py (Python, 30 lines; OpenCL-restricted C, 238 lines) [parallel kernel OpenCL-restricted C code & minor precompiler instructions]

4.2. Performance comparison with original version

Table 2 below compares the original and parallel algorithm structures:

    Table 2: Algorithm structures comparison

Performance-wise, on the 7-dimensional test data the parallel version achieved a 3.5× speedup compared with the original, and the time benefit increases (asymptotically, up to a limit imposed by the GPU's specifications) with the given load, i.e. when the actual processing per core takes significantly longer than its I/O.

Unfortunately, at the end of the summer project (15/07/2016) the parallel version did not yet work on the largest (and intended) 12-dimensional data set, returning a null value for each point; there is probably some memory-related bug occurring at runtime on the GPU.

5. Conclusion & Future Steps

The project was successful, and the code is now one bug away from running on the intended data set, which the original TopFitter could not run on at all.

Apart from fixing said bug, the next obvious step is to fix a very low-level bug surfacing through OpenCL and PyOpenCL when trying to run a more efficient version of the algorithm (see section 3.6 for the current version): a map-chi2-reduce-with-minimum algorithm, in which no extra memory has to be allocated for a results array, since each value is processed as soon as it is ready. When a chi squared value is calculated in parallel, it is immediately compared to the current minimum and then either replaces it or is discarded; this will save 1/D of the grid memory and a considerable amount of I/O. A sketch of the idea follows below.
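Purely as a sketch of the idea (not the project's actual code), PyOpenCL's ReductionKernel expresses exactly this fused map-reduce pattern; here chi2_at stands in for a hypothetical device-side chi squared function defined in a preamble string, and ctx, queue, grid_len and grid_dev are assumed to exist as in the earlier sketches:

import numpy as np
import pyopencl.array as cl_array
from pyopencl.reduction import ReductionKernel

min_chi2_kernel = ReductionKernel(
    ctx, np.float64,
    neutral="INFINITY",                    # identity element for a minimum
    reduce_expr="a < b ? a : b",           # keep the smaller of each pair
    map_expr="chi2_at(aGrid, ids[i])",     # chi squared of slice grid point i
    arguments="__global const int *ids, __global const double *aGrid",
    preamble=chi2_device_code)             # defines the chi2_at(...) function

# ids only fixes the number of work items to grid_len; no intermediate
# chi squared results array is ever allocated on the device
ids = cl_array.arange(queue, grid_len, dtype=np.int32)
min_chi2 = float(min_chi2_kernel(ids, grid_dev).get())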

    Looking further ahead, TopFitter could easily become a universal tool for fitting data in parallel, beyond top quarks, perhaps being distributed along with its co-developed project Professor2.

References

[1] Buckley, A., Englert, C., Ferrando, J., Miller, D. J., Moore, L., Russell, M., and White, C. D. (2015) Global fit of top quark effective theory to data. Physical Review D, 92, 091501(R).

    [2] Buckley, A., Englert, C., Ferrando, J., Miller, D. J., Moore, L., Russell, M., and White, C. D. (2016) Constraining top quark effective theory in the LHC run II era. Journal of High Energy Physics, 2016, 15. (doi:10.1007/JHEP04(2016)015)

    [3] http://home.thep.lu.se/~leif/LHEF/

    [4] http://professor.hepforge.org/

    [5] https://mathema.tician.de/software/pyopencl/