
The Trigger and Real-time Reconstruction at LHCb
CPAD Instrumentation Frontier Workshop 2019

Daniel Craik
on behalf of the LHCb collaboration

Massachusetts Institute of Technology

9th December, 2019

Photo: User:Emery / Wikimedia Commons / CC-BY-SA-2.5

The LHCb detector

Located at point 8 of the LHC
General-purpose detector in the forward region
Specialised in studying b- and c-decays

Instrumented in the forward region to exploit forward production of c- and b-hadrons

[Figure: polar angles θ1 and θ2 [rad] of the two b quarks in simulated production, with z along the beam; LHCb MC, √s = 14 TeV]

Dan Craik (MIT) Trigger & Real-time Reconstruction @ LHCb 2019-12-09 1 / 23

The LHCb detector

Instrumentation in the forward region (2 < η < 5)
Excellent secondary vertex reconstruction
Precise tracking before and after magnet
Good PID separation up to ∼ 100 GeV/c

JINST 3 (2008) S08005


https://doi.org/10.1088/1748-0221/3/08/S08005

LHCb timeline

2010 2012 2014 2016 2018 now 2022 2024 2026 2028 2030 2032

LHC HL-LHC

Run I LS 1 Run II LS 2 Run III LS 3 Run IV LS 4 Runs V+

Phase I Upgrade: triggerless readout at 40 MHz

Phase Ib Upgrade: possible stepping stone

Phase II Upgrade: upgrade for HL-LHC

Integrated luminosity: 9 fb^-1 → 50 fb^-1 → 300 fb^-1

Belle II: 50 ab^-1

LHCb may be the only dedicated B-physics experiment

Timetable may shift


Real-time reconstruction in Run II

Run I:

Hardware trigger (high pT, ET) → 1st software trigger (partial reco) → 2nd software trigger (full reco) → Reconstruction (align + calib) → Analysis
Rates: 40 MHz → 1 MHz → 100 kHz → 5 kHz → 5 kHz
“Online”: near detector; “Offline”: grid computing
Time from collision: µs, ms, hours, weeks

Run II:

Hardware trigger (high pT, ET) → 1st software trigger (partial reco) → 9 PB buffer (real-time align + calib) → 2nd software trigger (full reco) → Analysis (Turbo)
Rates: 40 MHz → 1 MHz → 100 kHz → 100 kHz → 12 kHz
Online / Offline
Time from collision: µs, ms, hours, hours

Calibration and alignment of Run I data performed “offline” weeks after data taking
Trigger reconstruction different from offline

In Run II, data buffered before final trigger stage
Allows for real-time alignment and calibration
Offline-like reconstruction within the trigger
Many analyses use “Turbo-stream” data: online reconstruction, full raw event not saved
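The rate reductions in the two chains above can be checked with a few lines of arithmetic. The sketch below only restates the rates quoted on the slide; the stage labels and the helper names `reduction_factors` and `overall_reduction` are illustrative and not part of any LHCb software.

```python
# Stage output rates in Hz as quoted on the slide (labels are illustrative).
RUN1 = [("collisions", 40e6), ("hardware trigger", 1e6),
        ("1st software trigger", 100e3), ("2nd software trigger", 5e3),
        ("offline reconstruction", 5e3)]
RUN2 = [("collisions", 40e6), ("hardware trigger", 1e6),
        ("1st software trigger", 100e3), ("buffer + align/calib", 100e3),
        ("2nd software trigger", 12e3)]


def reduction_factors(chain):
    """Return (stage, input rate / output rate) for every stage after the first."""
    return [(name, prev_rate / rate)
            for (_, prev_rate), (name, rate) in zip(chain, chain[1:])]


def overall_reduction(chain):
    """Ratio of the collision rate to the rate leaving the last stage."""
    return chain[0][1] / chain[-1][1]


if __name__ == "__main__":
    for label, chain in (("Run I", RUN1), ("Run II", RUN2)):
        print(label, reduction_factors(chain), overall_reduction(chain))
```

With these numbers the buffer stage in Run II has a reduction factor of 1: it delays events for alignment and calibration without rejecting any.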



Real-time alignment and calibration performed for vertex locator, RICH detectors, tracking stations, calorimeter and muon stations
Will focus on VELO and RICH
Alignment particularly important for VELO, which opens and closes between fills
Gas-filled RICH detectors also require frequent calibration

Each alignment task performed once per fill
Alignment begins once a large enough dataset has been collected
Calibration of RICH gas refractive index performed regularly to account for temperature/pressure changes within the radiator gas

[Figures: LHCb Preliminary, initial versus improved (after align + calib)]


Real-time alignment of the Velo

Vertex locator modules sit 5 mm from the LHC beam
Consists of two retractable halves (one shown)
Modules formed of two sections, one on each VELO half
During beam injection, VELO retracted to 35 mm for safety
Closed once LHC beams are stable
Moves every fill → align every fill


Real-time alignment of the Velo

Alignment of VELO based on minimising residuals between hits and reconstructed tracks
Plot shows x and y translation between the two VELO halves
Tolerance of ±2 µm allowed without alignment update (empty markers)
Updates may also be caused by other degrees of freedom, e.g. offsets or rotations within a VELO half

[Figure: variation [µm] of the x-translation and y-translation (∆Tx) versus alignment number [a.u.]; LHCb VELO Preliminary, 17/04/2018 - 21/11/2018; empty markers = no update]
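The update rule described above (fit a constant from residuals, then change it only if it moved by more than the tolerance) can be sketched in a few lines. This is a toy under strong assumptions: the alignment is reduced to a single translation constant, and the names `fit_translation`, `needs_update` and `align` are hypothetical. The real alignment minimises residuals over many degrees of freedom at once.

```python
from statistics import mean

TOLERANCE_UM = 2.0  # tolerance quoted on the slide, in micrometres


def fit_translation(residuals_um):
    """Least-squares estimate of a single translation offset.

    For residuals r_i = hit_i - track_i, the offset t that minimises
    sum((r_i - t)**2) is the mean residual.
    """
    return mean(residuals_um)


def needs_update(current_um, fitted_um, tolerance_um=TOLERANCE_UM):
    """True if the fitted constant moved by more than the tolerance."""
    return abs(fitted_um - current_um) > tolerance_um


def align(current_um, residuals_um):
    """Return (constant to use, whether it was updated).

    The residuals are assumed to be measured with the current constant
    applied, so the fitted offset is a correction on top of current_um.
    """
    fitted = current_um + fit_translation(residuals_um)
    if needs_update(current_um, fitted):
        return fitted, True
    return current_um, False


if __name__ == "__main__":
    print(align(10.0, [0.5, -0.3, 1.0, 0.4]))  # small shift: constant kept
    print(align(10.0, [3.0, 2.5, 3.5]))        # large shift: constant updated
```

Keeping the old constant inside the tolerance band corresponds to the "empty markers" in the plot: the fit ran, but no update was issued.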


Real-time calibration of the RICH

[Figure: side view of the RICH detector showing the track, beam pipe, photon detectors, VELO exit window, spherical mirror, plane mirror, C4F10 radiator and magnetic shield; 250 mrad acceptance, z from 0 to 200 cm]

RICH detectors provide particle ID information based on angle of Cherenkov radiation
Index of refraction of the gas radiators sensitive to changes in temperature, pressure and composition
These features are monitored but data-driven calibration also required
Compare recorded and expected Cherenkov angles (bottom left)
Alignment of mirrors also calibrated (bottom right)

FIG. 2. Side view of the LHCb RICH detector upstream of the magnet.

system, and calibrate the refractive index of radiators and the HPD image with good precision. These factors are all time-dependent, necessitating real-time calibration and alignment of the LHCb RICH detectors, and the tracking system.

Calibration and alignment

Calibration of the refractive index of the RICH radiators

The refractive index of the gas radiators depends on the ambient temperature and pressure, and the exact composition of the gas mixture; so it can change in time. These quantities are monitored by hardware to compute an expected refractive index, but this does not have a precision that is high enough for the physics analysis, therefore it needs to be further corrected. As shown in Fig. 3, the distribution of the difference between the reconstructed and expected Cherenkov angle is fitted to obtain the shift, which is then converted to a scale factor of the expected refractive index according to studies based on simulation.
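To make the link between an angle shift and a refractive-index correction concrete, here is a minimal sketch. It does not reproduce the simulation-based conversion mentioned above. Instead it assumes the textbook Cherenkov relation cos θ = 1/(nβ) for saturated tracks (β ≈ 1) and defines the scale factor on the refractivity (n − 1); both choices, the function names, and the example index of 1.0014 are assumptions made for illustration.

```python
import math


def cherenkov_angle(n, beta=1.0):
    """Cherenkov angle in rad from cos(theta) = 1 / (n * beta)."""
    return math.acos(1.0 / (n * beta))


def refractivity_scale_factor(n_expected, delta_theta, beta=1.0):
    """Scale factor on (n - 1) that absorbs a fitted angle shift.

    delta_theta is the fitted mean of (reconstructed - expected)
    Cherenkov angle in rad. Assumption: saturated tracks and the
    textbook relation, not the simulation-based conversion.
    """
    theta_new = cherenkov_angle(n_expected, beta) + delta_theta
    n_new = 1.0 / (beta * math.cos(theta_new))
    return (n_new - 1.0) / (n_expected - 1.0)


if __name__ == "__main__":
    n_gas = 1.0014  # illustrative refractive index for a gas radiator
    print(refractivity_scale_factor(n_gas, 0.0))
    print(refractivity_scale_factor(n_gas, 0.0006738))
```

A zero shift returns a factor of 1, and a positive shift (reconstructed angles larger than expected) returns a factor slightly above 1, meaning the true refractivity is higher than the hardware-based estimate.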

About 50 Hz of events are sent to multiple online reconstruction tasks, which run in parallel, and the resulting histograms are merged at the end of each run. Then a dedicated task is used to fit the histograms merged run-by-run and produce calibration constants to be used by the RICH reconstruction in the final stage of the software trigger. The maximum run length is one hour.

Calibration of the HPD images

The Hybrid Photon Detector is used to detect Cherenkov photons. As shown in Fig. 4, the photoelectron produced at the photocathode is accelerated by a high voltage of up to 20 kV onto a reverse-biased pixellated silicon detector, with a de-magnification factor of

[Figure: Rich1Gas Rec-Exp Cktheta, all photons: entries versus delta(Cherenkov Theta) / rad from −0.008 to 0.008; Run 182551, SF 1.025672, Gaussian fit mean 0.0006738 and sigma 0.001926; Mon Aug 29 05:41:53 2016]

FIG. 3. Difference between the reconstructed and expected Cherenkov angle before the calibration.

FIG. 4. Schematic drawing of the Hybrid Photon Detector (HPD).

about 5 [6]. The HPD anode images are affected by the magnetic and electric fields, and have been observed to move and change their size, possibly due to changes in these residual fields when the high voltage is cycled each LHC fill. Such changes could degrade the reconstruction of the Cherenkov angle and affect the PID performance. Therefore the centre and radius of all the HPD images are calibrated run-by-run. Figure 5 shows the calibration process. First, the centre of the image is cleaned to eliminate ion feedback. Then a Sobel filter is used to detect the edges of the image that are fitted to determine the centre and the radius of the image, which are used by the RICH reconstruction in the final stage of the software trigger. As only the raw HPD data needs to be decoded, more than 500 Hz of events are processed run-by-run.

Alignment of the RICH mirror system

The Cherenkov photons emitted by the charged particles passing through the RICH detectors are focused onto the photon-detector plane by the spherical and secondary mirrors. In case of misalignment the centre of the Cherenkov ring would not correspond to the intersection point of the charged track, and this would introduce a dependence of the difference between the measured and expected Cherenkov angle on the azimuthal angle of the ring, as

FIG. 5. Calibration process of the HPD image. (a) HPD image for a typical run. (b) The centre of the HPD image is cleaned to eliminate ion feedback. (c) Edge of the HPD image detected by the Sobel filter. (d) Radius and centre of the HPD image returned by the fit.
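The edge-detection and fit steps of the HPD image calibration can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the ion-feedback cleaning step is skipped, the image is a synthetic disc, and the "fit" is replaced by the centroid and mean radius of the edge pixels rather than a proper circle fit. All function names are hypothetical.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(image):
    """Gradient magnitude of a 2D list of numbers (border pixels stay 0)."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    value = image[r + dr][c + dc]
                    gx += SOBEL_X[dr + 1][dc + 1] * value
                    gy += SOBEL_Y[dr + 1][dc + 1] * value
            out[r][c] = math.hypot(gx, gy)
    return out


def estimate_circle(image, threshold_fraction=0.5):
    """Return (row centre, column centre, radius) from the edge pixels.

    Edge pixels are those whose gradient is at least threshold_fraction
    of the maximum. Centre = centroid of the edge pixels, radius = their
    mean distance from that centroid (a stand-in for a circle fit).
    """
    grad = sobel_magnitude(image)
    peak = max(max(row) for row in grad)
    edges = [(r, c) for r, row in enumerate(grad)
             for c, g in enumerate(row) if g >= threshold_fraction * peak]
    r0 = sum(r for r, _ in edges) / len(edges)
    c0 = sum(c for _, c in edges) / len(edges)
    radius = sum(math.hypot(r - r0, c - c0) for r, c in edges) / len(edges)
    return r0, c0, radius


def disc(size, centre, radius):
    """Synthetic test image: 1 inside a disc, 0 outside."""
    return [[1.0 if math.hypot(r - centre[0], c - centre[1]) <= radius else 0.0
             for c in range(size)] for r in range(size)]


if __name__ == "__main__":
    print(estimate_circle(disc(41, (20, 20), 12)))
```

On the synthetic disc the estimated centre coincides with the true one and the radius is recovered to within about a pixel, which is the resolution limit of this crude edge selection.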

[Figure: ∆θ [rad] versus φ [rad], two panels]

FIG. 6. Difference between the measured and expected Cherenkov angle, ∆θC, plotted as a function of the azimuthal angle φ and fitted with θx cos(φ) + θy sin(φ), for one side of the RICH 2 detector [6]. The upper plot is prior to alignment, and shows a dependency of the angle θC on the angle φ. The bottom plot is after the alignment correction, and ∆θC is uniform in φ.

[Figure: absolute value [mrad] versus alignment number, showing the average and range of local y-rotations and local z-rotations; LHCb RICH Preliminary, 24/05/2016 - 24/08/2016]

FIG. 7. Time dependence of the mirror alignment parameters for the RICH detector downstream of the magnet for the 2016 data sample, (upper) spherical mirror, (bottom) secondary mirror. The shaded regions show the range of alignment parameters for all the mirror pairs, the points show the average value for all the mirror pairs.

shown in Fig. 6. The alignment constants for each mirror are determined by the fit of the Cherenkov angle difference as a function of the azimuthal angle on the ring. The correlation between the different mirror pairs is also taken into account. The constants are evaluated by an iterative procedure implemented in a dedicated framework, which makes it possible to run the alignment in parallel using about 1800 nodes of the software trigger farm. The alignment of the RICH mirror system has been found to be stable enough to not affect the PID performance, as shown in Fig. 7, and it runs routinely as a monitoring task.
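The fit function quoted in the Fig. 6 caption, θx cos(φ) + θy sin(φ), is linear in its two parameters, so a single mirror pair can be fitted in closed form. The sketch below solves the 2×2 normal equations directly. It ignores the correlations between mirror pairs and the iteration described above, and the function name `fit_mirror_tilt` is hypothetical.

```python
import math


def fit_mirror_tilt(phis, dthetas):
    """Least-squares fit of dtheta = theta_x*cos(phi) + theta_y*sin(phi).

    Solves the 2x2 normal equations and returns (theta_x, theta_y).
    """
    scc = sum(math.cos(p) ** 2 for p in phis)
    sss = sum(math.sin(p) ** 2 for p in phis)
    scs = sum(math.cos(p) * math.sin(p) for p in phis)
    scd = sum(math.cos(p) * d for p, d in zip(phis, dthetas))
    ssd = sum(math.sin(p) * d for p, d in zip(phis, dthetas))
    det = scc * sss - scs * scs
    if abs(det) < 1e-12:
        raise ValueError("phi values do not constrain both parameters")
    theta_x = (scd * sss - ssd * scs) / det
    theta_y = (ssd * scc - scd * scs) / det
    return theta_x, theta_y


if __name__ == "__main__":
    # Synthetic misalignment: theta_x = 1 mrad, theta_y = -0.5 mrad.
    phis = [2 * math.pi * k / 16 for k in range(16)]
    dthetas = [0.001 * math.cos(p) - 0.0005 * math.sin(p) for p in phis]
    print(fit_mirror_tilt(phis, dthetas))
```

After a successful alignment both fitted parameters are compatible with zero, which is the flat ∆θC distribution shown in the lower panel of Fig. 6.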

Time alignment

In order to maximise the photon collection efficiency of the LHCb RICH detectors, the HPD readout must be synchronised with the LHC bunch crossing to within a few nanoseconds. The initial time alignment was performed in the absence of beam using a pulsed laser, and has been improved further with dedicated timing scan data taken during physics collisions. As shown in Fig. 8, all the HPDs have been time aligned to about 1 ns.


NIM A 12 (2016) 041


https://doi.org/10.1016/j.nima.2016.12.041

The turbo stream

Save only parts of the event needed for offline analysis
Multiple persistence levels:
Only candidate (∼ 7 kB)
Part of event (∼ 16 kB)
Full event (∼ 48 kB)
cf. Non-turbo (∼ 69 kB)

[Comput.Phys.Commun. 208 (2016) 35], applied in LHCb since 2015
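The benefit of the smaller persistence levels is easiest to see as bandwidth arithmetic. The sketch below uses the approximate event sizes from the list above; the helper names and the example rate and budget are illustrative choices, not values taken from the talk.

```python
# Approximate event sizes in kB, as quoted on the slide.
EVENT_SIZE_KB = {
    "turbo: candidate only": 7,
    "turbo: part of event": 16,
    "turbo: full event": 48,
    "non-turbo": 69,
}


def bandwidth_mb_per_s(event_size_kb, rate_hz):
    """Storage bandwidth in MB/s for a given event size and rate (1 MB = 1000 kB)."""
    return event_size_kb * rate_hz / 1000.0


def events_per_budget(event_size_kb, budget_mb_per_s):
    """Events per second that fit into a fixed bandwidth budget."""
    return budget_mb_per_s * 1000.0 / event_size_kb


if __name__ == "__main__":
    for level, size in EVENT_SIZE_KB.items():
        # 1 kHz is an arbitrary example rate.
        print(level, bandwidth_mb_per_s(size, 1000), "MB/s at 1 kHz")
    ratio = EVENT_SIZE_KB["non-turbo"] / EVENT_SIZE_KB["turbo: candidate only"]
    print("candidate-only events per non-turbo event:", ratio)
```

For a fixed storage budget, saving only the candidate allows roughly ten times as many events to be kept as saving the full non-turbo event.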

Turbo at LHCb: exploits the event topology and saves only the subset of objects relevant for later analysis. Several persistence levels can be used.

Used by many analyses, e.g.

[Figures: m(D0π+) [MeV/c2] distributions for D0 → K−K+ and D0 → π−π+ candidates with combinatorial background (LHCb); prompt-like µ+µ− mass spectrum, m(µ+µ−) [MeV], at √s = 13 TeV with pT(µ) > 1 GeV and p(µ) > 20 GeV; Ξcc++ candidate mass distribution showing data, total fit, signal and background (LHCb 13 TeV)]

CP violation in charm decays: PRL 122 (2019) 211803
Search for dark photons decaying to dimuons: PRL 120 (2018) 061801
Observation of Ξcc++: PRL 119 (2017) 112001

JINST 14 (2019) P04006

https://doi.org/10.1103/PhysRevLett.122.211803
https://doi.org/10.1103/PhysRevLett.120.061801
https://doi.org/10.1103/PhysRevLett.119.112001
https://doi.org/10.1088/1748-0221/14/04/P04006

The LHCb detector: Run III upgrade

Why do we want to upgrade for Run III?
• We currently level our luminosity at L ≈ 4 × 10^32 cm^-2 s^-1 [Run I]
• Huge gains available if we can run at higher luminosities
• Why do we run at lower luminosity?
  • Design choices for our physics programme
  • Detector and trigger limitations
• Note that upgrading for Run 3 is before the HL-LHC era in Run 4 onwards

(embedded slide: R. Quagliani, 20th March 2018)

LHCb limitations and upgrade
Run I and II: μ = 1.1-1.8, 3 (Run I) + 5 (Run II) fb^-1
Run III target: μ = 7.6, 50 fb^-1
Huge gain in physics capabilities if able to run at larger luminosity
LHCb will further expand its physics program as a GPD. [CERN-LHCC-2011-001]
New detectors and trigger strategy.

Run I + II target: 8 fb^-1
Run III + IV target: 50 fb^-1

[Figure: a real fill with the upgrade overlaid]

Run at 5× higher luminosity
Triggerless readout at 40 MHz
New vertex locator
New tracking (UT, SciFi)

CERN-LHCC-2012-007


https://cds.cern.ch/record/1443882

Challenges in Run III

Reaching the MHz signal era
Run 3: luminosity of 2 × 10^33 cm^-2 s^-1, √s = 14 TeV

At increased luminosity, charm (beauty) in 24 % (2 %) of bunch crossings

Cannot write out charm at 7 MHz

Trigger must distinguish signal from less-interesting signal as well as from background
No longer feasible to have first trigger based on calorimeters and muon detectors alone
Need as much information about an event as soon as possible

Change in trigger paradigm: access as much information about the collision as early as possible
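A short calculation shows why the charm rate cannot simply be written out. The sketch below works backwards from the two numbers on the slide (charm in 24 % of crossings, 7 MHz of charm) to the implied crossing rate, then asks how small an average event would have to be to fit a storage bandwidth. The 10 GB/s value is the storage rate shown in the Run 3 trigger diagram of this talk; the function names are illustrative.

```python
def crossing_rate_mhz(signal_rate_mhz, fraction):
    """Crossing rate implied by a signal rate and its fraction of crossings."""
    return signal_rate_mhz / fraction


def signal_rate_mhz(crossing_rate, fraction):
    """Signal rate in MHz for a given crossing rate (MHz) and fraction."""
    return crossing_rate * fraction


def max_event_size_kb(bandwidth_gb_per_s, rate_mhz):
    """Largest average event size in kB that fits the bandwidth at this rate."""
    bytes_per_event = bandwidth_gb_per_s * 1e9 / (rate_mhz * 1e6)
    return bytes_per_event / 1e3


if __name__ == "__main__":
    crossings = crossing_rate_mhz(7.0, 0.24)     # about 29 MHz
    beauty = signal_rate_mhz(crossings, 0.02)    # about 0.6 MHz
    budget = max_event_size_kb(10.0, 7.0)        # about 1.4 kB per event
    print(crossings, beauty, budget)
```

At 7 MHz the budget is about 1.4 kB per event, well below even the smallest Turbo persistence level (∼ 7 kB), so the trigger has to choose among signal events rather than only reject background.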


LHCb trigger in Run III

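[Figure 2 diagram: data flow and rates]

Run 1 and Run 2: pp collisions → 1 TB/s → hardware L0 → 50 GB/s → EB → 50 GB/s → HLT1 → (Run 2 only: 9 PB buffer, calibration) → HLT2 → storage at 0.3 GB/s (Run 1) and 0.7 GB/s (Run 2). The diagram also carries 4 GB/s and 6 GB/s labels. The HLT runs on an x86 CPU farm.

Run 3, baseline: pp collisions → 5 TB/s → EB → 5 TB/s → HLT1 → buffer on disk, calibration and alignment → HLT2 → 10 GB/s → storage. HLT1 and HLT2 run on an x86 CPU farm.

Run 3, GPU-enhanced: pp collisions → 5 TB/s → EB + HLT1 on GPUs → 0.2 TB/s → HLT2 → 10 GB/s → storage. HLT2 runs on an x86 CPU farm.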

Figure 2: Evolution of the LHCb trigger system. Real-time calibration and alignment was first performed between the HLT stages in Run 2. The FPGA-based hardware stage will be removed in Run 3. Our proposal focuses on adding GPUs to the EB servers and running the entire HLT1 sequence on GPUs. This reduces the bandwidth that needs to be transmitted from the EB nodes to the CPU farm from 5 TB/s to 0.2 TB/s. The cost savings on networking is expected to be more than the total cost of the GPUs needed to run HLT1 on the EB servers. Furthermore, the entire (fixed-size) CPU farm would be available for running HLT2.

(the Event Filter Farm or EFF) for processing by a software application called the high-level trigger (HLT), which is divided into two stages. HLT1 partially reconstructs events and selects a subset for further processing by HLT2, which performs a more complete reconstruction then executes many selection algorithms to further reduce the rate at which data are ingested for permanent storage.

In Run 1, the combination of limited CPU in the EFF (20k logical cores), lack of experience with the data (a new detector), and suboptimal algorithms limited HLT1 to reconstructing only a low-fidelity subset of the interesting objects in each event. Similarly, HLT2 was not able to reconstruct all objects, and the lack of data calibrations available in real time meant that offline reconstruction was necessary to produce the high quality data required for physics analysis. Despite these limitations, Williams pioneered the use of ML already in 2011 in the primary HLT2-selection algorithm, known as the Topological Trigger (TOPO). About 60% of all LHCb publications to date were produced using data recorded by the ML-based TOPO. By the end of Run 1, innovations like the TOPO made it possible for LHCb to process proton-proton collisions at twice its design maximum rate, while recording signal samples at more than twice the anticipated rates and with higher than expected purities.

For Run 2, Williams and collaborators from the Yandex corporation reoptimized the TOPO [10]. Furthermore, they replaced the primary HLT1 algorithms with ML-based versions. In Run 2, 75% of the data persisted by HLT1 were selected by these ML-based algorithms. Other major changes were also made to the trigger in Run 2. Increasing the number of (logical) EFF cores to 50k and deploying faster reconstruction algorithms allowed both HLT1 and HLT2 to execute the offline

Hardware trigger to be removed from Run III
HLT1 software trigger must perform at 30× higher rate with 5× the pileup
Buffer will reduce from O(weeks) → O(days)
Significant increase in data transfer rates
New trigger setup offers up to ∼ 10× efficiency improvement for some physics channels


Run III baseline HLT1 performance

Significant progress made to optimise tracking algorithms
∼ 4× improvement in throughput from vectorising, improvements to event model and optimisation

Figure 4.2: Throughput of the displaced-track reconstruction sequence, as a function of the product of the number of processes and number of threads, for different number of threads per process, as indicated in the legend. The throughput peak performance is 12400 evt/s/node for 2 processes and 20 threads per process. The “non-Hive” line indicates the performance that is achieved without multithreading.

Figure 4.3: Maximum throughput of the displaced-track reconstruction sequence, as a function of the cut on the impact parameter (in µm), for different transverse momentum thresholds in the pattern recognition algorithms, as indicated in the legend.
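The per-node throughput in the Figure 4.2 caption can be turned into a farm-level estimate. The sketch below scales the quoted 12400 evt/s/node to a number of nodes and inverts the relation for a target rate. The 30 MHz target is an assumption inferred from the "30× higher rate" statement relative to the 1 MHz Run II input, the throughput applies to the displaced-track sequence only, and the function names are illustrative.

```python
import math

PEAK_EVENTS_PER_S_PER_NODE = 12400  # quoted in the Figure 4.2 caption


def farm_throughput_mhz(events_per_s_per_node, n_nodes):
    """Total throughput in MHz, assuming perfect scaling across nodes."""
    return events_per_s_per_node * n_nodes / 1e6


def nodes_needed(target_rate_hz, events_per_s_per_node):
    """Smallest whole number of nodes that reaches the target rate."""
    return math.ceil(target_rate_hz / events_per_s_per_node)


if __name__ == "__main__":
    print(farm_throughput_mhz(PEAK_EVENTS_PER_S_PER_NODE, 1000))
    # 30 MHz is an assumed target, see the lead-in.
    print(nodes_needed(30e6, PEAK_EVENTS_PER_S_PER_NODE))
```

Perfect scaling across nodes is itself an idealisation, so the node count should be read as a lower bound for this particular sequence.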

[Figure: events processed per second, assuming 1000 reference nodes (MHz, axis 0 to 35), versus date (2018-09-25 to 2019-07-28); LHCb Upgrade simulation. Configurations shown:
Scalar event model, maximal SciFi reconstruction
Scalar event model, fast SciFi reconstruction with tighter track tolerance criteria
Scalar event model, vectorizable SciFi reconstruction with entirely reworked algorithm logic
Fully SIMD-POD friendly event model, vectorizable SciFi and vectorized vertex detector and PV reconstruction, I/O improvements]

Multi-threaded processes offer gains over running more processes in parallel
Optimal CPU architecture under investigation: new AMD architecture offers significant cost/benefit improvements

LHCB-FIGURE-2019-002
LHCB-TDR-017

https://cds.cern.ch/record/2684267
https://cds.cern.ch/record/2310827

Run III baseline HLT1 performance

Allows for a flexible and configurable sequence
Physics performance of single-track and two-track selections studied
Loose (L) and tight (T) versions of algorithms simulated with different pT thresholds (500−1000 MeV/c)
(Top) ∼ 1 MHz output rate achievable based on “minimum bias” simulation
Two-track selection remains efficient
Single-track selection still requires work

Upgrade trigger: Bandwidth strategy proposal. Ref: LHCb-PUB-2017-006, Public Note, Issue: 1. Section 3: Selection at HLT1. Date: February 22, 2017

[Figure 1 panels, each for tracking pT thresholds of 500, 750 and 1000 MeV and the LL, LT, TL, TT selections: (top) rate (MHz) and heavy flavor purity for the 1-Track, 2-Track and 1- OR 2-Track lines; (middle) efficiency for D0 → K+K−, D+ → K+K−π+, D0 → K+K−π+π−, D0 → KS0K+K− and Σc++ → Λc+π+; (bottom) efficiency for B+ → D0K+, B+ → KS0K+, B0 → K*µ+µ−, B0 → D−µ+ν, B0 → D+D− and B+ → K+K−π+]

Figure 1: HLT1 inclusive trigger (top) rates and heavy-flavor purities from minimum bias MC, and (middle) charm and (bottom) signal efficiencies on loosely selected MC. In each plot, the vertical light-grey lines separate results in which tracking pT thresholds of 500, 750, and 1000 MeV are used. The LL, LT, TL, and TT denote loose (L) and tight (T) selections applied in the 1-Track and 2-Track lines, where the first letter is for the 1-Track and the second for the 2-Track.

triggered by a b or c hadron. The purity of the 2-Track is 65-80%, depending on the MVA threshold, and on the value of the tracking pT threshold. The 1-Track, however, is only about 25% pure. The low purity of the 1-Track is partially due to its large ghost contamination of 30-40%, based on the assumption that 20% of ghost tracks remain after a Run 2-like ghost rejection requirement. For all modes except B+ → K0S K+, the 2-Track provides most of the signal efficiency, with HLT1 efficiencies for beauty averaging ∼ 80% at 1 MHz output rate. Charm efficiencies of ∼ 60% are achievable at the same rate. If the output rate must be reduced further, a 2-Track only HLT1 would still be capable of providing average efficiencies of ∼ 60% for beauty and ∼ 50% for charm signals at 0.5 MHz assuming no further optimisation of the 2-Track inclusive trigger. In reality however, the 2-Track would benefit from further optimisation in this operating regime, and for topologies including a K0S, exclusive selections would be required to obtain a high efficiency.

LHCb-PUB-2017-006


https://cds.cern.ch/record/2244313

LHCb trigger in Run III


Option to move to a GPU-based HLT1 with GPUs installed on the Event Builder servers
Free up full CPU farm for HLT2 and save on networking between event builders and CPU farm
Demonstrated technical feasibility
Decision due in early 2020


Why GPUs?

Moore’s law still holds but single-thread performance has levelled off
Gains now to be made through parallelisation
GPUs specialised for massively parallel operations (100s–1000s of cores)


HLT1

HLT1 involves decoding, clustering and track reconstruction for all tracking subdetectors
Algorithms also perform Kalman filter and trigger selection
All stages of the process may be parallelised

[Flow diagram] Raw data → Global Event Cut → Velo decoding and clustering → Velo tracking → Simple Kalman filter → Find primary vertices → UT decoding → UT tracking → SciFi decoding → SciFi tracking → Parameterized Kalman filter → Muon decoding → Muon ID → Find secondary vertices → Select events → Selected events


The Allen project

Generic configurable framework for GPU-based execution of an algorithm sequence
Data passed to GPU device
All algorithms executed in order
Results passed back to the host

Process thousands of events in a single sequence
Opportunity for massive parallelisation

Configurable sequences at compile time
Configurable algorithms at run time
Custom GPU memory management: no dynamic allocation
Built-in validation and monitoring
Cross-platform compatibility with CPU architectures
Named for Frances E. Allen
Implement HLT1 on GPUs
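Two of the design points above, running a fixed sequence of algorithms over a whole batch of events and handing out memory from a buffer that is allocated once, can be illustrated with a small CPU-only sketch. It is not the Allen API: the `Arena` class, `run_sequence` and the two toy algorithms are hypothetical, the sequence is assembled at run time rather than compile time, and the per-event loops stand in for what would be parallel GPU work.

```python
class Arena:
    """Fixed-size buffer handed out in slices; nothing is allocated later."""

    def __init__(self, size):
        self._buffer = [0.0] * size  # allocated once, up front
        self._used = 0

    def reserve(self, n):
        """Return (start, stop) indices of a fresh region of length n."""
        if self._used + n > len(self._buffer):
            raise MemoryError("arena exhausted")
        start = self._used
        self._used += n
        return start, start + n

    def reset(self):
        """Make the whole buffer available again for the next batch."""
        self._used = 0

    @property
    def data(self):
        return self._buffer


def run_sequence(algorithms, batch, arena):
    """Run every algorithm in order over a whole batch of events."""
    arena.reset()
    state = {"events": batch}
    for algorithm in algorithms:
        algorithm(state, arena)
    return state


# Two toy algorithms standing in for real reconstruction and selection steps.
def count_hits(state, arena):
    start, stop = arena.reserve(len(state["events"]))
    for i, event in enumerate(state["events"]):
        arena.data[start + i] = float(len(event))
    state["n_hits"] = (start, stop)


def select_events(state, arena, min_hits=3):
    start, stop = state["n_hits"]
    state["selected"] = [i for i in range(stop - start)
                         if arena.data[start + i] >= min_hits]


if __name__ == "__main__":
    arena = Arena(16)
    batch = [[1, 2, 3], [1], [1, 2, 3, 4]]
    print(run_sequence([count_hits, select_events], batch, arena)["selected"])
```

Resetting the arena at the start of each batch is what allows the same memory to be reused indefinitely without any allocation during data taking.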

Photo: User:Rama / Wikimedia Commons / CC-BY-SA-2.0 fr

LHCB-FIGURE-2019-009


http://cds.cern.ch/record/2693058/

Allen selection ingredients

Ingredients for selections: tracks, momentum, muon identification, primary vertices, impact parameter, secondary vertices

Selections
● 1-track selection
● 2-track selection
● Based on p, pT, displacement, vertex criteria and muon identification

[Figures, LHCb simulation, GPU R&D: track efficiency versus pT [MeV] for electrons and non-electrons, with the pT distributions overlaid; momentum resolution σp/p [%] versus p [MeV/c]; primary vertex efficiency versus track multiplicity of MC PV, with the multiplicity distribution overlaid]


Selections
One track
Two tracks
Single muon
Two muons (displaced)
Two muons (high-mass)


Impact parameter

[Plot: efficiency versus rate [kHz]; curves: 2-Track parameterized, 2-Track simple, 1-Track parameterized, 1-Track simple. LHCb simulation, GPU R&D.]

Figure 6: Efficiency of the 1-Track and 2-Track trigger lines when calculating the IP χ2 from tracks fitted with the simple Kalman filter directly after the Velo tracking versus the parameterized Kalman filter on the Velo-segment of a track using the momentum estimate from the Forward tracking. Using the B0s → φφ sample.
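For orientation, the impact parameter (IP) that enters this comparison is the distance of closest approach between a track and a primary vertex. The sketch below treats the track as a straight line and uses a single scalar uncertainty `sigma_ip`; both are simplifying assumptions, since the IP χ2 in the caption is built from Kalman-fitted track states. The function names are hypothetical.

```python
import math
from typing import Sequence


def impact_parameter(point: Sequence[float],
                     direction: Sequence[float],
                     vertex: Sequence[float]) -> float:
    """Distance of closest approach between a straight-line track and a vertex.

    `point` is any position on the track and `direction` its direction
    (not necessarily normalised); all arguments are (x, y, z).
    """
    # Vector from the point on the track to the vertex.
    dx, dy, dz = (v - p for v, p in zip(vertex, point))
    ux, uy, uz = direction
    # |d x u| / |u| is the perpendicular distance from the vertex to the line.
    cx = dy * uz - dz * uy
    cy = dz * ux - dx * uz
    cz = dx * uy - dy * ux
    return math.sqrt(cx * cx + cy * cy + cz * cz) / math.sqrt(ux * ux + uy * uy + uz * uz)


def ip_chi2(ip: float, sigma_ip: float) -> float:
    """Simplified IP chi2: squared IP in units of its assumed uncertainty."""
    return (ip / sigma_ip) ** 2
```

A better track fit changes the estimated track state and its uncertainty, which is how the choice of Kalman filter can move the IP χ2 and hence the line efficiencies.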

Figure 7: Throughput of the entire HLT1 sequence on single GPU cards.

Trigger             Rate [kHz]
1-Track             249 ± 18
2-Track             663 ± 30
High-pT muon        1 ± 1
Displaced dimuon    50 ± 8
High-mass dimuon    101 ± 12
Total               971 ± 36

Table 1: Rates of the five trigger lines implemented in Allen and the total HLT1 output rate, determined with minimum bias events.

Muon ID

[Plot: p resolution σp/p [%] versus p [MeV/c], with the p distribution (number of events, a.u.) overlaid. LHCb simulation, GPU R&D.]

Figure 4: Momentum resolution of tracks passing through the Velo, UT and SciFi detectors versus momentum, using the combination of signal samples described in the text. Points represent the mean, error bars the width of a Gaussian distribution fitted to the resolution in every momentum slice. The momentum distribution is overlaid as histogram.
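The per-slice procedure in the caption can be sketched as follows. This version swaps the Gaussian fit for the sample mean and standard deviation of the relative residual in each momentum slice, which is a cruder stand-in and not the method used for the figure. The function name and the inputs (`p_true`, `p_reco`, `slice_width`) are hypothetical.

```python
import statistics
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def resolution_per_slice(p_true: Sequence[float],
                         p_reco: Sequence[float],
                         slice_width: float) -> Dict[int, Tuple[float, float]]:
    """Return {slice index: (mean, width)} of (p_reco - p_true) / p_true.

    Tracks are grouped into slices of true momentum; the mean and the
    population standard deviation stand in for a fitted Gaussian.
    """
    residuals: Dict[int, List[float]] = defaultdict(list)
    for true, reco in zip(p_true, p_reco):
        residuals[int(true // slice_width)].append((reco - true) / true)
    return {
        index: (statistics.fmean(values), statistics.pstdev(values))
        for index, values in sorted(residuals.items())
    }


# Four toy tracks in two 20 GeV/c wide slices (momenta in MeV/c).
slices = resolution_per_slice(
    [10000.0, 10000.0, 30000.0, 30000.0],
    [10100.0, 9900.0, 30300.0, 29700.0],
    20000.0,
)
```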

[Plot: muon ID efficiency versus p [MeV], with the p distribution (number of events, a.u.) overlaid. LHCb simulation, GPU R&D.]

Figure 5: Muon identification efficiency versus momentum for tracks passing through the Velo, UT and SciFi detectors with respect to all reconstructible muons, using the combination of signal samples described in the text. The momentum distribution is overlaid as histogram.




Allen performance

Trigger             Rate [kHz]
1-Track             249 ± 18
2-Track             663 ± 30
High-pT muon        1 ± 1
Displaced dimuon    50 ± 8
High-mass dimuon    101 ± 12
Total               971 ± 36

Total rate reduced 30 → 1 MHz
Physics performance consistent with x86 baseline
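A short arithmetic check of the quoted rates, using only the central values from the table. The individual lines add up to more than the total, presumably because an event accepted by several lines is counted once in the total; that reading is an assumption, not a statement from the slide.

```python
# Central values of the line rates in kHz, copied from the table above.
LINE_RATES_KHZ = {
    "1-Track": 249,
    "2-Track": 663,
    "High-pT muon": 1,
    "Displaced dimuon": 50,
    "High-mass dimuon": 101,
}
TOTAL_RATE_KHZ = 971      # quoted total HLT1 output rate
INPUT_RATE_KHZ = 30_000   # 30 MHz input rate quoted on the slide

# Sum of the individual lines: 1064 kHz.
sum_of_lines_khz = sum(LINE_RATES_KHZ.values())

# Excess of the sum over the quoted total: 93 kHz.
overlap_khz = sum_of_lines_khz - TOTAL_RATE_KHZ

# Rate reduction factor, about 31, matching "30 -> 1 MHz".
reduction_factor = INPUT_RATE_KHZ / TOTAL_RATE_KHZ
```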

Signal            GEC      TIS -OR- TOS   TOS      GEC × TOS
B0 → K*0 µ+µ−     89 ± 2   85 ± 2         78 ± 3   69 ± 3
B0 → K*0 e+e−     84 ± 3   69 ± 4         62 ± 4   53 ± 3
B0s → φφ          83 ± 3   70 ± 3         65 ± 4   54 ± 3
D+s → K+K−π+      82 ± 4   62 ± 5         38 ± 5   32 ± 4
Z → µ+µ−          78 ± 1   97 ± 1         97 ± 1   75 ± 1

GEC = global event cut, TIS = trigger independent of signal, TOS = trigger on signal
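The last column can be cross-checked against the product of the GEC and TOS columns. The sketch assumes the entries are percentages and that "GEC × TOS" is literally that product, as the label suggests; small differences would then come from rounding of the quoted values. Decay names are written in ASCII for the dictionary keys.

```python
# (GEC, TOS, quoted GEC x TOS) central values, read as percentages (assumption).
EFFICIENCIES = {
    "B0 -> K*0 mu+ mu-": (89, 78, 69),
    "B0 -> K*0 e+ e-": (84, 62, 53),
    "B0s -> phi phi": (83, 65, 54),
    "D+s -> K+ K- pi+": (82, 38, 32),
    "Z -> mu+ mu-": (78, 97, 75),
}


def combined_efficiency(gec_percent: float, tos_percent: float) -> float:
    """Product of two efficiencies given in percent, returned in percent."""
    return gec_percent * tos_percent / 100.0


# Difference between the recomputed product and the quoted column, per decay.
differences = {
    decay: combined_efficiency(gec, tos) - quoted
    for decay, (gec, tos, quoted) in EFFICIENCIES.items()
}
```

With these inputs every recomputed product lies within one percentage point of the quoted value, well inside the stated uncertainties.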




Allen throughput

Throughput on various GPUs

Throughput of the full HLT1 sequence

HLT1 can run on 500 GPUs → read out full detector
Buy GPUs instead of expensive network
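The 500-GPU figure implies a per-card throughput requirement that is easy to work out. The sketch assumes the 30 MHz input rate quoted on the Allen performance slide and an even split of events across cards; the function names are hypothetical.

```python
import math

INPUT_RATE_KHZ = 30_000.0  # 30 MHz, as quoted on the Allen performance slide


def required_throughput_khz(input_rate_khz: float, n_gpus: int) -> float:
    """Throughput each card must sustain if events are shared evenly."""
    return input_rate_khz / n_gpus


def gpus_needed(input_rate_khz: float, per_gpu_khz: float) -> int:
    """Smallest whole number of cards that can absorb the input rate."""
    return math.ceil(input_rate_khz / per_gpu_khz)


# 500 cards sharing 30 MHz must each process 60 kHz.
per_gpu_khz = required_throughput_khz(INPUT_RATE_KHZ, 500)
```

Read the other way round, a card sustaining a higher rate than this would reduce the number of cards needed, which is the scaling argument made on the throughput slides.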

[Plot: throughput on Tesla V100 16GB [kHz] versus SciFi raw data volume [kB], with the data volume distribution (number of events, a.u.) overlaid. LHCb simulation, GPU R&D.]

Figure 8: Throughput of the Allen sequence as a function of the SciFi raw data volume. The raw data volume distribution is overlaid as histogram.

[Plot: Allen throughput [kHz] versus theoretical 32-bit TFLOPS for GeForce GTX 670, GeForce GTX 680, GeForce GTX 1060 6GB, GeForce GTX TITAN X, GeForce GTX 1080 Ti, Tesla T4, GeForce RTX 2080 Ti, Quadro RTX 2080 Ti and Tesla V100 32GB. LHCb simulation, GPU R&D.]

Figure 9: Allen throughput on various GPUs with respect to their reported peak 32-bit FLOPS performance.

Full HLT1 algorithm can be run on ∼500 current GPUs
Buy GPUs instead of networking
Performance scales with GPU so can expect more from 2021 GPUs

Not yet limited by Amdahl's law
Potential to perform more tasks within HLT1
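For reference, Amdahl's law bounds the speedup of a workload in which only a fraction of the work can be parallelised. The sketch below is the textbook formula; the parallel fraction used in the example is hypothetical and is not a measured property of Allen.

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Speedup on n_units when only parallel_fraction of the work scales.

    The serial part (1 - parallel_fraction) is unaffected by adding units.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


# Hypothetical example: with 95% parallel work the speedup can never reach 20,
# however many processing units are added.
capped = amdahl_speedup(0.95, 1_000_000_000)
```

"Not yet limited" then means throughput still grows with the available compute, as would be expected while the serial part remains negligible.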




Summary

Real-time reconstruction and calibration a success story for LHCb in Run II
Offline-quality reconstruction allowed for many trigger selections to be moved to the turbo stream

Selections can be based on offline-quality features
Smaller event size → higher event rate for same disk rate
Tradeoff – full raw event not saved → cannot rerun reconstruction offline
Already crucial for charm decays in Run II

LHCb detector and DAQ upgrades for Run III
No hardware trigger
First-stage software trigger must perform track reconstruction at bunch-crossing rate

Baseline x86 implementation of first-stage trigger significantly optimised to deal with higher throughput
Allen project offers a GPU implementation

Generic framework allows for configurable algorithm sequence
Feasibility for possible use in LHCb Run III already demonstrated
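The event-size argument for the turbo stream in the summary is a simple ratio: at fixed disk bandwidth, the sustainable event rate is inversely proportional to the event size. The numbers in the sketch are hypothetical and chosen only to show the scaling; they are not LHCb figures.

```python
def event_rate_khz(bandwidth_mb_per_s: float, event_size_kb: float) -> float:
    """Event rate sustainable at a given output bandwidth.

    MB/s divided by kB per event gives thousands of events per second (kHz),
    taking 1 MB = 1000 kB.
    """
    return bandwidth_mb_per_s / event_size_kb


# Hypothetical numbers: same bandwidth, event size reduced by a factor of ten.
full_event_rate = event_rate_khz(600.0, 60.0)
reduced_event_rate = event_rate_khz(600.0, 6.0)
```

The tenfold smaller event allows a tenfold higher rate at the same bandwidth, at the cost noted in the summary: without the full raw event the reconstruction cannot be rerun offline.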

