Boosted Top Tagging with Deep Neural Networks
Jannicke Pearkes, University of British Columbia, Engineering Physics
Wojtek Fedorko, Alison Lister, Colin Gay
Inter-Experimental Machine Learning Workshop, March 22nd, 2017
Overview
• Introduction
• Method
  – Monte Carlo samples
  – Network architecture & training
• Results
  – Preprocessing
  – pT dependence
  – Pileup dependence
  – Learning what is being learnt
• Next Steps
Introduction
• Train a deep neural network to discriminate between jets originating from top quarks and those originating from QCD background
[Figure: sketch of top quark decay products (W, b) at low top pT, merging into a single large-radius jet at high top pT as the boost increases. Image: Emily Thompson]
Monte Carlo Samples
• Signal: Z′ → ttbar
• Background: dijet
• Generated with PYTHIA v8.219, NNPDF23 LO AS 0130 QED PDF
• DELPHES v3.4.0 using the default CMS card
• Jets clustered from DELPHES energy-flow objects
• Anti-kT jets selected with R = 1.0
• Trimming performed with the kT algorithm, R = 0.2, pT frac = 5%
• Signal jets are selected where a truth top decays hadronically within ΔR = 0.75 of a large-radius jet
• Jets are required to have |η| ≤ 2.0
• Jets are subsampled to be flat in pT and signal-matched in η
• Looking at jets with pT between 600 and 2500 GeV
• ~4 million signal jets and ~4 million background jets
• Sample divided 80% / 10% / 10% into training, validation, and testing
Examples of Jet Images
[Figure: jet images in translated pseudorapidity η vs translated azimuthal angle φ, colour scale: jet pT per pixel (GeV), 10^-4 to 10^0. Panels: background jets with pT = 702, 1370, and 2376 GeV; signal jets with pT = 781, 1480, and 2358 GeV]
Jet images are typically very sparse: roughly 5–10% pixel activation on average when using a 0.1 × 0.1 grid [1].
[1] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, "Jet-images -- deep learning edition", JHEP 07 (2016) 069, arXiv:1511.05190 [hep-ph].
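The sparsity claim is easy to see numerically. The sketch below pixelates a toy jet (randomly generated constituents, not a real shower model) into a 0.1 × 0.1 grid over the same [-1, 1] × [-1, 1] window as the images above and counts active pixels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy jet: ~30 constituents clustered near the jet axis (illustrative only,
# not drawn from a real parton-shower model).
eta = rng.normal(0.0, 0.3, size=30)
phi = rng.normal(0.0, 0.3, size=30)
pt = rng.exponential(20.0, size=30)  # GeV

# Pixelate into a 0.1 x 0.1 grid over [-1, 1] x [-1, 1] (20 x 20 pixels),
# summing constituent pT per pixel as in the jet-image approach.
image, _, _ = np.histogram2d(
    eta, phi, bins=20, range=[[-1, 1], [-1, 1]], weights=pt
)

activation = np.count_nonzero(image) / image.size
print(f"active pixels: {activation:.1%}")
```

With 30 constituents in 400 pixels, at most 7.5% of the image can be active, consistent with the 5–10% figure quoted above.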
Neural Network Inputs
• Use a sequence of jet constituents rather than an image
• Advantages:
  – No loss of information due to pixelization into an image
  – Inputs are more information-dense
• Using 120 constituents, average activation is 30–50%
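A minimal sketch of turning a jet's ordered constituent list into a fixed-length network input, assuming three features (pT, η, φ) per constituent; the feature choice is an assumption, only the 120-constituent length comes from the slide:

```python
import numpy as np

N_CONSTITUENTS = 120  # fixed sequence length from the slide

def to_fixed_sequence(constituents, n=N_CONSTITUENTS):
    """Truncate/zero-pad an ordered (pT, eta, phi) list to a flat vector.

    `constituents` is assumed to already be ordered (the talk orders by
    subjet pT, then constituent pT within each subjet); here we only
    enforce a fixed length so every jet maps to the same input shape.
    """
    arr = np.asarray(constituents, dtype=float)[:n]  # keep the first n
    padded = np.zeros((n, 3))
    padded[: len(arr)] = arr                         # zero-pad the rest
    return padded.reshape(-1)                        # shape (3n,)

# A jet with only 3 constituents still yields a 360-dimensional input.
x = to_fixed_sequence([(250.0, 0.1, -0.2), (80.0, -0.3, 0.4), (5.0, 0.0, 0.9)])
print(x.shape)  # (360,)
```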
Training and Network Architecture
• Implemented with Keras
• Initially planned on using an LSTM, but ended up using a fully connected network
• We found that the performance of the LSTM and the fully connected network was very similar, but the fully connected networks were much faster to train (~10 times), which allowed for faster experimentation with preprocessing techniques and network architectures

Network type: fully connected
Number of layers: 5, [300, 150, 50, 10, 5, 1]
Number of free parameters: 41,323
Activation function: rectified linear units, sigmoid on output
Optimizer: Adam
Loss: binary cross-entropy
Early stopping: patience of 5
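The table above can be sketched in Keras roughly as follows. The layer widths, activations, optimizer, loss, and early-stopping patience are taken from the table; the input dimensionality (here 360 = 120 constituents × 3 features) is an assumption not stated on the slide, so this sketch's parameter count need not match the quoted 41,323.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Fully connected network: hidden widths 300/150/50/10/5 with ReLU,
# single sigmoid output for signal-vs-background classification.
model = keras.Sequential(
    [layers.Dense(w, activation="relu") for w in (300, 150, 50, 10, 5)]
    + [layers.Dense(1, activation="sigmoid")]
)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping with patience 5, as in the talk.
early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)

# Shapes are inferred on the first call; the assumed input size is 360.
out = model(np.zeros((2, 360), dtype="float32"))
print(out.shape)

# Training would then look like:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           callbacks=[early_stop], epochs=100)
```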
Preprocessing
• Large-radius, R = 1.0, jets are trimmed using R = 0.2 subjets found with the kT algorithm and pT frac = 5%
• Order subjets by subjet pT, and jet constituents by pT within each subjet
• We use only the 120 highest-pT jet constituents
• Perform preprocessing using domain knowledge about the physics at hand
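The ordering step above can be sketched as follows. The `(subjet_pt, constituents)` data layout is an illustrative assumption; in the talk the subjets come from the kT, R = 0.2 trimming step.

```python
import numpy as np

def order_constituents(subjets):
    """Order constituents: subjets by descending subjet pT, then
    constituents by descending pT within each subjet.

    `subjets` is a list of (subjet_pt, constituents) pairs, where each
    constituent is a (pt, eta, phi) tuple.
    """
    ordered = []
    for _, constituents in sorted(subjets, key=lambda s: -s[0]):
        ordered.extend(sorted(constituents, key=lambda c: -c[0]))
    return ordered[:120]  # keep only the 120 highest-priority constituents

jet = [
    (90.0, [(60.0, 0.2, 0.1), (30.0, 0.25, 0.05)]),
    (400.0, [(250.0, 0.0, 0.0), (150.0, 0.01, -0.02)]),
]
print([c[0] for c in order_constituents(jet)])  # [250.0, 150.0, 60.0, 30.0]
```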
No Preprocessing
[Figure: ROC curve, background rejection vs top tagging efficiency, jet pT = 600–2500 GeV, trimming only]
Trimming only: AUC = 0.83, Rϵ=50% = 8.85, Rϵ=80% = 3.36
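The AUC and background-rejection figures quoted on these slides can be computed from tagger scores roughly as below; the beta-distributed toy scores are a stand-in for real DNN outputs, and Rϵ is 1/(false positive rate) at the chosen signal efficiency ϵ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)

# Toy scores standing in for DNN outputs (signal peaked towards 1).
y_true = np.concatenate([np.ones(5000), np.zeros(5000)])
y_score = np.concatenate([rng.beta(5, 2, 5000), rng.beta(2, 5, 5000)])

auc = roc_auc_score(y_true, y_score)

# Background rejection R = 1 / (false positive rate) at a chosen
# signal (top tagging) efficiency, as quoted on the slides.
fpr, tpr, _ = roc_curve(y_true, y_score)

def rejection_at(eff):
    return 1.0 / np.interp(eff, tpr, fpr)

print(f"AUC = {auc:.3f}, R(eps=50%) = {rejection_at(0.5):.1f}")
```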
Scale
• Scale the pT of all jet constituents by a common factor so that the constituent pT lies approximately between 0 and 1
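A one-line sketch of the scaling step; the divisor used here (the 2500 GeV upper edge of the jet pT range) is an illustrative choice, since the slide only requires one fixed factor shared by all constituents.

```python
import numpy as np

def scale_constituents(pt, scale=2500.0):
    """Scale all constituent pT by a common factor so values lie
    roughly in [0, 1]. The divisor is an assumed choice."""
    return np.asarray(pt) / scale

print(scale_constituents([1200.0, 300.0, 5.0]))  # [0.48  0.12  0.002]
```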
[Figure: ROC curves, background rejection vs top tagging efficiency, jet pT = 600–2500 GeV: trimming only vs scaling]
Scaling: AUC = 0.900, Rϵ=50% = 21.3, Rϵ=80% = 6.02
Translate
• Center the jet about the highest-pT subjet in the η, φ plane
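A sketch of the translation step. The η shift is a plain subtraction; the φ difference is wrapped into (-π, π] so a jet straddling the φ boundary stays contiguous (the wrapping detail is standard practice rather than stated on the slide).

```python
import numpy as np

def translate(eta, phi, eta0, phi0):
    """Center constituents on the highest-pT subjet axis (eta0, phi0)."""
    deta = np.asarray(eta) - eta0
    # Wrap phi differences into (-pi, pi].
    dphi = np.mod(np.asarray(phi) - phi0 + np.pi, 2 * np.pi) - np.pi
    return deta, dphi

deta, dphi = translate([0.5, -0.1], [3.1, -3.1], 0.5, 3.1)
print(deta, dphi)  # deta = [0.0, -0.6], dphi ≈ [0.0, 0.083]
```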
[Figure: ROC curves, background rejection vs top tagging efficiency, jet pT = 600–2500 GeV: trimming only, scale, translation]
Translation: AUC = 0.924, Rϵ=50% = 33.2, Rϵ=80% = 8.48
Rotate
• Designed a method of rotation that preserves the jet mass
• Transform (pT, η, φ) into (px, py, pz)
• Rotate so that the second-highest-pT subjet is aligned with the negative y-axis
• Transform (px, py, pz) back to (pT, η, φ)
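The steps above can be sketched as follows. Here `idx2` marks a constituent standing in for the second subjet's axis (a simplification; the talk uses the kT subjet axis itself), and the leading subjet is assumed to already sit at the origin after translation, i.e. along the x-axis in momentum space. Because this is a rigid rotation of (px, py, pz), it preserves all opening angles and hence the jet mass for massless constituents.

```python
import numpy as np

def rotate_jet(pt, eta, phi, idx2):
    """Rotate constituents in momentum space so the second-highest-pT
    subjet lies along the negative y-axis."""
    pt, eta, phi = map(np.asarray, (pt, eta, phi))
    # (pT, eta, phi) -> Cartesian momenta
    px, py, pz = pt * np.cos(phi), pt * np.sin(phi), pt * np.sinh(eta)

    # Rotate about the x-axis (the leading-subjet direction after
    # translation) by the angle that carries the second subjet to the
    # negative y-axis, i.e. to angle pi in the (py, pz) plane.
    beta = np.pi - np.arctan2(pz[idx2], py[idx2])
    c, s = np.cos(beta), np.sin(beta)
    py, pz = c * py - s * pz, s * py + c * pz

    # Cartesian momenta -> (pT, eta, phi)
    pt_new = np.hypot(px, py)
    return pt_new, np.arcsinh(pz / pt_new), np.arctan2(py, px)

# The second constituent starts above the jet axis (eta = 0.4) and ends
# on the negative-phi side with eta = 0 after the rotation.
pt_n, eta_n, phi_n = rotate_jet([100.0, 50.0], [0.0, 0.4], [0.0, 0.0], 1)
```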
[Figure: ROC curves, background rejection vs top tagging efficiency, jet pT = 600–2500 GeV: trimming only, scale, translation, rotation]
Rotation: AUC = 0.932, Rϵ=50% = 42.3, Rϵ=80% = 9.57
Flip
• The third subjet is not constrained, but can be moved to the right half of the plane
• Flip the jet if the average pT is in the left half of the plane
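A sketch of the flip, assuming "left half of the plane" means negative η after rotation and using the pT-weighted mean η as the "average pT" criterion; both readings are assumptions about the slide's wording.

```python
import numpy as np

def flip_jet(pt, eta, phi):
    """Mirror the jet in eta so the pT-weighted centroid lands in the
    right half-plane (eta > 0)."""
    pt, eta, phi = map(np.asarray, (pt, eta, phi))
    if np.sum(pt * eta) < 0:   # pT-weighted centroid in the left half
        eta = -eta             # reflect about the vertical (phi) axis
    return pt, eta, phi

_, eta_f, _ = flip_jet([10.0, 1.0], [-0.5, 0.3], [0.0, 0.1])
print(eta_f)  # [ 0.5 -0.3]
```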
[Figure: ROC curves, background rejection vs top tagging efficiency, jet pT = 600–2500 GeV: trimming only, scale, translation, rotation, flip]
Flip: AUC = 0.933, Rϵ=50% = 44.3, Rϵ=80% = 9.75
Performance on Truth vs Reconstructed Jets
Performance after preprocessing
[Figure: ROC curves, background rejection vs top tagging efficiency, jet pT = 600–2500 GeV: DNN (truth), τ32 (truth), DNN (reco), τ32 (reco)]
Performance at 50% overall Signal Efficiency
[Figure: signal efficiency and background rejection vs jet pT (600–2400 GeV), shown separately for truth jets and reconstructed jets]
Truth jets: AUC = 0.947, Rϵ=50% = 66, Rϵ=80% = 13
Reconstructed jets: AUC = 0.933, Rϵ=50% = 44, Rϵ=80% = 9.7
Pileup
Performance at different levels of pileup
[Figure: ROC curves, background rejection vs top tagging efficiency, jet pT = 600–2500 GeV: no pileup, pileup = 23, pileup = 50]
Extremely stable performance with respect to pileup
[Figure: signal efficiency and background rejection vs jet pT (600–2400 GeV) for no pileup, pileup = 23, and pileup = 50]
pT dependence also stable with respect to pileup
Learning what is being learnt
Jet Mass
[Figure, left: P(jet mass | DNN output) for background jets, DNN output vs jet mass (0–500 GeV). Right: jet mass distributions (0–400 GeV) for signal and background, flat pT distribution, 600 < jet pT < 2500 GeV]
Next Steps
Short term:
• Revisit LSTMs
• Thorough Bayesian hyper-parameter optimization
Longer term:
• Both top and W tagging with deep neural networks are now reasonably well-established on Monte Carlo
• "But does it work on data?"
• Start working towards evaluating the performance of these techniques on data
• Investigate the effects of systematics and strategies for mitigating their impact
Thank you!
W-tagging performance on truth
G. Louppe, K. Cho, C. Becot, K. Cranmer, "QCD-Aware Recursive Neural Networks for Jet Physics", https://arxiv.org/abs/1702.00748
Zooming
Barnard, Dawe, Dolan, Rajcic, "Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks", https://arxiv.org/pdf/1609.00607v2.pdf
Performance when trained and tested on different levels of pileup
[Figure: signal efficiency and background rejection vs jet pT (600–2400 GeV) for networks trained on µ = 0, µ = 23, and µ = 50, each tested on µ = 0, µ = 23, and µ = 50]
- Examined how a neural network trained at one pileup level performs at another
- The NN seems relatively robust to the changes in pileup expected at the LHC in the next few years
τ32
[Figure, left: P(τ32^wta | DNN output) for background jets, DNN output vs τ32. Right: τ32 distributions for signal and background, flat pT distribution, 600 < jet pT < 2500 GeV]