ATLAS TDAQ

Reimplementation of the ATLAS Online Event Monitoring Subsystem

Ingo Scholtes, Summer Student, University of Trier
Supervisor: Serguei Kolos
Outline

- Online Monitoring
- Current implementation and its drawbacks
- My reimplementation
- Performance comparison
- Conclusion
Getting the context: the ATLAS Online Software

- System of the ATLAS Trigger/DAQ (TDAQ) project
- Main purpose: configure, control and monitor the data acquisition system
- Provides a GUI for controlling the data acquisition system
- The "glue" between several TDAQ sub-systems
- Open source project

[Diagram: the Online Software consists of the Configuration, Control and Monitoring subsystems; the Monitoring subsystem in turn contains the Information Service, Histogram Distribution, Error Reporting and Event Sample Monitoring subsystems.]
A brief overview of the ATLAS Online Event Monitoring Subsystem

[Diagram: in the data flow, events pass from the detector electronics through Read-Out Drivers (RODs) and Read-Out Systems (ROSs) to the Event Builder. At each stage an Event Sampler running user code picks out events matching selection criteria (A, B, C, D, ...) and forwards them to the Monitor Tasks operated by physicists.]
The current implementation in detail

[Diagram: samplers for Detector1/Crate1 and Detector1/Crate2 each keep an event buffer, while a central Distributor runs on machine A. Monitor Tasks on further machines issue SELECT requests for a sampler (e.g. Detector1/Crate1) followed by START; events then reach them through ITERATOR calls routed via the Distributor.]
Drawbacks...

- Scalability problem due to the bottleneck on machine A
- Monitors will not notice if a sampler crashed; they just stop receiving events...
- Users have to worry about thread management in the sampler:
  - start a thread on StartSampling
  - cleanly exit it on StopSampling
  - this often causes problems
...and what we learn from them

- Core of the scalability problem: the central distributor
- It becomes a bottleneck because...
  - every event is routed through the central distributor
  - identical events are distributed multiple times to different monitors
Implementation requirements

- Platform-independent C++
- Uses the Online Monitoring IPC, based on CORBA (omniORB 4)
- Minimal and deterministic effect on the performance of the data flow system
- High scalability
- Get rid of all the drawbacks... ;-)
What is really crucial?

- The sampler has to decide about the criteria; this saves a lot of bandwidth
- The sampler has to send each event once (per selection criterion)
- A distributor is still necessary to protect the sampler from inrushing monitors (gatekeeper function)
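The filtering idea can be sketched as follows (a minimal illustration, not the actual Monitoring API; `SamplerFilter`, `Criterion` and the byte-vector `Event` type are invented for this sketch): the sampler evaluates each event against the subscribed criteria and emits at most one copy per matching criterion, rather than one copy per monitor.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: the sampler tests each event against the active
// selection criteria and ships it at most once per matching criterion,
// instead of pushing every event to a central distributor.
using Event = std::vector<uint8_t>;
using Criterion = std::function<bool(const Event&)>;

class SamplerFilter {
public:
    void addCriterion(const std::string& name, Criterion c) {
        criteria_[name] = std::move(c);
    }
    // Returns the names of the criteria this event matches; the sampler
    // sends one copy per matching criterion (not per monitor).
    std::vector<std::string> match(const Event& ev) const {
        std::vector<std::string> hits;
        for (const auto& [name, pred] : criteria_)
            if (pred(ev)) hits.push_back(name);
        return hits;
    }
private:
    std::map<std::string, Criterion> criteria_;
};
```

An event matching no criterion is never shipped at all, which is where the bandwidth saving comes from.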
Basic improvement ideas

- Get rid of the distributor for communication: P2P
- Move load to the monitors for the sake of scalability
  - Current: share bandwidth, accumulate load
  - Idea: share load, accumulate bandwidth ;-)
- Distributor only for connection management and error recovery
- Keep only the crucial things in the sampler:
  - criteria decisions
  - one-time sending of each event (at least one connection per sampler/criterion)
  - sampler thread management:
    - start the sampling thread with the first subscription
    - end the sampling thread with the loss of the last subscription
    - user code is not aware of threads
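The thread lifecycle tied to subscriptions can be sketched with a subscription counter (an illustration with invented names, assuming subscribe/unsubscribe are invoked from a single control path; this is not the actual framework code):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch: the framework starts the sampling thread when the first monitor
// subscribes and joins it when the last subscription is dropped, so user
// code never has to start or stop threads itself.
class SamplingCore {
public:
    void subscribe() {
        if (subscribers_++ == 0) {            // first subscription: start worker
            running_ = true;
            worker_ = std::thread([this] {
                while (running_) {            // user sampling code would run here
                    ++iterations_;
                    std::this_thread::yield();
                }
            });
        }
    }
    void unsubscribe() {
        if (--subscribers_ == 0) {            // last subscription gone: stop worker
            running_ = false;
            worker_.join();
        }
    }
    long iterations() const { return iterations_; }
private:
    std::atomic<int>  subscribers_{0};
    std::atomic<bool> running_{false};
    std::atomic<long> iterations_{0};
    std::thread worker_;
};
```

The second and later subscriptions only bump the counter; only the transitions 0→1 and 1→0 touch the thread.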
Getting rid of the distributor bottleneck

- Can be achieved by using the P2P paradigm
- One monitor per sampler: the monitor connects directly to the sampler
- Multiple monitors per sampler? We need a way to prevent a bottleneck here!

[Diagram: event samplers (ES) connected directly to event monitors (EM); with many monitors attached to a single sampler, the sampler itself becomes the bottleneck.]
Introducing the monitor tree

[Diagram: example of a binary monitor tree — the sampler feeds a root monitor via the distributor, and each monitor forwards events to at most two child monitors.]

- Barters time/bandwidth requirements against latency
- Cost of distribution to the monitors is independent of the number of monitors
- The configured tree type (unary = list, binary, ...) influences the latency/bandwidth tradeoff
- But: new problems arise with this structure
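The latency/bandwidth tradeoff can be made concrete with a small depth calculation (an illustrative sketch; `treeDepth` is not part of the actual code): for m monitors in a tree of arity k, every monitor uploads to at most k children, while the farthest monitor sits about log_k(m) hops from the root; arity 1 degenerates into a list with m-1 hops.

```cpp
// Number of hops from the root to the deepest monitor in a complete
// tree of the given arity holding 'monitors' nodes. Larger arity means
// more upload bandwidth per node but fewer hops (lower latency).
int treeDepth(int monitors, int arity) {
    if (monitors <= 1) return 0;
    if (arity == 1) return monitors - 1;   // unary tree = list
    int depth = 0;
    long nodes = 1;                         // nodes reachable within 'depth' hops
    long frontier = 1;                      // nodes exactly at 'depth' hops
    while (nodes < monitors) {
        frontier *= arity;
        nodes += frontier;
        ++depth;
    }
    return depth;
}
```

For 1000 monitors, a binary tree needs about 9 hops instead of 999 for a list, at the price of each monitor serving two children instead of one.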
Problem 1: exit of monitors

- Each monitor acts as a sampler for its children
- Exits/crashes of monitors are therefore critical...
- We have to distinguish between different types of exits:
  - leaf monitor: trivial
  - monitor with outdegree > 0: more complicated
  - root monitor: critical
Solutions – Leaf monitor exit

[Diagram: a leaf event monitor leaves the tree formed by the event sampler, the event distributor and the event monitors.]

A trivial operation: just delete the monitor from the tree!
Solutions – Exit of a monitor with outdegree > 0

[Diagram: an inner event monitor with children leaves the tree; its children are reattached.]

More complicated, but the distributor can handle it, as it has knowledge of the whole tree: O(C) complexity, with C being the constant maximum number of children.
Solutions – Root monitor exit

[Diagram: the root event monitor, fed directly by the event sampler, leaves the tree.]

A critical operation, as the sampler is involved, but it can be done transparently for the other monitors; again O(1) complexity.
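The three exit cases can be sketched with a toy tree structure on the distributor side (illustrative only; the real bookkeeping lives inside the distributor): a leaf is simply dropped, an inner node is replaced by one of its children, which adopts the remaining siblings (O(C) re-parenting steps), and a root exit is the same promotion followed by re-pointing the sampler at the new root.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

// Toy model of the distributor's monitor-tree bookkeeping.
struct Tree {
    std::map<int, std::vector<int>> children;  // node id -> child ids
    std::map<int, int> parent;                 // node id -> parent (-1 = root)
    int root = -1;

    void add(int id, int par) {
        parent[id] = par;
        children[id];                          // ensure an entry exists
        if (par == -1) root = id;
        else children[par].push_back(id);
    }

    // Remove 'id' from the tree; returns the promoted node (or -1 for a leaf).
    int remove(int id) {
        auto kids = children[id];
        int par = parent[id];
        int promoted = -1;
        if (!kids.empty()) {
            promoted = kids.front();           // promote the first child
            for (std::size_t i = 1; i < kids.size(); ++i) {
                parent[kids[i]] = promoted;    // siblings become its children
                children[promoted].push_back(kids[i]);
            }
            parent[promoted] = par;
            if (par == -1) root = promoted;    // root exit: re-point the sampler here
        }
        if (par != -1) {                       // detach from the old parent
            auto& pc = children[par];
            pc.erase(std::find(pc.begin(), pc.end(), id));
            if (promoted != -1) pc.push_back(promoted);
        }
        children.erase(id);
        parent.erase(id);
        return promoted;
    }
};
```

The promotion touches at most C children plus one parent link, matching the O(C) (inner node) and O(1) (root, constant sampler re-pointing) bounds on the slides.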
Problem 2: error recovery

- Crash of a sampler:
  - the distributor pings all samplers at reasonable intervals and can notify the monitors about the crash
- Crash of arbitrary monitors:
  - detected like a normal exit, so no problem
- Crash of the distributor:
  - no influence on the ongoing data exchange; just restart it...
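The sampler-crash detection can be sketched as a timeout scan over the last successful pings (names and timings are illustrative; the real distributor pings over the CORBA-based IPC):

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch: given the timestamp of the last successful ping for each
// sampler, report every sampler whose last reply is older than the
// timeout, so that its monitors can be notified of the crash.
std::vector<std::string> findCrashed(
        const std::map<std::string, long>& lastPingOkMs,
        long nowMs, long timeoutMs) {
    std::vector<std::string> crashed;
    for (const auto& [sampler, lastOk] : lastPingOkMs)
        if (nowMs - lastOk > timeoutMs)
            crashed.push_back(sampler);
    return crashed;
}
```

The ping interval bounds how quickly a crash is noticed; since the distributor is outside the data path, the scan adds no run-time cost to event distribution.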
Comparing performance...

Symbols: m = number of monitors, e = number of bytes sampled, C = maximum number of children per monitor, ch_i = number of children of monitor i, s = number of samplers, cr_i = number of criteria in sampler i.

[Table: per-process run-time and init/shutdown complexity, current implementation vs. reimplementation.
- Sampler i, run: O(cr_i * e) in both implementations — each event is sent once per criterion.
- Monitor i, run (reimplementation): ch_i * e + e <= C * e + e, i.e. O(e) — constant in the number of monitors.
- Distributor, run (current): O(m * e) — the bottleneck.
- Distributor, run (reimplementation): 0; only O(m) connection-management work at init & shutdown.]
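As a numeric illustration of the scaling comparison (a sketch using the slide's symbols; the factor 2 for the distributor counts its inbound plus outbound traffic as on the slide): the current distributor moves O(m*e) bytes per run, while a tree node in the reimplementation moves at most (C+1)*e bytes no matter how many monitors there are.

```cpp
// Run-phase transfer volume per process, using the slide's symbols:
// m = number of monitors, e = bytes sampled, C = max children per node.
long distributorBytesCurrent(long m, long e) {
    return 2 * m * e;          // traffic in and out for m monitors: O(m*e)
}
long nodeBytesReimpl(long childCount, long e) {
    return (childCount + 1) * e; // one copy in, at most C copies out: O(e)
}
```

This is the "constant" scalability claimed in the conclusion: adding monitors deepens the tree but never raises the load on any single node.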
ROD profile (10,000 events @ 4 KB)

[Chart: event rate per monitor/sampler (events/s, 0–3000) vs. number of samplers in the partition (1–10), comparing the reimplementation with the current implementation.]
ROD profile (10,000 events @ 4 KB)

[Chart: total data transfer rate (MB/s, 0–120) vs. number of samplers in the partition (1–10), comparing the current implementation with the reimplementation.]
ROS profile (10,000 events @ 30 KB)

[Chart: event rate per monitor/sampler (events/s, 0–400) vs. number of samplers in the partition (1–10), comparing the current implementation with the reimplementation.]
ROS profile (10,000 events @ 30 KB)

[Chart: total data transfer rate (MB/s, 0–110) vs. number of samplers in the partition (1–10), comparing the current implementation with the reimplementation.]
EB profile (100 events @ 2 MB)

[Chart: event rate per monitor/sampler (events/s, 0–6) vs. number of samplers in the partition (1–10), comparing the current implementation with the reimplementation.]
EB profile (100 events @ 2 MB)

[Chart: total data transfer rate (MB/s, 0–110) vs. number of samplers in the partition (1–10), comparing the current implementation with the reimplementation.]
Conclusion

- The reimplementation fulfills all needs
- Improved speed
- As seen: optimal scalability (constant!)
- Enhanced error recovery
- Configurable tradeoff between latency and CPU/bandwidth requirements (tree type: unary, binary, ...)
- Users do not need to care about thread management