26
TDAQ TDAQ A T L A S Reimplementation of the Reimplementation of the ATLAS Online Event ATLAS Online Event Monitoring Subsystem Monitoring Subsystem Ingo Scholtes Summer Student University of Trier Supervisor: Serguei Kolos

TDAQ ATLAS Reimplementation of the ATLAS Online Event Monitoring Subsystem Ingo Scholtes Summer Student University of Trier Supervisor: Serguei Kolos

Embed Size (px)

Citation preview

TDAQTDAQ

AA

TT

LL

AA

SS

Reimplementation of the ATLAS Reimplementation of the ATLAS Online Event Monitoring Online Event Monitoring SubsystemSubsystem

Ingo ScholtesSummer StudentUniversity of Trier

Supervisor: Serguei Kolos

TDAQ

Ingo Scholtes - Summer Student, University of Trier

2

AA

TT

LL

AA

SS

OutlineOutline Online Monitoring Current implementation and its drawbacks My reimplementation Performance comparison Conclusion

TDAQ

Ingo Scholtes - Summer Student, University of Trier

3

AA

TT

LL

AA

SS

Getting the context: Atlas Online Getting the context: Atlas Online SoftwareSoftware

System of the Atlas Trigger DAQ Project Main purpose: configure, control and monitor data acquisition system Provides a GUI, which allows to control the data acquisition system “Glue” of several TDAQ sub-systems Open Source project

Configuration Monitoring

Control Tutorial

Online SoftwareOnline Software

<<subsystem>>

Information Service

<<subsystem>>

HistogramDistribution

<<subsystem>>

ErrorReporting

<<subsystem>><<subsystem>>MonitoringMonitoring

<<subsystem>>

Event SampleMonitoring

TDAQ

Ingo Scholtes - Summer Student, University of Trier

4

AA

TT

LL

AA

SS

Criteria D

Criteria D

Criteria A

Criteria B

Criteria A

A brief overview of the ATLAS Online A brief overview of the ATLAS Online Event Monitoring SubsystemEvent Monitoring Subsystem

Detector Electronics

Read-Out-Driver (ROD)

ROD

Read-Out-System(ROS)

Event Sampler

Monitor Task

Monitor Task

Monitor Task

Monitor Task

Event Sampler

Event Builder

ROS

ROD ROD

USERCODE

USERCODE

USERCODE

USERCODE

USERCODE

USERCODE

A+B

D D

CDA+B

A+BC

PhysicistsPhysicists

DataFlowDataFlow

Event SamplerUSERCODE

D

DD

TDAQ

Ingo Scholtes - Summer Student, University of Trier

5

AA

TT

LL

AA

SS

The current implementation in The current implementation in detaildetail

Monitor TaskDistributor

SamplerDetector1/Crate1

Buffer

BufferSampler

Detector1/Crate2

Monitor Task

Monitor Task

Machine AMachine B

Machine C

Machine D

Machine E

Machine F

SELECTDetector1/Crate1

START

ITERATOR

ITERATOR

SELECTDetector1/Crate1

SELECTDetector1/Crate2

START

ITERATOR

TDAQ

Ingo Scholtes - Summer Student, University of Trier

6

AA

TT

LL

AA

SS

Drawbacks…Drawbacks… Scalability problem due to bottleneck in

machine A Monitors will not notice if sampler

crashed They just stop receiving events…

Users will have to worry about thread management in sampler Start thread on StartSampling Cleanly exit it on StopSampling Often causes problems

TDAQ

Ingo Scholtes - Summer Student, University of Trier

7

AA

TT

LL

AA

SS

……and what we learn from and what we learn from themthem

Core of scalability problem: central distributor

Bottleneck due to… routing of events through central

distributor multiple distribution of identical

events to different monitors

TDAQ

Ingo Scholtes - Summer Student, University of Trier

8

AA

TT

LL

AA

SS

Implementation Implementation requirementsrequirements

Platform independent C++ Using Online Monitoring IPC based

on CORBA (omniORB 4) minimal and deterministic effect on

the data flow system performance High scalability Get rid of all drawbacks… ;-)

TDAQ

Ingo Scholtes - Summer Student, University of Trier

9

AA

TT

LL

AA

SS

What is really crucial?What is really crucial? Sampler has to decide about criteria

saves a lot of bandwidth Sampler has to send each event once

(per selection criteria) Distributor necessary to protect sampler

from inrushing monitors (gatekeeper function)

TDAQ

Ingo Scholtes - Summer Student, University of Trier

10

AA

TT

LL

AA

SS

Basic improvement ideasBasic improvement ideas Get rid of distributor for communication P2P Moving load to monitors for means of scalability

Current: share bandwidth, accumulate load Idea: share load, accumulate bandwidth ;-) Distributor only for connection management and error

recovery Keeping only crucial things in sampler

Criteria decisions One-time sending of each event ( at least one

connection per sampler/criteria) Sampler thread management

Start sampling thread with first subscription End sampling thread with loss of last subscription User code not aware of threads

TDAQ

Ingo Scholtes - Summer Student, University of Trier

11

AA

TT

LL

AA

SS

Getting rid of the distributor bottleneckGetting rid of the distributor bottleneck can be obtained by using P2P paradigm

One monitor per sampler

Multiple monitors per sampler?

EM

ES

ES

EM ES

ESEM

EM

We need a way to prevent bottleneck here!

EM

ESEM

EM

ES

EM

EM

EM

TDAQ

Ingo Scholtes - Summer Student, University of Trier

12

AA

TT

LL

AA

SS

Introducing the monitor Introducing the monitor treetree

Monitor

Monitor

Monitor

Monitor

Monitor

Monitor

Monitor

Distributor

Sampler

Bartering time/bandwidth requirements with latencyCosts of distribution to monitors independent of number of monitorsConfigured type (unary=list, binary, …) influences latency/bandwidth tradeoffBut: new problems arise with this structure

Example: Binary Monitor tree

TDAQ

Ingo Scholtes - Summer Student, University of Trier

13

AA

TT

LL

AA

SS

Problem 1: exit of Problem 1: exit of monitorsmonitors Each monitor acts as a sampler for his

children Exits/crashes of monitors critical… We have to distinguish between

different types of exits Leaf monitor trivial Monitor with outdegree > 0 more

complicated Root monitor critical

TDAQ

Ingo Scholtes - Summer Student, University of Trier

14

AA

TT

LL

AA

SS

Solutions – Leaf monitor exitSolutions – Leaf monitor exit

Event Distributor

Event Monitor

EventMonitor

EventMonitor

Event Sampler

Trivial operation, just delete the monitor from the tree!

TDAQ

Ingo Scholtes - Summer Student, University of Trier

15

AA

TT

LL

AA

SS

EventMonitor

Solutions – Monitor with outdegree > 0 Solutions – Monitor with outdegree > 0 exitsexits

Event Distributor

Event Monitor

EventMonitor

EventMonitor

Event Sampler

EventMonitor

More complicated, but the distributor can do it, as he hasknowledge of the whole tree, O(C) complexity with C beingconstant maximum number of children

TDAQ

Ingo Scholtes - Summer Student, University of Trier

16

AA

TT

LL

AA

SS

Solutions – Root monitor exitSolutions – Root monitor exit

Event Distributor

Event Monitor

EventMonitor

EventMonitor

Event Sampler

Critical operation, as sampler is involved, but possible to do it transparently for other

monitors, again O(1) complexity

TDAQ

Ingo Scholtes - Summer Student, University of Trier

17

AA

TT

LL

AA

SS

Problem 2: error recoveryProblem 2: error recovery Crash of sampler

Distributor pings all samplers in reasonable intervals can notify monitors about crash

Crash of arbitrary monitors Detected like normal exit! no problem

Crash of distributor No influence on ongoing data exchange Just restart…

TDAQ

Ingo Scholtes - Summer Student, University of Trier

18

AA

TT

LL

AA

SS

Comparing performance…Current implementation

Reimplementation

Sampler i

Monitor i

Distributor

)(2 eOecheCch

ii

constant)(2 eOe

)(2

)(2

1

:

emOemcreme

mOm

m

s

ii

:Run

Shutdown&Init

0

)(2

:Run

:Shutdown&Init mOm

monitors #

bytes sampled #

onitorchildren/m max.

imonitor ofchildren #

samplers #

isampler in criteria#

m

e

C

ch

s

cr

i

i

)(2 iii creOcrecr )(2 iii creOcrecr

TDAQ

Ingo Scholtes - Summer Student, University of Trier

19

AA

TT

LL

AA

SS

ROD Profile ROD Profile (10.000 Events @ 4K)(10.000 Events @ 4K)

event rate per monitor/samplerROD Profile (4K/event)

0

500

1000

1500

2000

2500

3000

1 2 3 4 5 6 7 8 9 10

Samplers in partition

Ev

en

ts/s

Reimplementation

Current implementation

TDAQ

Ingo Scholtes - Summer Student, University of Trier

20

AA

TT

LL

AA

SS

ROD Profile ROD Profile (10.000 Events @ 4K)(10.000 Events @ 4K)

total data transfer rateROD profile (4K/event)

0

10

20

30

40

50

60

70

80

90

100

110

120

1 2 3 4 5 6 7 8 9 10

Samplers in partition

MB

/s

Current implementation

Reimplementation

TDAQ

Ingo Scholtes - Summer Student, University of Trier

21

AA

TT

LL

AA

SS

ROS profile ROS profile (10.000 Events @ 30 K)(10.000 Events @ 30 K)

event rate per monitor/samplerROS Profile (30K/event)

0

50

100

150

200

250

300

350

400

1 2 3 4 5 6 7 8 9 10

Samplers in partition

Ev

en

ts/s

Current implementation

Reimplementation

TDAQ

Ingo Scholtes - Summer Student, University of Trier

22

AA

TT

LL

AA

SS

ROS profile ROS profile (10.000 Events @ 30 K)(10.000 Events @ 30 K)

total data transfer rateROS profile (30K/event)

0

10

20

30

40

50

60

70

80

90

100

110

1 2 3 4 5 6 7 8 9 10

Samplers in partition

MB

/s

Current implementation

Reimplementation

TDAQ

Ingo Scholtes - Summer Student, University of Trier

23

AA

TT

LL

AA

SS

EB profile EB profile (100 Events @ 2MB)(100 Events @ 2MB)

event rate per monitor/samplerEB Profile (2MB/event)

0

1

2

3

4

5

6

1 2 3 4 5 6 7 8 9 10

Samplers in partition

Ev

en

ts/s

Current implementation

Reimplementation

TDAQ

Ingo Scholtes - Summer Student, University of Trier

24

AA

TT

LL

AA

SS

EB profile EB profile (100 Events @ 2MB)(100 Events @ 2MB)

total data transfer rateEB Profile (2MB/Event)

0

10

20

30

40

50

60

70

80

90

100

110

1 2 3 4 5 6 7 8 9 10

Samplers in partition

MB

/s

Current implementation

Reimplementation

TDAQ

Ingo Scholtes - Summer Student, University of Trier

25

AA

TT

LL

AA

SS

ConclusionConclusion

Reimplementation fulfills all needs Improved speed As seen: Optimal scalability (constant!) Enhanced error recovery Configurable tradeoff between latency

and CPU/bandwidth requirements (tree type unary, binary, …)

Users do not need to care about thread management

TDAQ

Ingo Scholtes - Summer Student, University of Trier

26

AA

TT

LL

AA

SS

Thanks…Thanks… …for your

attention! Questions? Criticism?