Flexible data transport for the online reconstruction of FAIR experiments Mohammad Al-Turany Dennis...
If you can't read please download the document
Flexible data transport for the online reconstruction of FAIR experiments Mohammad Al-Turany Dennis Klein Alexey Rybalchenko (GSI-IT) 5/17/13M. Al-Turany,
Flexible data transport for the online reconstruction of FAIR
experiments Mohammad Al-Turany Dennis Klein Alexey Rybalchenko
(GSI-IT) 5/17/13M. Al-Turany, ACAT 2013, Beijing1
Slide 2
Contents Introduction to FAIR o CBM and Panda Experiments Why
to use Message Queues Current status Results Summary 5/17/13M.
Al-Turany, ACAT 2013, Beijing2
Slide 3
FAIR Darmstadt (Germany) GSI Facility for Antiproton and Ion
Research Observer: EU, United States, Hungary, Georgia, Saudi
Arabia 5/17/13M. Al-Turany, ACAT 2013, Beijing3
Slide 4
5/17/13M. Al-Turany, ACAT 2013, Beijing4
Slide 5
5/17/135 ~400 Physicists 50 Institutes 15 countries ~400
Physicists 50 Institutes 15 countries Physics program: Equation of
state at high baryonic density QCD phase transition Critical end
point of QCD phase diagram Restoration of Chiral Symmetry Physics
program: Equation of state at high baryonic density QCD phase
transition Critical end point of QCD phase diagram Restoration of
Chiral Symmetry M. Al-Turany, ACAT 2013, Beijing
Slide 6
Anti-Proton ANnihilation at DArmstadt 5/17/136 Physics
Programm: Charmonium Spectroscopy Search for Glueballs &
Hebrides Hyper nucleus Drell-Yan Physics Electromagnetic form
factors Rare decays Physics Programm: Charmonium Spectroscopy
Search for Glueballs & Hebrides Hyper nucleus Drell-Yan Physics
Electromagnetic form factors Rare decays Data rate ~10 7 Event/s
Data volume ~10 PB/yr ~450 Physicists 52 Institutes 17 Countries
~450 Physicists 52 Institutes 17 Countries M. Al-Turany, ACAT 2013,
Beijing
Slide 7
The Online Reconstruction and analysis 5/17/13 300 GB/s 20M
Evt/s 300 GB/s 20M Evt/s < 1 GB/s 25K Evt/s < 1 GB/s 25K
Evt/s How to distribute the processes? How to manage the data flow?
How to recover processes when they crash? How to monitor the whole
system? How to distribute the processes? How to manage the data
flow? How to recover processes when they crash? How to monitor the
whole system? 1 TB/s 1 GB/s > 60 000 CPU-core or Equivalent GPU,
FPGA, > 60 000 CPU-core or Equivalent GPU, FPGA, M. Al-Turany,
ACAT 2013, Beijing7
Slide 8
Trigger-less Data Acquisition 5/17/13M. Al-Turany, ACAT 2013,
Beijing8
Slide 9
Design constrains Highly flexible: o different data paths
should be modeled. Adaptive: o Sub-systems are continuously under
development and improvement Should works for simulated and real
data: o developing and debugging the algorithms It should support
all possible hardware where the algorithms could run (CPU, GPU,
FPGA) It has to scale to any size! With minimum or ideally no
effort. 5/17/13M. Al-Turany, ACAT 2013, Beijing9
Slide 10
Before Re-inventing the Wheel What is available on the market
and in the community? o A very promising package: ZeroMQ is
available since 2 years Do we intend to separate online and
offline? NO Multithreaded concept or a message queue based one? o
Message based systems allow us to decouple producers from
consumers. o We can spread the work to be done over several
processes and machines. o We can manage/upgrade/move around
programs (processes) independently of each other. 5/17/13M.
Al-Turany, ACAT 2013, Beijing10
Slide 11
A socket library that acts as a concurrency framework. Carries
messages across inproc, IPC, TCP, and multicast. Connect N-to-N via
fanout, pubsub, pipeline, request-reply. Asynch I/O for scalable
multicore message-passing apps. 30+ languages including C, C++,
Java,.NET, Python. Most OSs including Linux, Windows, OS X,
PPC405/PPC440. Large and active open source community. LGPL free
software with full commercial support from iMatix. 5/17/1311M.
Al-Turany, ACAT 2013, Beijing
Slide 12
Why MQ? A messaging library, which allows you to design a
complex communication system without much effort Presents
abstraction on higher level than MPI programming model is easier Is
suitable for loosely coupled and more general distributed systems
5/17/13M. Al-Turany, ACAT 2013, Beijing12
Slide 13
Zero in MQ Originally the zero in MQ was meant as "zero broker"
and (as close to) "zero latency" (as possible). In the meantime it
has come to cover different goals: zero administration, zero cost,
zero waste. More generally, "zero refers to the culture of
minimalism that permeates the project. Adding power by removing
complexity rather than exposing new functionality. 5/17/13M.
Al-Turany, ACAT 2013, Beijing13
Slide 14
ZeroMQ sockets provide efficient transport options Inter-thread
Inter-process Inter-node which is really just inter- process across
nodes communication 5/17/13 PMG : Pragmatic General Multicast (a
reliable multicast protocol) Named Pipe: Piece of random access
memory (RAM) managed by the operating system and exposed to
programs through a file descriptor and a named mount point in the
file system. It behaves as a first in first out (FIFO) buffer PMG :
Pragmatic General Multicast (a reliable multicast protocol) Named
Pipe: Piece of random access memory (RAM) managed by the operating
system and exposed to programs through a file descriptor and a
named mount point in the file system. It behaves as a first in
first out (FIFO) buffer M. Al-Turany, ACAT 2013, Beijing14
Slide 15
The built-in core MQ patterns are: Request-reply, which
connects a set of clients to a set of services. (remote procedure
call and task distribution pattern) Publish-subscribe, which
connects a set of publishers to a set of subscribers. (data
distribution pattern) Pipeline, which connects nodes in a fan-out /
fan-in pattern that can have multiple steps, and loops. (Parallel
task distribution and collection pattern) Exclusive pair, which
connect two sockets exclusively 5/17/13M. Al-Turany, ACAT 2013,
Beijing15
Slide 16
Current Status The Framework deliver some components which can
be connected to each other in order to construct a processing
pipeline(s). All component share a common base called Device
(ZeroMQ Class). Devices are grouped by three categories: o Source:
Sampler o Message-based Processor: Sink,
BalancedStandaloneSplitter, StandaloneMerger, Buffer o
Content-based Processor: Processor 5/17/13M. Al-Turany, ACAT 2013,
Beijing16
Slide 17
Design 5/17/13 Experiment/dete ctor specific code Framework
classes that can be used directly M. Al-Turany, ACAT 2013,
Beijing17
Slide 18
Example for CBM online reconstruction hierarchy 5/17/13 STS
data Mvd data Clusterer Tracker Global Event Reconstruction Trigger
event selection TRD data TOF data Clusterer Pre-Processing TRD
Tracker M. Al-Turany, ACAT 2013, Beijing18
Slide 19
Example for CBM online reconstruction hierarchy Simulation
5/17/13 STS Sampler Mvd Sampler Clusterer Clustrer Tracker Global
Event Reconstruction Trigger event selection TRD Sampler TOF
Sampler Clustrer Pre-Procceing TRD Tracker M. Al-Turany, ACAT 2013,
Beijing19
Slide 20
Message-based Processor All message-based processors inherit
from Device and operate on messages without interpreting their
content. Four message-based processors have been implemented so far
5/17/13M. Al-Turany, ACAT 2013, Beijing20
Slide 21
5/17/13 TRD data Clusterer TRD Tracker TRD data
FairMQBalancedStandaloneSplitter Clustrer FairMQStandaloneMerger
Tracker Example for Fan-out/Fan-in the data path for load balancing
M. Al-Turany, ACAT 2013, Beijing21
Slide 22
Content-based Processor The Processor device has at least one
input and one output socket. A task is meant for accessing and
potentially changing the message content. 5/17/13M. Al-Turany, ACAT
2013, Beijing22
Slide 23
Detector Simulation Detector Simulation Example for CBM online
reconstruction hierarchy (scenario) STS data Mvd data Clusterer REQ
REP Tracker REQ SUB PUB SUB Parameter database PUB SUB PUB SUB REP
5/17/13M. Al-Turany, ACAT 2013, Beijing23
Slide 24
Results Throughput of 940 Mbit/s was measured which is very
close to the theoretical limit of the TCP/IPv4/GigabitEthernet The
throughput for the named pipe transport between two devices on one
node has been measured around 1.7 GB/s 5/17/13M. Al-Turany, ACAT
2013, Beijing24 Each message consists of digits in one panda event
for one detector, with size of few kBytes
Slide 25
Payload in Mbyte/s as function of message size 5/17/13M.
Al-Turany, ACAT 2013, Beijing25 ZeroMQ works on InfiniBand but
using IP over IB
Slide 26
Native InfiniBand/RDMA is faster than IP over IB 5/17/13M.
Al-Turany, ACAT 2013, Beijing26 Implementing ZeroMQ over IB verbs
will improve the performance.
Summary ZeroMQ communication layer is integrated into our
offline framework (FairRoot) On the short term we will keep both
options ROOT based event loop and concurrent processes
communicating with each other via ZeroMQ. On long Term we are
moving away from single event loop to distributed processes. Thanks
you ! 5/17/13M. Al-Turany, ACAT 2013, Beijing28
Slide 29
Device Each processing stage of a pipeline is occupied by a
process which executes an instance of the Device class 5/17/13M.
Al-Turany, ACAT 2013, Beijing29
Slide 30
Sampler Devices with no inputs are categorized as sources A
sampler loops (optionally: infinitely) over the loaded events and
send them through the output socket. A variable event rate limiter
has been implemented to control the sending speed 5/17/13M.
Al-Turany, ACAT 2013, Beijing30
Slide 31
Message format (Protocol) Potentially any content-based
processor or any source can change the application protocol.
Therefore, the framework provides a generic Message class that
works with any arbitrary and continuous junk of memory
(FairMQMessage). One has to pass a pointer to the memory buffer,
the size in bytes, and can optionally pass a function pointer to a
destructor, which will be called once the message object is
discarded. 5/17/13M. Al-Turany, ACAT 2013, Beijing31
Slide 32
New simple classes without ROOT are used in the Sampler (This
enable us to use non-ROOT clients) and reduce the messages size.
5/17/13M. Al-Turany, ACAT 2013, Beijing32