Flexible data transport for the online reconstruction of FAIR experiments Mohammad Al-Turany Dennis Klein Alexey Rybalchenko (GSI-IT) 5/17/13M. Al-Turany,

Flexible data transport for the online reconstruction of FAIR experiments Mohammad Al-Turany Dennis Klein Alexey Rybalchenko (GSI-IT) 5/17/13M. Al-Turany, ACAT 2013, Beijing1

Contents Introduction to FAIR o CBM and Panda Experiments Why to use Message Queues Current status Results Summary 5/17/13M. Al-Turany, ACAT 2013, Beijing2

FAIR Darmstadt (Germany) GSI Facility for Antiproton and Ion Research Observer: EU, United States, Hungary, Georgia, Saudi Arabia 5/17/13M. Al-Turany, ACAT 2013, Beijing3

5/17/13M. Al-Turany, ACAT 2013, Beijing4

5/17/135 ~400 Physicists 50 Institutes 15 countries ~400 Physicists 50 Institutes 15 countries Physics program: Equation of state at high baryonic density QCD phase transition Critical end point of QCD phase diagram Restoration of Chiral Symmetry Physics program: Equation of state at high baryonic density QCD phase transition Critical end point of QCD phase diagram Restoration of Chiral Symmetry M. Al-Turany, ACAT 2013, Beijing

Anti-Proton ANnihilation at DArmstadt 5/17/136 Physics Programm: Charmonium Spectroscopy Search for Glueballs & Hebrides Hyper nucleus Drell-Yan Physics Electromagnetic form factors Rare decays Physics Programm: Charmonium Spectroscopy Search for Glueballs & Hebrides Hyper nucleus Drell-Yan Physics Electromagnetic form factors Rare decays Data rate ~10 7 Event/s Data volume ~10 PB/yr ~450 Physicists 52 Institutes 17 Countries ~450 Physicists 52 Institutes 17 Countries M. Al-Turany, ACAT 2013, Beijing

The Online Reconstruction and analysis 5/17/13 300 GB/s 20M Evt/s 300 GB/s 20M Evt/s < 1 GB/s 25K Evt/s < 1 GB/s 25K Evt/s How to distribute the processes? How to manage the data flow? How to recover processes when they crash? How to monitor the whole system? How to distribute the processes? How to manage the data flow? How to recover processes when they crash? How to monitor the whole system? 1 TB/s 1 GB/s > 60 000 CPU-core or Equivalent GPU, FPGA, > 60 000 CPU-core or Equivalent GPU, FPGA, M. Al-Turany, ACAT 2013, Beijing7

Trigger-less Data Acquisition 5/17/13M. Al-Turany, ACAT 2013, Beijing8

Design constrains Highly flexible: o different data paths should be modeled. Adaptive: o Sub-systems are continuously under development and improvement Should works for simulated and real data: o developing and debugging the algorithms It should support all possible hardware where the algorithms could run (CPU, GPU, FPGA) It has to scale to any size! With minimum or ideally no effort. 5/17/13M. Al-Turany, ACAT 2013, Beijing9

Before Re-inventing the Wheel What is available on the market and in the community? o A very promising package: ZeroMQ is available since 2 years Do we intend to separate online and offline? NO Multithreaded concept or a message queue based one? o Message based systems allow us to decouple producers from consumers. o We can spread the work to be done over several processes and machines. o We can manage/upgrade/move around programs (processes) independently of each other. 5/17/13M. Al-Turany, ACAT 2013, Beijing10

A socket library that acts as a concurrency framework. Carries messages across inproc, IPC, TCP, and multicast. Connect N-to-N via fanout, pubsub, pipeline, request-reply. Asynch I/O for scalable multicore message-passing apps. 30+ languages including C, C++, Java,.NET, Python. Most OSs including Linux, Windows, OS X, PPC405/PPC440. Large and active open source community. LGPL free software with full commercial support from iMatix. 5/17/1311M. Al-Turany, ACAT 2013, Beijing

Why MQ? A messaging library, which allows you to design a complex communication system without much effort Presents abstraction on higher level than MPI programming model is easier Is suitable for loosely coupled and more general distributed systems 5/17/13M. Al-Turany, ACAT 2013, Beijing12

Zero in MQ Originally the zero in MQ was meant as "zero broker" and (as close to) "zero latency" (as possible). In the meantime it has come to cover different goals: zero administration, zero cost, zero waste. More generally, "zero refers to the culture of minimalism that permeates the project. Adding power by removing complexity rather than exposing new functionality. 5/17/13M. Al-Turany, ACAT 2013, Beijing13

ZeroMQ sockets provide efficient transport options Inter-thread Inter-process Inter-node which is really just inter- process across nodes communication 5/17/13 PMG : Pragmatic General Multicast (a reliable multicast protocol) Named Pipe: Piece of random access memory (RAM) managed by the operating system and exposed to programs through a file descriptor and a named mount point in the file system. It behaves as a first in first out (FIFO) buffer PMG : Pragmatic General Multicast (a reliable multicast protocol) Named Pipe: Piece of random access memory (RAM) managed by the operating system and exposed to programs through a file descriptor and a named mount point in the file system. It behaves as a first in first out (FIFO) buffer M. Al-Turany, ACAT 2013, Beijing14

The built-in core MQ patterns are: Request-reply, which connects a set of clients to a set of services. (remote procedure call and task distribution pattern) Publish-subscribe, which connects a set of publishers to a set of subscribers. (data distribution pattern) Pipeline, which connects nodes in a fan-out / fan-in pattern that can have multiple steps, and loops. (Parallel task distribution and collection pattern) Exclusive pair, which connect two sockets exclusively 5/17/13M. Al-Turany, ACAT 2013, Beijing15

Current Status The Framework deliver some components which can be connected to each other in order to construct a processing pipeline(s). All component share a common base called Device (ZeroMQ Class). Devices are grouped by three categories: o Source: Sampler o Message-based Processor: Sink, BalancedStandaloneSplitter, StandaloneMerger, Buffer o Content-based Processor: Processor 5/17/13M. Al-Turany, ACAT 2013, Beijing16

Design 5/17/13 Experiment/dete ctor specific code Framework classes that can be used directly M. Al-Turany, ACAT 2013, Beijing17

Example for CBM online reconstruction hierarchy 5/17/13 STS data Mvd data Clusterer Tracker Global Event Reconstruction Trigger event selection TRD data TOF data Clusterer Pre-Processing TRD Tracker M. Al-Turany, ACAT 2013, Beijing18

Example for CBM online reconstruction hierarchy Simulation 5/17/13 STS Sampler Mvd Sampler Clusterer Clustrer Tracker Global Event Reconstruction Trigger event selection TRD Sampler TOF Sampler Clustrer Pre-Procceing TRD Tracker M. Al-Turany, ACAT 2013, Beijing19

Message-based Processor All message-based processors inherit from Device and operate on messages without interpreting their content. Four message-based processors have been implemented so far 5/17/13M. Al-Turany, ACAT 2013, Beijing20

5/17/13 TRD data Clusterer TRD Tracker TRD data FairMQBalancedStandaloneSplitter Clustrer FairMQStandaloneMerger Tracker Example for Fan-out/Fan-in the data path for load balancing M. Al-Turany, ACAT 2013, Beijing21

Content-based Processor The Processor device has at least one input and one output socket. A task is meant for accessing and potentially changing the message content. 5/17/13M. Al-Turany, ACAT 2013, Beijing22

Detector Simulation Detector Simulation Example for CBM online reconstruction hierarchy (scenario) STS data Mvd data Clusterer REQ REP Tracker REQ SUB PUB SUB Parameter database PUB SUB PUB SUB REP 5/17/13M. Al-Turany, ACAT 2013, Beijing23

Results Throughput of 940 Mbit/s was measured which is very close to the theoretical limit of the TCP/IPv4/GigabitEthernet The throughput for the named pipe transport between two devices on one node has been measured around 1.7 GB/s 5/17/13M. Al-Turany, ACAT 2013, Beijing24 Each message consists of digits in one panda event for one detector, with size of few kBytes

Payload in Mbyte/s as function of message size 5/17/13M. Al-Turany, ACAT 2013, Beijing25 ZeroMQ works on InfiniBand but using IP over IB

Native InfiniBand/RDMA is faster than IP over IB 5/17/13M. Al-Turany, ACAT 2013, Beijing26 Implementing ZeroMQ over IB verbs will improve the performance.

ZeroMQ Root (Event loop) 5/17/13 FairRootManager FairRunAna FairTasks Init() Re-Init() Exec() Finish() FairTasks Init() Re-Init() Exec() Finish() FairMQProcessorTask Init() Re-Init() Exec() Finish() FairMQProcessorTask Init() Re-Init() Exec() Finish() ROOT Files, Lmd Files, Remote event server, Integrating the existing software: M. Al-Turany, ACAT 2013, Beijing27

Summary ZeroMQ communication layer is integrated into our offline framework (FairRoot) On the short term we will keep both options ROOT based event loop and concurrent processes communicating with each other via ZeroMQ. On long Term we are moving away from single event loop to distributed processes. Thanks you ! 5/17/13M. Al-Turany, ACAT 2013, Beijing28

Device Each processing stage of a pipeline is occupied by a process which executes an instance of the Device class 5/17/13M. Al-Turany, ACAT 2013, Beijing29

Sampler Devices with no inputs are categorized as sources A sampler loops (optionally: infinitely) over the loaded events and send them through the output socket. A variable event rate limiter has been implemented to control the sending speed 5/17/13M. Al-Turany, ACAT 2013, Beijing30

Message format (Protocol) Potentially any content-based processor or any source can change the application protocol. Therefore, the framework provides a generic Message class that works with any arbitrary and continuous junk of memory (FairMQMessage). One has to pass a pointer to the memory buffer, the size in bytes, and can optionally pass a function pointer to a destructor, which will be called once the message object is discarded. 5/17/13M. Al-Turany, ACAT 2013, Beijing31

New simple classes without ROOT are used in the Sampler (This enable us to use non-ROOT clients) and reduce the messages size. 5/17/13M. Al-Turany, ACAT 2013, Beijing32

Documents

Flexible data transport for the online reconstruction of FAIR experiments Mohammad Al-Turany Dennis Klein Alexey Rybalchenko (GSI-IT) 5/17/13M. Al-Turany,