Martin L Purschke CHEP 2003
Online and Offline Computing systems in the PHENIX experiment
Martin L. Purschke, for the PHENIX collaboration
Physics Department, Brookhaven National Laboratory
Outline:
• PHENIX at a glance
• some numbers (we finished our d-Au run yesterday night)
• server/OS/network transitions
• DAQ performance
• data management
• software frameworks
• the future direction
PHENIX
PHENIX is one of the big experiments at RHIC.
PHENIX in numbers
4 spectrometer arms
12 Detector subsystems
350,000 detector channels
Lots of readout electronics
Event size typically 90 KB
Data rate 1.2-1.6 kHz, 2 kHz observed
100-130 MB/s typical data rate
Expected duty cycle ~50% -> 4 TB/day
We saw (as of Thursday noon) almost exactly 40TB of beam data (not counting calibrations etc)
We store the data in STK Tape Silos with HPSS
Finished the RHIC d-Au run yesterday around 8pm EST
Have ~5 weeks of beam studies for pp running
then 3 weeks of polarized protons.
Run ‘03 highlights
• Replaced essentially all previous Solaris DAQ machines with Linux PCs
• Maturity of Orbix/Iona made that possible
• Went to Gigabit for all the core DAQ machines
• Tripled the CPU power in the countinghouse to perform online calibrations and analysis
• Quadrupled our data logging bandwidth throughout
• Boosted our connectivity to the HPSS storage system to 2x1000 Mbit/s (190 MB/s observed)
Data flow
[Diagram: 26 SubEvent Buffers (SEBs, each seeing parts of the event) feed an ATM crossbar switch, which sends data to ~23 Assembly & Trigger Processors (ATPs, each seeing whole events); the ATPs send the assembled events to the buffer boxes, and from there to HPSS. Each buffer box has 2 TB of disk and buffers approximately 6 hours' worth of data before they are shipped off to HPSS.]
Event Builder
[Diagram: one event starts with its parts distributed over all SEBs; the Event Builder routes all parts of a given event through the ATM crossbar switch to one ATP, where the whole event is assembled.]
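As a toy illustration of this fan-in, the sketch below assembles event fragments that are distributed over all SEBs at a single ATP chosen per event. The names, the round-robin assignment, and the fragment format are hypothetical stand-ins, not the real Event Builder code:

```python
from collections import defaultdict

N_SEBS, N_ATPS = 26, 23

def assign_atp(event_id):
    """Each event goes to exactly one ATP; simple round-robin here,
    standing in for the real Event Builder's scheduling."""
    return event_id % N_ATPS

def build_events(fragments):
    """fragments: (event_id, seb_id, payload) tuples, one per SEB and
    event. Returns, per ATP, the fully assembled events it sees."""
    atps = defaultdict(dict)
    for event_id, seb_id, payload in fragments:
        atps[assign_atp(event_id)].setdefault(event_id, {})[seb_id] = payload
    return atps

# Two events, each with a part on every SEB:
frags = [(ev, seb, b"data") for ev in (7, 8) for seb in range(N_SEBS)]
atps = build_events(frags)
assert assign_atp(7) != assign_atp(8)          # different ATPs
assert len(atps[assign_atp(7)][7]) == N_SEBS   # whole event at one ATP
```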
Data transfer
Most of the transfer workload is carried by the buffer boxes. Each has two 1 TB SCSI disk arrays and dual Gigabit NICs.

[Diagram: the ATPs send data through a FORE switch to the two buffer boxes and their disk arrays, and from there to HPSS.]
The Buffer Boxes
Back in 2000, we started out with a SUN Enterprise system -- and got 8 MB/s despite our best efforts.
We had what the experts had previously recommended, only to learn that another $45K in hardware and licenses would finally get us what we wanted - perhaps.
Also, with SUN/SGI/IBM/… there is a byte-swapping issue between the buffer box and the Intel-based Event Builder.
In 2001, we went with $5K Linux PCs, and never looked back.
In Fall 2002 we upgraded to the then-latest dual 2.4 GHz E7500 machines (SuperMicro).
Each one has twelve 160 GB Seagate disks, two 1 TB file systems (ext3, software RAID 0 striped across the 12 disks, 2 SCSI controllers).
iozone gives disk write speeds of 150-170 MB/s for file sizes in our ballpark.
Network throughput is about 85 MB/s in from the ATPs (regular network) and 100 MB/s out to HPSS (different network, Jumbo frames).
Throughput stripcharts
This measures the data flow in MB/s through the NICs of each buffer box. Red is the flow from the DAQ. This is (towards the end of a RHIC store) a modest rate, 2x35 MB/s, mostly driven by the trigger rate.

[Stripchart: throughput in MB/s vs. time over 50 minutes for the 1st and 2nd buffer box.]
We can do better...
A few minutes later...
This is the end of the store shown before. At the end, RHIC lost control of the beam and sprayed our detectors, causing a surge in triggers...
We ended the run, and started the transfer to HPSS
We see about 75 + 105 MB/s throughput.
The connection is not entirely symmetric; one path goes through one additional switch - that's the difference.
A little detour: E.T.
In 1999 or so, the ET software was re-engineered and improved at Jefferson Lab (Carl Timmer et al.) from the "DD" software previously developed by us - a nice technology-transfer project.
It’s the workhorse system for accessing data online in real-time.
Allows a virtually unlimited number of clients to retrieve data according to an event profile (like the label on a FedEx envelope).
Is network-transparent
Data can be “piped” from pool to pool, so subsystem processes can locally funnel events to multiple processes.
Supports all data access modes you can think of and then some - profile-select, load balancing shared, non-exclusive shared, ratio, guaranteed delivery, etc, etc. Very versatile.
Data Flow (old)
Until recently, we transferred the data through an “ET” pool on the buffer boxes, which was our way to merge the data streams from the ATP’s.
The logger would read the data from the pool, and write them to disk
A separate secondary monitoring pool is fed from the primary with about 20% of the data for online monitoring purposes
[Diagram: event receivers on the buffer box feed data from the ATPs into a master ET pool; the logger reads from the pool and writes to the disk array; a monitoring ET pool on a dedicated machine is fed from the primary pool.]
E.T. has served us well, but with 4x the speed it couldn’t keep up. It’s still used for the online monitoring now.
Buffers, not Events
[Diagram: on the buffer box, the event receiver takes data from an ATP into the master ET pool, and the logger writes them to the disk array.]
We don’t transfer single-event data chunks but buffers of events, strictly for network performance reasons -- larger transfers, more speed.
At the back end, we again write buffers; for data integrity we write those buffers in multiples of 8 KB -- if something is corrupt, you skip 8 KB blocks until you find the next buffer header.
But it's a lot of overhead -- buffers get disassembled into events, fed to the pool, re-assembled into buffers...
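The resynchronization idea behind the 8 KB alignment can be sketched as follows. The magic word and the little-endian header layout are assumptions for illustration; the real PRDF header format is more involved:

```python
import struct

BLOCK = 8192
# Hypothetical 32-bit marker at the start of every buffer header;
# the real PHENIX buffer-header magic value is an assumption here.
MAGIC = 0xC0DEFACE

def next_buffer_offset(f, start):
    """Scan forward in 8 KB steps from `start` until a block begins
    with the buffer-header magic word; return its offset or None.
    Because buffers are written in multiples of 8 KB, a corrupt
    stretch can be skipped one block at a time."""
    off = start - (start % BLOCK) + BLOCK   # next 8 KB boundary
    while True:
        f.seek(off)
        word = f.read(4)
        if len(word) < 4:
            return None                     # end of file, no header found
        if struct.unpack("<I", word)[0] == MAGIC:
            return off
        off += BLOCK
```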
The new stuff...
…called Advanced Multithreaded Logger (AML).
Advances in multi-threading in Linux allowed us to make a logger which receives data from a number of network sockets (the ATPs); one thread per ATP handles the connection.
Writing to the output file can be easily regulated by semaphores (hence threads, not processes).
But the real gain is that the buffers sent from the ATP’s are written to the file as they are. No disassembly/assembly step. Less overhead. Much easier.
[Diagram: the ATPs connect to the AML on the buffer box, one thread per connection; the threads write the buffers to the disk array.]
Fewer moving parts, more speed.
And it opens up new possibilities...
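The thread-per-connection structure can be sketched like this; the class name is made up, and iterables of ready-made buffers stand in for the real network sockets:

```python
import threading

class BufferLogger:
    """Minimal sketch of an AML-style logger: one thread per input
    connection; each thread writes whole buffers to the shared output
    file, serialized by a lock (standing in for the semaphore)."""
    def __init__(self, outfile):
        self.out = outfile
        self.lock = threading.Lock()

    def serve(self, conn):
        # `conn` would be a socket per ATP; here, any iterable of
        # ready-made buffers works.
        for buf in conn:
            with self.lock:            # one writer at a time
                self.out.write(buf)    # buffer written as-is: no
                                       # disassembly/assembly step

    def run(self, connections):
        threads = [threading.Thread(target=self.serve, args=(c,))
                   for c in connections]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

The lock guarantees each buffer lands in the file in one piece, whatever the interleaving of the threads.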
We have a PHENIX standard of compressed data … not just zero-suppressed, actually compressed. We write the data in buffers.

This is what a file normally looks like:

buffer | buffer | buffer | buffer | buffer | buffer

In the I/O layer, each buffer is run through the gzip algorithm, and a new buffer with a new buffer header is written with the compressed one as payload. This is what a file then looks like.

On readback, the gunzip algorithm restores the original uncompressed buffer.
All this is handled completely in the I/O layer, the higher-level routines just receive a buffer as before.
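The wrap/unwrap round trip can be sketched with zlib; the 8-byte header layout and the marker value are hypothetical, not the real PRDF format:

```python
import struct, zlib

# Hypothetical header: total length plus a marker flagging a
# compressed payload (the real buffer-header layout is more involved).
COMPRESSED = 0xC0DE

def pack_buffer(raw):
    """I/O-layer write path: compress the buffer and wrap it in a
    new buffer whose payload is the compressed data."""
    payload = zlib.compress(raw)
    return struct.pack("<II", len(payload) + 8, COMPRESSED) + payload

def unpack_buffer(data):
    """I/O-layer read path: strip the header and restore the original
    uncompressed buffer, transparently to higher-level routines."""
    length, marker = struct.unpack("<II", data[:8])
    assert marker == COMPRESSED
    return zlib.decompress(data[8:length])

# Round trip: zero-suppressed-looking data shrinks well.
original = b"\x00" * 9000 + bytes(range(256))
assert unpack_buffer(pack_buffer(original)) == original
```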
…but pretty expensive
Compressed PRDF's are expensive to get. We have used them so far for "slow" data such as calibrations. It's too computing-intensive to do on the buffer boxes. With the AML, where we just move ready-made buffers, the ATP's could send already compressed buffers. We could distribute the load over many CPUs. It's compelling:

/data> time dpipe -s f -d f -z /common/b0/eventdata/EVENTDATAxxx_P01-0000077374-0000.PRDFF xx.prdfz
112.130u 12.290s 2:21.86 87.7% 0+0k 0+0io 301pf+0w

ls -l /common/b0/eventdata/EVENTDATAxxx_P01-0000077374-0000.PRDFF xx.prdfz
1574395904 Mar 10 01:29 /common/b0/eventdata/EVENTDATAxxx_P01-0000077374-0000.PRDFF
 699604992 Mar 10 23:37 xx.prdfz

699604992/1574395904 = 44%

We could shrink the output size to below 50%.
Monitoring/Analysis software
We make use of the multi-threading capabilities of ROOT to look at Histograms/data while data are being processed by a background thread. Required for online monitoring.
That framework, called “pMonitor”, is used for all online monitoring tasks. It can read all kinds of data streams (file, ET, …), it’s all transparent to the user.
Other packages (reconstruction, online calibrations, monitoring, …) build on that basic framework. We are about to release a new reconstruction framework (working name "fun4all") that supports plugin-style analysis modules. It has already been used in the online analysis in this run.
We have a long and successful tradition of using just one analysis software package in online and offline. One learning curve, one set of libraries. In offline/batch, the online capabilities are just not used (but the package has all the online power still under the hood).
Overall, we have had good success with this “one package fits all” approach.
Open Source, please
We have had a somewhat troubled relationship with proprietary software. Your interests virtually never align with the vendor's economic interests, and you are left behind.
At this point, we use 3 commercial packages (not counting Solaris compilers, and VxWorks stuff etc)
• Objectivity
• Orbix/Iona
• Insure++
Orbix is used exclusively in the DAQ. Not a problem.
Objectivity is different. We have a concurrent license deal with Objy so collaborators have access to the software.
Still, it is on its way to becoming a liability. It holds us back at gcc 2.95.3, while other packages are starting to require gcc 3.2.
We used Insure++ to hunt down memory leaks in the past. The free Valgrind package does a much better job: it finds leaks and uninitialized variables that Insure didn't, and it is faster (a full Insure rebuild used to take days).
Data management
With multiple-TB data sets, we had to address the data management. The days when you could have a list of golden runs in your logbook are over.
Early on, we restricted the database to management of meta-data (flat files + database info). Worked so far, but:
We need to distribute files automatically across file servers to balance the server load
We need to avoid needless replication of files and resolve ownership issues (when can a file be deleted? When copied?), as well as provide desired replication for popular files.
We have always had a "Data Carousel" interface to HPSS that pools requests; it gave us a performance boost of about a factor of 7 over a first-come, first-served model.
New additions are ARGO (a Postgres-based file catalog) that keeps track of where resources are located, name-based position-independent identification of files, etc
Also FROG (File Replica On a Grid) that is integrated into pmonitor, API and Web interfaces to ARGO. Don’t miss Irina Sourikova’s talk tomorrow, Cat. 8, 2:30.
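The Data Carousel's gain comes from pooling: serving tape requests in arrival order remounts tapes over and over, while pooling groups the batch by tape so each tape is mounted once. A toy model (the tape labels, file names, and function names are made up for illustration):

```python
from collections import defaultdict

# Hypothetical batch of (tape, file) requests in arrival order.
requests = [("tape1", "f1"), ("tape2", "f2"), ("tape1", "f3"),
            ("tape3", "f4"), ("tape2", "f5"), ("tape1", "f6")]

def mounts_fcfs(reqs):
    """Count tape mounts when requests are served first-come,
    first-served: every change of tape costs a mount."""
    mounts, loaded = 0, None
    for tape, _ in reqs:
        if tape != loaded:
            mounts, loaded = mounts + 1, tape
    return mounts

def mounts_pooled(reqs):
    """Pool requests by tape: one mount per tape in the batch."""
    by_tape = defaultdict(list)
    for tape, fname in reqs:
        by_tape[tape].append(fname)
    return len(by_tape)

assert mounts_fcfs(requests) == 6    # a mount per request here
assert mounts_pooled(requests) == 3  # one mount per distinct tape
```

The real-world factor depends on how requests cluster on tapes, but the mechanism is the same.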
PHENIX Computing Resources
Locally in the countinghouse:
• ~110 Linux machines of different ages (26 dual 2.4 GHz, lots of 900-1600 MHz machines), loosely based on RedHat 7.3
• configuration == reinstallation, very low maintenance
• 8TB local disk space
At the RHIC computing facility:
• 40TB disk (PHENIX’s share)
• 4 tape silos w/ 420 TB each (no danger of running out of space here)
• ~320 fast dual CPU Linux machines (PHENIX’s share)
Online computing tasks
With the data still present in the Countinghouse, we try to get as much as possible done while the getting is good (access later is much more expensive)
About the last time that all files belonging to a “run” are together in one place
We try to filter and analyze interesting events (calibrations, special triggers, etc) for monitoring purposes and calibrations
The (somewhat ambitious) goal is to have as much of the preparations done as possible so that reconstruction can begin by the time the data leave the countinghouse. Still work in progress, but we’re optimistic.
Near-Online results
[Near-online plots, labeled "PHENIX ONLINE": the electron-positron invariant mass spectrum showing the J/psi, the dimuon invariant mass spectrum showing the J/psi, and the pi0 invariant mass spectrum.]
Upgrade plans
Locally in the countinghouse:
• go to Gigabit (almost) everywhere (replace ATM)
• use compression
• replace older machines, re-claim floor space, get another factor of 2 CPU and disks
• get el cheapo commodity disk space (“545” project, meaning 5TB for 5K$), say, another 10TB or so
In general:
• Move to gcc 3.2, loosely some Redhat 8.x flavor
• fully deploy ARGO, FROG
• we have already augmented Objectivity with convenience databases (Postgres, MySQL), which will play a larger role in the future.
• Network restructuring: Go to Jumbo frame network for DAQ?
The End. Thank you.
A picture from the PHENIX Webcam