Online Performance Monitoring of the Third ALICE Data Challenge
W. Carena (1), R. Divia (1), P. Saiz (2), K. Schossmaier (1), A. Vascotto (1), P. Vande Vyvre (1)
(1) CERN EP-AID, (2) CERN EP-AIP
NEC2001
Varna, Bulgaria
12-18 September 2001
Contents
ALICE Data Challenges
Testbed infrastructure
Monitoring system
Performance results
Conclusions
ALICE Data Acquisition
[Diagram: the final ALICE data acquisition system. ALICE detectors -> Readout (up to 20 GB/s) -> Local Data Concentrators (LDC) -> Event Building over ~300 nodes (up to 2.5 GB/s) -> Global Data Collectors (GDC) -> CASTOR Mass Storage System over ~100 nodes (up to 1.25 GB/s).]
ALICE Data Challenges
What? Put together components to demonstrate the feasibility, reliability and performance of our present prototypes.
Where? The ALICE common testbed uses the hardware of the common CERN LHC testbed.
When? The exercise is repeated every year, progressively enlarging the testbed.
Who? A joint effort between the ALICE online and offline groups and two groups of the CERN IT division.
[Chart: target DAQ and Mass Storage bandwidth per year, 1999-2006, on a scale of 0-3000 MB/s.]
ADC I: March 1999
ADC II: March-April 2000
ADC III: January-March 2001
ADC IV: 2nd half 2002 ?
Goals of the ADC III
Performance, scalability, and stability of the system (10% of the final system)
300 MB/s event building bandwidth
100 MB/s over the full chain during a week
80 TB into the mass storage system
Online monitoring tools
ADC III Testbed Hardware

Farm:
80 standard PCs
dual PIII @ 800 MHz
Fast and Gigabit Ethernet
Linux kernel 2.2.17

Network:
6 switches from 3 manufacturers
copper and fiber media
Fast and Gigabit Ethernet

Disks:
8 disk servers
dual PIII @ 700 MHz
20 IDE data disks
750 GB mirrored

Tapes:
3 HP NetServers
12 tape drives
1000 cartridges
60 GB capacity
10 MB/s bandwidth
ADC III Monitoring
Minimum requirements:
LDC/GDC throughput (individual and aggregate)
Data volume (individual and aggregate)
CPU load (user and system)
Identification: time stamp, run number
Plots accessible on the Web

Online monitoring tools:
PEM (Performance and Exception Monitoring) from CERN IT-PDP was not ready for ADC III
Fabric monitoring: developed by CERN IT-PDP
ROOT I/O: measures mass storage throughput
CASTOR: measures disk/tape/pool statistics
DATESTAT: prototype developed by EP-AID and EP-AIP
Fabric Monitoring
Collect CPU, network I/O, and swap statistics
Send UDP packets to a server
Display current status and history using Tcl/Tk scripts
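As an illustration of the collect-and-send step, here is a minimal C sketch that reads the aggregate CPU line from /proc/stat and ships it to a collector in a single UDP datagram. The collector host, port, and payload layout are assumptions for illustration, not the actual CERN IT-PDP implementation.

    /* Minimal sketch of a fabric-monitoring sampler (assumptions:
       collector at 127.0.0.1:9999, raw /proc/stat line as payload). */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        char line[256] = "";
        FILE *f = fopen("/proc/stat", "r");      /* Linux CPU counters */
        if (f == NULL || fgets(line, sizeof(line), f) == NULL) {
            perror("/proc/stat");
            return 1;
        }
        fclose(f);

        int s = socket(AF_INET, SOCK_DGRAM, 0);  /* UDP: cheap, connectionless */
        if (s < 0) { perror("socket"); return 1; }

        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port   = htons(9999);                    /* assumed port */
        inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);  /* assumed host */

        /* one datagram per sample; the server keeps status and history */
        if (sendto(s, line, strlen(line), 0,
                   (struct sockaddr *)&srv, sizeof(srv)) < 0)
            perror("sendto");
        close(s);
        return 0;
    }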
ROOT I/O Monitoring
Measures aggregate throughput to the mass storage system
Collects measurements in a MySQL database
Displays history and histograms using ROOT on Web pages
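To make the "collect measurements in a MySQL database" step concrete, below is a hedged sketch using the MySQL C API. The host, credentials, database, and table schema are invented for illustration and do not reflect the actual ROOT I/O monitoring code.

    /* Sketch: record one aggregate-throughput sample in MySQL.
       Host "dbhost", user, password, and schema are assumptions. */
    #include <stdio.h>
    #include <mysql/mysql.h>

    int main(void)
    {
        MYSQL *db = mysql_init(NULL);
        if (mysql_real_connect(db, "dbhost", "monitor", "secret",
                               "rootio", 0, NULL, 0) == NULL) {
            fprintf(stderr, "connect: %s\n", mysql_error(db));
            return 1;
        }
        /* one row per measurement; ROOT later reads the history back
           to draw the Web plots and histograms */
        if (mysql_query(db, "INSERT INTO throughput (t, mb_per_s) "
                            "VALUES (NOW(), 87.6)") != 0)
            fprintf(stderr, "insert: %s\n", mysql_error(db));
        mysql_close(db);
        return 0;
    }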
DATESTAT Architecture
[Diagram: DATESTAT data flow. On each LDC and GDC node, dateStat.c samples top and DAQCONTROL and writes through the DATE v3.7 Info Logger into log files (~200 KB/hour/node). A Perl script condenses the log files into statistics files, which a gnuplot script plots; a C program loads the statistics into a MySQL database, and a gnuplot/CGI script publishes the results at http://alicedb.cern.ch/statistics.]
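As a sketch of the parsing stage (handled by a Perl script in the real chain), the C fragment below shows the kind of record extraction involved before the statistics are loaded into MySQL. The log-line format is an assumption for illustration, not the actual DATE Info Logger format.

    /* Sketch: turn one statistics line into typed fields.
       The line format here is assumed, not the real one. */
    #include <stdio.h>

    int main(void)
    {
        const char *line = "1000123456 run=1742 node=ldc07 rate=27.1";
        long t;            /* time stamp   */
        int  run;          /* run number   */
        char node[32];     /* LDC/GDC host */
        double mb_per_s;   /* throughput   */

        if (sscanf(line, "%ld run=%d node=%31s rate=%lf",
                   &t, &run, node, &mb_per_s) == 4)
            printf("run %d, %s: %.1f MB/s at %ld\n",
                   run, node, mb_per_s, t);
        return 0;
    }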
Selected DATESTAT Results

Result 1: DATE standalone run, equal subevent size
Result 2: Dependence on subevent size
Result 3: Dependence on the number of LDCs/GDCs
Result 4: Full chain, ALICE-like subevents
Result 1/1

DATE standalone
11 LDC x 11 GDC nodes, 420-440 KB subevents, 18 hours
Volume: 19.8 TB (4E6 events); aggregate rate: 304 MB/s
Result 1/2

DATE standalone
11 LDC x 11 GDC nodes, 420-440 KB subevents, 18 hours
LDC load: 12% user, 27% system; LDC rate: 27.1 MB/s
Result 1/3

DATE standalone
11 LDC x 11 GDC nodes, 420-440 KB subevents, 18 hours
GDC load: 1% user, 37% system; GDC rate: 27.7 MB/s
Result 2

DATE standalone
13 LDC x 13 GDC nodes, 50-60 KB subevents, 1.1 hours
Dependence on subevent size. Aggregate rate: 556 MB/s
Result 3
Dependence on #LDC/#GDC
DATE standalone
Gigabit Ethernet:
max. 30 MB/s per LDC
max. 60 MB/s per GDC
Result 4/1

Full chain
20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours
Volume: 18.4 TB (3.7E6 events); aggregate rate: 87.6 MB/s
Result 4/2

Full chain
20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours
GDC load: 6% user, 23% system; GDC rate: 6.8 MB/s
Result 4/3

Full chain
20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours
LDC load: 0.8% user, 2.7% system; LDC rate: 1.1 MB/s (60 KB subevents, Fast Ethernet)
Grand Total

Maximum throughput in DATE: 556 MB/s for symmetric traffic, 350 MB/s for ALICE-like traffic
Maximum throughput in the full chain: 120 MB/s without migration, 86 MB/s with migration
Maximum volume per run: 54 TB with DATE standalone, 23.6 TB with the full chain
Total volume through DATE: at least 500 TB
Total volume through the full chain: 110 TB
Maximum duration per run: 86 hours
Maximum events per run: 21E6
Maximum subevent size: 9 MB
Maximum number of nodes: 20 x 15
Number of runs: 2200
Summary

Most of the ADC III goals were achieved
PC/Linux platforms are stable and reliable
Ethernet technology is reliable and scalable
DATE standalone is running well
The full chain needs to be analyzed further
The next ALICE Data Challenge is planned for the 2nd half of 2002

Online performance monitoring:
The DATESTAT prototype performed well
It helped to spot bottlenecks in the DAQ system
The team in Zagreb is re-designing and re-engineering the DATESTAT prototype
Future Work

Polling agent (see the sketch after this list):
obtain performance data from all components
keep the agent simple, uniform, and extendable
support several platforms (UNIX, application software)

Transport & storage:
use communication with low overhead
maintain a common format in a central database

Processing:
apply efficient algorithms to filter and correlate logged data
store performance results permanently in a database

Visualization:
use a common GUI (Web-based, ROOT objects)
provide different views (levels, time scales, color codes)
automatically generate plots, histograms, reports, e-mail, ...
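As promised above, a minimal sketch of the polling-agent idea: one uniform sampling loop with a platform-specific probe plugged in. The probe read_metric(), the record format, and the 10-second interval are assumptions for illustration, not design decisions of the project.

    /* Sketch of a simple, uniform polling agent. read_metric() is a
       hypothetical probe; a real agent would gather CPU load, network
       I/O, and application counters per platform. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static double read_metric(void)
    {
        return 0.0;   /* placeholder for a platform-specific probe */
    }

    int main(void)
    {
        for (;;) {
            /* one time-stamped record per sample, in a common format
               the transport and storage layer can ship as-is */
            printf("%ld %.3f\n", (long)time(NULL), read_metric());
            fflush(stdout);
            sleep(10);    /* assumed polling interval */
        }
    }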