
Online Performance Monitoring of the Third ALICE Data Challenge


Page 1: Online Performance Monitoring of the Third ALICE Data Challenge

Online Performance Monitoring of the Third ALICE Data Challenge

W. Carena (1), R. Divia (1), P. Saiz (2), K. Schossmaier (1), A. Vascotto (1), P. Vande Vyvre (1)

(1) CERN EP-AID, (2) CERN EP-AIP

NEC2001

Varna, Bulgaria

12-18 September 2001

Page 2: Online Performance Monitoring of the Third ALICE Data Challenge


Contents

ALICE Data Challenges

Testbed infrastructure

Monitoring system

Performance results

Conclusions

Page 3: Online Performance Monitoring of the Third ALICE Data Challenge


ALICE Data Acquisition

[Diagram: dataflow of the final ALICE DAQ system]

ALICE detectors -> Readout: up to 20 GB/s
Local Data Concentrators (LDC): ~300 nodes
LDC -> Event Building: up to 2.5 GB/s
Global Data Collectors (GDC): ~100 nodes
GDC -> CASTOR Mass Storage System: up to 1.25 GB/s

Final system!

Page 4: Online Performance Monitoring of the Third ALICE Data Challenge


ALICE Data Challenges

What? Put together components to demonstrate the feasibility, reliability and performance of our present prototypes.

Where? The ALICE common testbed uses the hardware of the common CERN LHC testbed.

When? The exercise is repeated every year, progressively enlarging the testbed.

Who? A joint effort between the ALICE online and offline groups and two groups of the CERN IT division.

[Chart: planned DAQ and Mass Storage bandwidth per year, 1999-2006, on a 0-3000 MB/s scale]

ADC I: March 1999

ADC II: March-April 2000

ADC III: January-March 2001

ADC IV: 2nd half 2002 ?

Page 5: Online Performance Monitoring of the Third ALICE Data Challenge


Goals of the ADC III

Performance, scalability, and stability of the system (10% of the final system)

300 MB/s event building bandwidth

100 MB/s over the full chain during a week

80 TB into the mass storage system

Online monitoring tools

Page 6: Online Performance Monitoring of the Third ALICE Data Challenge


ADC III Testbed Hardware

Farm:
80 standard PCs
dual PIII @ 800 MHz
Fast and Gigabit Ethernet
Linux kernel 2.2.17

Network:
6 switches from 3 manufacturers
copper and fiber media
Fast and Gigabit Ethernet

Disks:
8 disk servers
dual PIII @ 700 MHz
20 IDE data disks
750 GB mirrored

Tapes:
3 HP NetServers
12 tape drives
1000 cartridges
60 GB capacity
10 MB/s bandwidth

Page 7: Online Performance Monitoring of the Third ALICE Data Challenge


ADC III Monitoring

Minimum requirements:
LDC/GDC throughput (individual and aggregate)
Data volume (individual and aggregate)
CPU load (user and system)
Identification: time stamp, run number
Plots accessible on the Web

Online monitoring tools:
PEM (Performance and Exception Monitoring) from CERN IT-PDP: was not ready for ADC III
Fabric monitoring: developed by CERN IT-PDP
ROOT I/O: measures mass storage throughput
CASTOR: measures disk/tape/pool statistics
DATESTAT: prototype developed by EP-AID and EP-AIP

Page 8: Online Performance Monitoring of the Third ALICE Data Challenge


Fabric Monitoring

Collect CPU, network I/O, and swap statistics
Send UDP packets to a server
Display current status and history using Tcl/Tk scripts
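A minimal sketch of such a collector in C, assuming a hypothetical collector address, port, node name, and packet format, and reading the Linux load average from /proc/loadavg as a stand-in for the full CPU/network/swap set:

```c
/* Minimal fabric-monitoring sender: read one statistic from /proc
 * and ship it to the collector in a UDP datagram.
 * Collector address, port, node name, and packet format are
 * hypothetical; the real ADC III agent collected CPU, network
 * I/O, and swap figures. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in srv;

    memset(&srv, 0, sizeof srv);
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(9090);                 /* hypothetical port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);

    for (;;) {
        char load[64], packet[128];
        FILE *f = fopen("/proc/loadavg", "r");

        if (f && fgets(load, sizeof load, f)) {
            load[strcspn(load, "\n")] = '\0';
            snprintf(packet, sizeof packet,
                     "node=ldc01 loadavg=%s", load);  /* hypothetical format */
            sendto(sock, packet, strlen(packet), 0,
                   (struct sockaddr *)&srv, sizeof srv);
        }
        if (f)
            fclose(f);
        sleep(10);                                /* report every 10 s */
    }
}
```

UDP fits this job well: a lost sample now and then is harmless, and the sender never blocks the monitored node.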

Page 9: Online Performance Monitoring of the Third ALICE Data Challenge


ROOT I/O Monitoring

Measure aggregate throughput to the mass storage system
Collect measurements in a MySQL database
Display history and histograms using ROOT on Web pages
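The storage step can be as small as one INSERT per measurement. A sketch using the MySQL C API, with hypothetical host, credentials, and table schema (the actual ADC III schema is not shown in the talk):

```c
/* Sketch: record one aggregate-throughput measurement.
 * Host, credentials, database, and table are hypothetical.
 * Build with: gcc rootio_rec.c -lmysqlclient */
#include <stdio.h>
#include <time.h>
#include <mysql/mysql.h>

static int record_throughput(double mb_per_s)
{
    MYSQL *db = mysql_init(NULL);
    char sql[256];

    if (!mysql_real_connect(db, "alicedb.cern.ch", "monitor", "secret",
                            "rootio", 0, NULL, 0)) {
        fprintf(stderr, "connect: %s\n", mysql_error(db));
        return -1;
    }
    snprintf(sql, sizeof sql,
             "INSERT INTO throughput (stamp, mbps) VALUES (%ld, %.1f)",
             (long)time(NULL), mb_per_s);
    if (mysql_query(db, sql) != 0)
        fprintf(stderr, "insert: %s\n", mysql_error(db));
    mysql_close(db);
    return 0;
}

int main(void)
{
    return record_throughput(304.0);  /* e.g. one Result 1/1 sample */
}
```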

Page 10: Online Performance Monitoring of the Third ALICE Data Challenge


DATESTAT Architecture

[Diagram: DATESTAT dataflow, reconstructed from the slide]

DATE v3.7 runs on all LDC and GDC nodes.
dateStat.c, top, and DAQCONTROL feed measurements to the DATE Info Logger.
The Info Logger produces log files (~200 KB/hour/node).
A Perl script condenses the log files into statistics files.
From the statistics files, a gnuplot script produces plots directly, and a C program fills a MySQL database.
A gnuplot/CGI script publishes the plots at http://alicedb.cern.ch/statistics
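The per-node statistics must then be reduced to the aggregate numbers shown on the following slides. A minimal sketch of that reduction step in C, assuming a hypothetical two-column input of node name and rate in MB/s (the real DATESTAT statistics format differs):

```c
/* Sketch of the reduction step: sum per-node rates into the
 * aggregate figure. Input format ("<node> <MB/s>" per line) is
 * hypothetical. */
#include <stdio.h>

int main(void)
{
    char   node[64];
    double rate, total = 0.0;
    int    nodes = 0;

    while (scanf("%63s %lf", node, &rate) == 2) {
        total += rate;   /* accumulate the per-node throughput */
        nodes++;
    }
    printf("%d nodes, aggregate %.1f MB/s\n", nodes, total);
    return 0;
}
```

Fed one line per node, e.g. "ldc01 27.1", it prints the node count and the summed rate.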

Page 11: Online Performance Monitoring of the Third ALICE Data Challenge


Selected DATESTAT Results

Result 1: DATE standalone run, equal subevent size
Result 2: Dependence on subevent size
Result 3: Dependence on the number of LDCs/GDCs
Result 4: Full chain, ALICE-like subevents

Page 12: Online Performance Monitoring of the Third ALICE Data Challenge


Result 1/1

DATE standalone

11 LDC x 11 GDC nodes, 420-440 KB subevents, 18 hours

Volume: 19.8 TB (4E6 events)
Aggregate rate: 304 MB/s
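As a consistency check of the volume and rate figures (assuming decimal units, 1 TB = 10^6 MB):

$$\frac{19.8 \times 10^{6}\ \mathrm{MB}}{18 \times 3600\ \mathrm{s}} \approx 306\ \mathrm{MB/s},$$

which matches the reported 304 MB/s up to rounding of the volume and duration.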

Page 13: Online Performance Monitoring of the Third ALICE Data Challenge


Result 1/2

DATE standalone

11 LDC x 11 GDC nodes, 420-440 KB subevents, 18 hours

LDC load: 12% user, 27% sys
LDC rate: 27.1 MB/s

Page 14: Online Performance Monitoring of the Third ALICE Data Challenge


Result 1/3

DATE standalone

11 LDC x 11 GDC nodes, 420-440 KB subevents, 18 hours

GDC load: 1% user, 37% sys
GDC rate: 27.7 MB/s

Page 15: Online Performance Monitoring of the Third ALICE Data Challenge


Result 2

DATE standalone

13 LDC x 13 GDC nodes, 50-60 KB subevents, 1.1 hours

Dependence on subevent size
Aggregate rate: 556 MB/s

Page 16: Online Performance Monitoring of the Third ALICE Data Challenge


Result 3

Dependence on #LDC/#GDC

DATE standalone

Gigabit Ethernet:

max. 30 MB/s per LDC

max. 60 MB/s per GDC

Page 17: Online Performance Monitoring of the Third ALICE Data Challenge


Result 4/1

Full chain

20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours

Volume: 18.4 TB (3.7E6 events)
Aggregate rate: 87.6 MB/s

Page 18: Online Performance Monitoring of the Third ALICE Data Challenge


Result 4/2

Full chain

20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours

GDC load: 6% user, 23% sys
GDC rate: 6.8 MB/s

Page 19: Online Performance Monitoring of the Third ALICE Data Challenge


Result 4/3

Full chain

20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours

LDC load: 0.8% user, 2.7% sys
LDC rate: 1.1 MB/s (60 KB subevents, Fast Ethernet)

Page 20: Online Performance Monitoring of the Third ALICE Data Challenge


Grand Total

Maximum throughput in DATE: 556 MB/s for symmetric traffic, 350 MB/s for ALICE-like traffic

Maximum throughput in full chain: 120 MB/s without migration, 86 MB/s with migration

Maximum volume per run: 54 TB with DATE standalone, 23.6 TB with full chain

Total volume through DATE: at least 500 TB
Total volume through full chain: 110 TB
Maximum duration per run: 86 hours
Maximum events per run: 21E6
Maximum subevent size: 9 MB
Maximum number of nodes: 20x15
Number of runs: 2200

Page 21: Online Performance Monitoring of the Third ALICE Data Challenge


Summary

Most of the ADC III goals were achieved
PC/Linux platforms are stable and reliable
Ethernet technology is reliable and scalable
DATE standalone is running well
The full chain needs to be analyzed further
Next ALICE Data Challenge in the 2nd half of 2002

Online Performance Monitoring
The DATESTAT prototype performed well
It helped to spot bottlenecks in the DAQ system
The team in Zagreb is re-designing and re-engineering the DATESTAT prototype

Page 22: Online Performance Monitoring of the Third ALICE Data Challenge


Future Work

Polling agent
obtain performance data from all components
keep the agent simple, uniform, and extendable
support several platforms (UNIX, application software)
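One way to keep such an agent simple, uniform, and extendable is a table of probe callbacks polled at a fixed period. A sketch in C with stubbed, illustrative probes (not the planned DATESTAT redesign):

```c
/* Sketch of a simple, uniform, extendable polling agent: a table
 * of probe callbacks polled at a fixed period. Probe bodies are
 * stubs; names and reporting format are illustrative. */
#include <stdio.h>
#include <unistd.h>

typedef double (*probe_fn)(void);

static double probe_cpu(void) { return 0.0; }  /* stub: parse /proc/stat    */
static double probe_net(void) { return 0.0; }  /* stub: parse /proc/net/dev */

static const struct {
    const char *name;
    probe_fn    fn;
} probes[] = {
    { "cpu", probe_cpu },
    { "net", probe_net },
};

int main(void)
{
    for (;;) {
        for (size_t i = 0; i < sizeof probes / sizeof probes[0]; i++)
            printf("%s=%.2f\n", probes[i].name, probes[i].fn());
        sleep(10);   /* uniform polling period */
    }
}
```

Adding a metric then means writing one probe function and adding one table entry; porting to another platform means reimplementing only the probe bodies.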

Transport & Storage
use communication with low overhead
maintain a common format in a central database

Processing
apply efficient algorithms to filter and correlate logged data
store performance results permanently in a database

Visualization
use a common GUI (Web-based, ROOT objects)
provide different views (levels, time scales, color codes)
automatically generate plots, histograms, reports, e-mail, ...