
The DØ Computing Model

Page 1: The DØ Computing Model

Sept 12 2001 Wyatt Merritt DØ Collaboration Meeting Plenary Session

The DØ Computing Model

• Overview
• The picture
• Planning history
• Status of acquisitions
• Performance
• More detail
    • On the current operation
    • On the R & D
• General Status
• Future plan

Page 2: The DØ Computing Model

Overview

• The data handling system
    • SAM
    • ENSTORE
    • Robot(s)
• The offline user computing systems
    • dØmino - O(20 TB) disk
    • linux analysis server(s) - O(2 TB) disk
    • linux development machines - O(0.2 TB)
        • build cluster
        • ClueDØ
        • remote linux machines
    • non-development desktops
• Associated systems
    • Fermilab production farm (raw data reconstruction)
    • Remote production farms (simulation)
    • Database servers

High bandwidth into robot

Page 3: The DØ Computing Model

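[Figure: "The picture", a diagram of the DØ computing systems. Labeled elements: Detector, Robot, Linux Farms, Database Servers, dØmino, Analysis Cluster 1, Linux Compute Server, lxbld, ClueDØ, ClueDØ Server, NT Desktops, High speed Network. Annotation: Monte Carlo handled remotely. Capacities shown: 27 TB, ~1 TB, ~0.2 TB. Link rates shown: 12.5 Mb/s, 150 Mb/s.]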

Page 4: The DØ Computing Model

Planning history

• Original plan: January ’97
• DØ Internal Review: February ’97
• External review: Von Rüden Committee, Mar ’97, Oct ’97, Jun ’98, Jan ’99, Jun ’99
• Funding profile (DMNAG - joint with CDF) approved ’97
• Plan updates
    • January ’99 for VR IV
    • Global Computing Model reports (’98-’99) [Addition of Analysis Servers to plan]
• Plan implementation ’97 - ’01
    • Run II Computing and Software Project: co-leaders + Computing Planning Board

Page 5: The DØ Computing Model

Status of acquisitions

• Analysis cpu
    • dØmino: 192 proc O2000 - complete (except add memory)
    • Desktops: responsibility of institutions
    • Analysis Clusters/Servers - 1 purchased of (6?)
• Reconstruction cpu
    • 200 processors acquired of 400 planned [40 Hz cap @ current reco cpu perf.; 80 Hz @ target reco perf]
• Disk storage
    • 30 TB total - complete (plan was 15 TB)
    • See allocation slide

Page 6: The DØ Computing Model

Disk space in the offline systems

Total available disk space: 30 Tbyte
• 27 Tbytes are on D0MINO
• 3 Tbytes are on: D0test, d0lxac1, d0lxbld

Disk space on D0MINO (all units are Tbytes)

Area                                 Available   Used
Scratch, releases & other config.    1           1
SAM cache                            6           variable
DST/mDST                             12          variable
Project disks                        4           ~2.0?
Tmp (group space)                    2           ?
contingency                          2
TOTAL                                27

[Chart accompanying the table: series label "Allocated"; values shown without recoverable labels: 22.5, 0.9, 2.6, 12, 6, 1.]
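
As a quick consistency check on the allocation table, the per-area figures can be summed and compared with the quoted totals. This is only an illustrative sketch: the dictionary and variable names are invented for the example, and the values are copied from the slide (all in Tbytes).

```python
# Available space per area on D0MINO, in Tbytes, as listed on the slide.
# The dictionary is an illustrative construct, not part of any DØ tool.
d0mino_available_tb = {
    "Scratch, releases & other config.": 1,
    "SAM cache": 6,
    "DST/mDST": 12,
    "Project disks": 4,
    "Tmp (group space)": 2,
    "contingency": 2,
}

# Space quoted for the other machines (D0test, d0lxac1, d0lxbld), in Tbytes.
other_systems_tb = 3

d0mino_total_tb = sum(d0mino_available_tb.values())
offline_total_tb = d0mino_total_tb + other_systems_tb

print(f"D0MINO total: {d0mino_total_tb} TB")
print(f"All offline systems: {offline_total_tb} TB")
```

The two sums reproduce the 27 Tbyte and 30 Tbyte totals quoted on the slide.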

Page 7: The DØ Computing Model

Status of acquisitions cont’d

• Robotic tape storage
    • 1 ADIC robot (750 TB capacity) - complete
    • 18 Mammoth II tape drives - will be retired
    • 6 LTO drives - now
    • 2 STK robots (600 TB capacity) - FY02
    • 9 STK 9940 drives - FY02
    • Post shutdown stopgap - use existing STKen w/ 4 drives
• Database servers - complete
    • 2 SUN systems w/ 600 GB disk

Page 8: The DØ Computing Model

Performance

• Farm production stats
• dØmino cpu & mem stats
• AC1 cpu & mem stats
• SAM & encp stats
• Disk usage stats
• Conclusion: Chief needs
    • More memory for dØmino
    • More reliable tape drives
    • More farm nodes
    • More linux cpu
• Open questions - DB server upgrades?

Page 9: The DØ Computing Model

Farm Production Statistics

See web link from Main DØ Computing for weekly reports

Week of 08/31 - 09/06:
• 800,000 events processed / 140,000 from data collected in that week
• 1.9 M events collected in that week
• Problems in this week:
    • encp problem (code change from ENSTORE)
    • disk failure on dØbbin (the farm IO server)
    • several other problems as well...
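
To put the weekly counts on the same footing as the Hz figures used elsewhere in the talk, they can be converted to average rates. The sketch below assumes a full seven days of wall-clock time (not detector live time) and reads the 140,000 events as the part of that week's data already processed; both are assumptions, and all names are illustrative.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # assumes seven full days of wall-clock time

events_collected = 1_900_000
events_processed = 800_000
processed_from_this_week = 140_000


def average_rate_hz(n_events: int, seconds: int = SECONDS_PER_WEEK) -> float:
    """Average event rate over the period, in Hz."""
    return n_events / seconds


collected_hz = average_rate_hz(events_collected)
processed_hz = average_rate_hz(events_processed)
same_week_fraction = processed_from_this_week / events_collected

print(f"collected: {collected_hz:.2f} Hz averaged over the week")
print(f"processed: {processed_hz:.2f} Hz averaged over the week")
print(f"share of this week's data already processed: {same_week_fraction:.1%}")
```

Under these assumptions the farm processed events at well under half the average rate at which they were collected.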

Page 10: The DØ Computing Model

The Current Operation

• Code release model
• Mapping activities to systems
• ClueD0 operation
• Remote farm operation
• Role of the ORB

Page 11: The DØ Computing Model

The code release model

• Weekly test releases
• Production releases every three months
• Weekly subsystem coordinators meeting:
    • Minutes to d0rug mailing list
    • Rules for interface changes
    • Schedules for big disruptive changes (e.g. switch to KAI 4.0)

Page 12: The DØ Computing Model

Mapping activities to systems

• Code development: your Linux box, if possible; d0mino is the backup solution
• Large sample processing: a SAM station
    • d0mino, lxac1, special farm allocation (gtr), (ClueD0 - in R&D)
• Small sample processing: create derived DS on SAM station, transfer to desktop
• Office/Web browsing: use your desktop!
• Remote users: new position to address needs
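
The mapping on this slide amounts to a small lookup table. A minimal sketch of it as a Python dictionary follows; the key strings and the `where_to_run` helper are hypothetical names chosen for illustration and do not correspond to any DØ tool.

```python
# Activity -> systems involved, in the order the slide lists them.
# Keys and helper are illustrative names, not a DØ interface.
ACTIVITY_TO_SYSTEMS = {
    "code development": ["your Linux box", "d0mino"],
    "large sample processing": ["d0mino", "lxac1", "special farm allocation"],
    "small sample processing": ["SAM station", "desktop"],
    "office/web browsing": ["desktop"],
}


def where_to_run(activity: str) -> list[str]:
    """Return the systems listed for an activity (case-insensitive lookup)."""
    key = activity.strip().lower()
    if key not in ACTIVITY_TO_SYSTEMS:
        raise KeyError(f"no recommendation recorded for {activity!r}")
    return ACTIVITY_TO_SYSTEMS[key]


print(where_to_run("Code development"))
```

For code development the first entry is the preferred system and d0mino the fallback; for small samples the two entries are successive steps (derive on the SAM station, then transfer to the desktop).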

Page 13: The DØ Computing Model

Mapping activities to systems

• Disk usage
    • Home areas - backed up; you can ask for up to 250 MB (possibility of more for good reason) BUT NFS-mounted - don’t use for data files!
    • TMP areas - not backed up. Code development and/or data files, allocated per institution. 37 institutions are using it so far. A good place to start off if you are not working with a well-defined project.
    • PRJ areas - not backed up. Code development and/or data files, allocated per project. 3 large pools: commissioning, algorithm development, simulation, plus physics and ID groups and some smaller projects.
• Web pages - DØ Main Computing (SAM Data Handling section) --> General description of where data samples are stored in our system

Page 14: The DØ Computing Model

ClueD0 Operation

• The current population is:
    • 111 nodes with 138 CPUs and a total memory of 37 GB
    • 396 users
• Rules for joining and policies can be found at:
    • http://www-clued0.fnal.gov/clued0/
    • http://www-clued0.fnal.gov/clued0/policies.html
• Current difficulties from the lack of Redhat 7.1 builds are being actively worked on

Page 15: The DØ Computing Model

Monte Carlo Production Status

• Current Software – mcp07
    • p07.00.05a Generator, DØgstar, DØsim
    • p08.12.00 DØreco, recoanalyze
    • 950 kevents generated at reco level
    • Run IIB Simulation is a major effort
    • Will move to p08.13.00 to remove memory leak
• Future Releases – p09.10.00
    • Problem running DØgstar under investigation
    • Plate level available
    • p10 certification will be available by the end of the month

Page 16: The DØ Computing Model

The Offline Resources Board

• Charge: Allocate offline resources according to the experiment’s priorities
    • Project & tmp disk
    • Sample priorities for simulation on remote farms
    • Partitions in SAM cache
    • Batch queues
• Chair: Nick Hadley
• Web Page: http://www-d0.fnal.gov/Run2Physics/orb/d0_private/orb_home.html
• Institutions which have no tmp disk allocation and have active users: email to [email protected] - 18 GB will be allocated
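
For a rough sense of scale, the 18 GB default grant can be set against the 2 Tbyte tmp pool and the 37 institutions quoted elsewhere in this talk. The sketch assumes, purely for illustration, that every institution holds exactly the default grant and that decimal units apply (1 TB = 1000 GB); neither is stated in the slides.

```python
TMP_POOL_GB = 2 * 1000       # "Tmp (group space)": 2 Tbytes, decimal units assumed
DEFAULT_ALLOCATION_GB = 18   # default grant for an institution without tmp disk
institutions_using_tmp = 37  # count quoted on the disk-usage slide

# Assumption for illustration: every institution holds exactly the default grant.
committed_gb = institutions_using_tmp * DEFAULT_ALLOCATION_GB
remaining_gb = TMP_POOL_GB - committed_gb
max_institutions = TMP_POOL_GB // DEFAULT_ALLOCATION_GB

print(f"committed: {committed_gb} GB of {TMP_POOL_GB} GB")
print(f"remaining: {remaining_gb} GB")
print(f"default grants that fit in the pool: {max_institutions}")
```

Under these assumptions about a third of the tmp pool would be committed, which leaves room for further default-sized grants.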

Page 17: The DØ Computing Model

R & D

• Analysis clusters - one in service
• ClueD0 servers (a relocated analysis cluster) - software being tested; networking strategy being developed
• Compute servers for dØmino (a user-accessible farm) - 2 nodes available for tests
• Remote farms for raw data reconstruction and analysis
• Remote desktop analysis

Page 18: The DØ Computing Model

Institutional contributions

• Desktop seats
• Backup tapes
• Remote simulation capacity
• Disk for dØmino via budget code - issues
    • How to allocate between project & tmp?
    • Lifetime for contribution?
    • Unit of contribution: 1 rack of disk
• Analysis cluster for Feynman via budget code
    • Similar issues
• Analysis cluster for ClueDØ - all the above issues + SAM bandwidth, networking, sysadmin, ...

Page 19: The DØ Computing Model

General Status - Where are the limits/problems?

• Online
    • Max rate tested: 40 Hz to tape
    • Max rate sustained for a shift, to date: ~25 Hz to tape
    • Max rate expected with next iteration: 60 Hz to tape
    • Final limitation: tape budget (FY02 = ~400 TB)
• Running p10 on the farms
    • Processes raw data @ 23 sec/event
    • Thanks to Alg Group - worked out of box on raw data
    • Limits: ~2-3 Hz w/ current nodes & cpu perf of reco
    • Output size: HUGE - writing too much tape, breaking DB model, using more than allocated network and disk resources all down the line
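
The gap between the online rates and the farm limit follows from the 23 sec/event figure. A back-of-the-envelope sketch, assuming ideal scaling in which each processor reconstructs one event at a time and I/O and scheduling overheads are ignored (the slide states neither the processor count nor the scaling behaviour; function names are illustrative):

```python
import math

SEC_PER_EVENT = 23.0  # p10 reconstruction time per raw event, from the slide


def farm_rate_hz(n_processors: int, sec_per_event: float = SEC_PER_EVENT) -> float:
    """Aggregate rate if each processor works on one event at a time."""
    return n_processors / sec_per_event


def processors_needed(rate_hz: float, sec_per_event: float = SEC_PER_EVENT) -> int:
    """Smallest processor count whose aggregate rate reaches rate_hz."""
    return math.ceil(rate_hz * sec_per_event)


for label, rate in [
    ("farm limit (upper end)", 3),
    ("sustained online rate", 25),
    ("max tested online rate", 40),
]:
    print(f"{label}: {rate} Hz needs {processors_needed(rate)} processors")
```

In this idealized picture, matching the sustained online rate at 23 sec/event would take several hundred processors, which is why the talk stresses both more farm nodes and better reco cpu performance.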

Page 20: The DØ Computing Model

Expected Farm Performance

Farm configuration              @ Current cpu perf   @ Target cpu perf
Existing farm                   3 Hz                 6 Hz
+ FY01 purchase (32 nodes)      5 Hz                 10 Hz
+ FY02 purchase (200 nodes)     36 Hz                72 Hz
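
One way to read this table is against the online rates to tape quoted on the General Status slide (25 Hz sustained, 40 Hz tested, 60 Hz expected). The sketch below simply compares the numbers; the names are illustrative, and "keeps up" here ignores duty cycle, since the farm can run continuously while data taking cannot.

```python
# Expected reconstruction rates in Hz, copied from the table:
# (rate at current cpu performance, rate at target cpu performance)
FARM_RATES_HZ = {
    "Existing farm": (3, 6),
    "+ FY01 purchase (32 nodes)": (5, 10),
    "+ FY02 purchase (200 nodes)": (36, 72),
}

# Online rates to tape quoted on the General Status slide.
ONLINE_RATES_HZ = {"sustained": 25, "max tested": 40, "next iteration": 60}


def keeps_up(farm_hz: float, online_hz: float) -> bool:
    """True if the farm reconstructs at least as fast as data is written."""
    return farm_hz >= online_hz


for config, (current, target) in FARM_RATES_HZ.items():
    ok_current = [n for n, hz in ONLINE_RATES_HZ.items() if keeps_up(current, hz)]
    ok_target = [n for n, hz in ONLINE_RATES_HZ.items() if keeps_up(target, hz)]
    print(f"{config}: current covers {ok_current}, target covers {ok_target}")
```

Every target figure in the table is exactly twice the current one, and only the configuration including the FY02 purchase reaches any of the quoted online rates.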

Page 21: The DØ Computing Model

General Status - Where are the limits/problems?

• SAM/ENSTORE status
    • Working for many months with servers on automatic recovery
    • Not all features complete (pick events)
    • 5 GB interfaces can deliver 150 MB/sec to dØmino
• Robot status
    • Design rates met, but robustness severely limited by M II drive error rate - plan switchover by end of shutdown
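
To give the 150 MB/sec figure some scale, it can be turned into the time needed to fill the disk areas quoted in the allocation table. The sketch assumes a constant rate and decimal units (1 TB = 1,000,000 MB); sustained real-world rates would likely be lower, and the function name is illustrative.

```python
def transfer_hours(volume_tb: float, rate_mb_per_s: float = 150.0) -> float:
    """Hours to move volume_tb at a constant rate, in decimal units (1 TB = 1e6 MB)."""
    seconds = volume_tb * 1_000_000 / rate_mb_per_s
    return seconds / 3600


for name, size_tb in [("SAM cache", 6), ("DST/mDST area", 12)]:
    print(f"{name} ({size_tb} TB): {transfer_hours(size_tb):.1f} hours at 150 MB/sec")
```

At that rate, filling the 6 Tbyte SAM cache from scratch takes on the order of half a day.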

Page 22: The DØ Computing Model

Future Plan

• Major purchases still in FY02
    • New robot and reliable drives
    • New farm nodes
    • More memory for dØmino
    • *Some* linux cpu
• Continue R&D for linux analysis strategies
    • Hope to establish effectiveness and practicality of the three proposed models: AC, CS, AC@DØ
• Operational improvements
    • SAM personnel @ DØ
    • RECO: continue with current release schedules; emphasize quality control and testing for releases; push on cpu, memory, output size issues