8 October 1999 BaBar Storage at CCIN2P3 p. 1 Rolf Rumler

BaBar Storage at Lyon

HEPIX and Mass Storage
SLAC, California, U.S.A.

8 October 1999

Rolf Rumler, John O’Neall, Philippe Gaillardon, Internal Group
IN2P3 Computing Center

Villeurbanne, France
URL http://www.in2p3.fr/CC

BABAR Experiment

• High-energy-physics experiment, started in July at SLAC

• The IN2P3 Computing Center is the “mirror” computing site for BaBar computing.

• We will receive a copy of all BaBar data (well, almost).

• Also will produce simulated data, which will be stored as well as sent to SLAC.

• Estimated data rate is on the order of 350 TB per year

• SLAC has chosen HPSS to store this data; the CCIN2P3 is following their example.

• Our initial goal is to do the same thing as SLAC for BaBar.

• Files >~ 2 GB
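
To put the 350 TB per year figure in perspective, it can be converted into an average sustained rate and a file count. The sketch below is only back-of-the-envelope arithmetic; it assumes decimal units (1 TB = 10^12 bytes), a constant rate over the year, and files of exactly 2 GB, none of which the slides state.

```python
# Back-of-the-envelope conversion of the yearly volume quoted on the slide.
# Assumptions (not from the slides): decimal units, constant rate all year,
# and every file being exactly 2 GB.

SECONDS_PER_YEAR = 365 * 24 * 3600


def average_rate_mb_per_s(tb_per_year: float) -> float:
    """Average sustained rate in MB/sec for a yearly volume given in TB."""
    return tb_per_year * 1e12 / SECONDS_PER_YEAR / 1e6


def files_per_year(tb_per_year: float, file_size_gb: float = 2.0) -> float:
    """Number of files per year if every file has the given size."""
    return tb_per_year * 1e12 / (file_size_gb * 1e9)


if __name__ == "__main__":
    print(f"average rate: {average_rate_mb_per_s(350):.1f} MB/sec")
    print(f"files per year at 2 GB each: {files_per_year(350):,.0f}")
```

Under these assumptions the average comes to about 11 MB/sec, and to at most roughly 175,000 files per year if files are 2 GB or larger.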

How it works

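[Diagram, layout not recoverable. Components: Objectivity, amshpss, file, file.lock, HPSS, ooss_Mig, ooss_Pur, ooss_Stage. Arrow labels: C, L, M, P, R(1), R(2), R(3) (Creation, Lecture (read), Migration, Purge, Recovery). Two arrow types: data and control. pftp is marked on R(3) and on one other path.]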

HPSS Configuration

• For the moment, BaBar only ==> like SLAC

• One single Storage Class in one single COS

• Tape only = Storagetek Redwoods, 9840 and MAGSTARs under study

• No mirroring

• All access to data via pftp_client

• Additional tools from SLAC (Andy Hanushevsky)

Objectivity Configuration Summary

• 1 SUN E4500 (4 CPUs) + 2 SUN A3500, in total about 1.1 TB RAID 5, under Veritas VM/FS, with actual BaBar data

• 1 SUN E4500 + 2 SUN A3500 as above, no data yet

• 1 SUN E450 (4 CPUs) linked to IBM VSS disk space, about 400 GB RAID 5, with Veritas: tests starting next week

• Intention: to have different Objy servers for different types of data

Core Server

HPSS Core Server

• RS/6000 F50

• 4 CPUs, 1 GB memory

• 2 x 4.5 GB mirrored system disks

• 24 GB internal SSA disks for SFS (mirrored)

• AIX 4.3.2

• Ethernet (control network)

• DCE, Encina, SAMMI

• OMI driver for Redwoods

• Access to Storagetek ACL by ACSLS

Mover Stations

HPSS Movers

• Preliminary configuration, while waiting for choice of best machine to use with Gigabit Ethernet; also lacking BABAR usage profile

• (Historical problem: Changed from ATM to Hi-speed Ethernet just as HPSS was arriving)

• RS/6000 390, replacement under study (43P260?)

• 1 CPU, 256 MB memory

• 2 x 4.5 GB mirrored system disks

• AIX 4.3.2

• Ethernet control network, Fast Ethernet data network

Storagetek 4400 Silos (6)

Performance

• Reminder: Temporary mover/network configuration

• Performance limited by:

– Fast Ethernet data path (100 Mbps ==> < 8 MB/sec).

– Mover CPUs: ~50 % occupied.

• Single transfer: ~ 5 MB/sec per tape

• Overall rate per tape slower because of cartridge mount and positioning time, ~ 3.5 MB/sec

• Aggregate maximum transfer rate: > 16 MB/sec (write), ~ 3 MB/sec (read)
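
The drop from ~5 MB/sec to ~3.5 MB/sec per tape can be turned into an implied mount-and-positioning overhead per file. The sketch below is a rough model, not a measurement: it assumes one mount per file and a 2 GB file size (the order of magnitude mentioned for BaBar files), and it attributes the whole difference to mount and positioning time.

```python
# Rough model: time per file = streaming time + fixed mount/positioning overhead.
# The two rates (~5 and ~3.5 MB/sec) come from the slide; the 2 GB file size
# and "one mount per file" are assumptions made for illustration only.


def implied_overhead_s(file_mb: float, streaming_mb_s: float,
                       effective_mb_s: float) -> float:
    """Seconds of fixed overhead that explain the lower effective rate."""
    return file_mb / effective_mb_s - file_mb / streaming_mb_s


def effective_rate_mb_s(file_mb: float, streaming_mb_s: float,
                        overhead_s: float) -> float:
    """Effective rate once a fixed overhead is added to the streaming time."""
    return file_mb / (file_mb / streaming_mb_s + overhead_s)


if __name__ == "__main__":
    overhead = implied_overhead_s(2000, 5.0, 3.5)
    print(f"implied overhead per 2 GB file: {overhead:.0f} s")
    # Raw line rate of Fast Ethernet, before any protocol overhead.
    print(f"raw Fast Ethernet ceiling: {100 / 8:.1f} MB/sec")
```

With these inputs the implied overhead is about 171 seconds per file. The raw 100 Mbps line rate corresponds to 12.5 MB/sec, so the < 8 MB/sec quoted on the slide reflects what the data path delivers in practice.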

Errors during 2nd test (5 days)

[Two bar charts, data not recoverable: “Volume mounts” (per vol/day and per drive/day, axis 0 to 80) and “Total physical errors” (vol errors and drive errors, axis 0 to 4); legend: HPSS, non-HPSS.]

Particular problem: Tape errors

• HPSS and Redwood cartridges, at least with our test usage pattern, do not seem to cohabit well, especially for random reading of ~ 2-GB files.

• Redwoods need regular maintenance (every 100 hours or less) ==> need to be scheduled. Need stats from controllers.

• Need effective maintenance from Storagetek.

• Need tools to monitor volume and drive errors.

• Need for HPSS to react automatically to volume and drive errors. (Example: unable to dismount cartridge ==> HPSS keeps trying indefinitely; drive errors during writing can turn drive into “black hole”)
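
The kind of monitoring tool asked for above could look roughly like the following. This is a minimal sketch, not an existing HPSS or Storagetek tool: the `Drive` record, the (volume, drive) error pairs, the 10-hour margin and the error threshold are all hypothetical; only the 100-hour maintenance interval comes from the slide.

```python
# Minimal sketch of a drive/volume monitor. Everything except the 100-hour
# interval is a hypothetical illustration, not a real HPSS or ACSLS interface.
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

MAINTENANCE_INTERVAL_H = 100  # from the slide: "every 100 hours or less"


@dataclass
class Drive:
    name: str
    hours_since_maintenance: float


def drives_due(drives: List[Drive], margin_h: float = 10.0) -> List[str]:
    """Names of drives within margin_h hours of the maintenance interval."""
    limit = MAINTENANCE_INTERVAL_H - margin_h
    return [d.name for d in drives if d.hours_since_maintenance >= limit]


def suspects(errors: List[Tuple[str, str]],
             threshold: int = 3) -> Tuple[List[str], List[str]]:
    """Volumes and drives with at least `threshold` logged errors.

    `errors` holds one (volume, drive) pair per logged error.
    """
    vol_counts = Counter(vol for vol, _ in errors)
    drive_counts = Counter(drive for _, drive in errors)
    bad_vols = sorted(v for v, n in vol_counts.items() if n >= threshold)
    bad_drives = sorted(d for d, n in drive_counts.items() if n >= threshold)
    return bad_vols, bad_drives


if __name__ == "__main__":
    print(drives_due([Drive("rw1", 95), Drive("rw2", 20)]))
    print(suspects([("V1", "rw1"), ("V1", "rw2"), ("V1", "rw3"), ("V2", "rw1")]))
```

Counting errors separately per volume and per drive helps tell a bad cartridge (errors on many drives) from a bad drive (errors on many cartridges).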

The good(?) news

• Storagetek taking our problems seriously

• Adopted several measures to “minimize our dissatisfaction” (thru end of 1999):

– Maintenance presence > 1 hour/day

– Check cartridges to see if any from known-bad batches

– Problem “PINNACLE”, max severity, to handle problems

– Procedure to follow up on all tapes and drives sent to Storagetek for analysis or repair

– Permanent spare SD-3 at IN2P3 + replacement priority

– Daily log analysis, to monitor errors and report them back to us

– Goal: Anticipate bad vols or drives and replace before they break

Other problem: HPSS manageability

• SAMMI doesn’t make it for us.

• Need to receive a user-configurable subset of the “alarms and events” messages in a script, which can then take the appropriate actions.

• The “appropriate actions” require that appropriate commands be available in command-line form:

– lock a volume or device;

– forward a message via e-mail, Patrol, beeper or other means;

• Many messages are not sufficiently precise or information is lacking.
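
The wish for a script that receives a configurable subset of messages and takes actions might be sketched as follows. This illustrates the idea only: the message format, the rule table and the action strings are hypothetical, and no real HPSS command is called, since the point of the slide is that such command-line commands are not available.

```python
# Sketch of a rule-driven alarm filter. The message layout
# "<SEVERITY> <component> <text>" and all rules are hypothetical; actions
# only return a description instead of calling any real HPSS command.
import re
from typing import Callable, List, Match, Pattern, Tuple

Rule = Tuple[Pattern[str], Callable[[Match[str]], str]]

# User-configurable subset: only messages matching a rule trigger an action.
RULES: List[Rule] = [
    (re.compile(r"^MAJOR .*volume (?P<vol>\w+)"),
     lambda m: f"lock volume {m['vol']}"),
    (re.compile(r"^(MAJOR|CRITICAL) "),
     lambda m: "notify operator"),
]


def dispatch(line: str, rules: List[Rule]) -> List[str]:
    """Return the actions triggered by one alarm/event message."""
    actions = []
    for pattern, action in rules:
        match = pattern.search(line)
        if match:
            actions.append(action(match))
    return actions


if __name__ == "__main__":
    for msg in ("MAJOR mover1 write error on volume RW0042",
                "INFO mover1 mount complete"):
        print(msg, "->", dispatch(msg, RULES))
```

Keeping the rules in a table rather than in code is what makes the subset user-configurable; messages that match no rule are simply ignored.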

Summary

• Greatest current problem is due to errors from Redwood drives; we are studying this problem with Storagetek France. This problem is exacerbated by the next one.

• Greatest long-term problem is manageability, specifically, the lack of adequate non-graphic interfaces to HPSS to permit effective, automatic error detection, performance monitoring and alarm propagation.