8 October 1999 BaBar Storage at CCIN2P3 p. 1 Rolf Rumler
BaBar Storage at Lyon
HEPIX and Mass Storage
SLAC, California, U.S.A.
8 October 1999
Rolf Rumler, John O’Neall, Philippe Gaillardon, Internal Group
IN2P3 Computing Center
Villeurbanne, France
URL http://www.in2p3.fr/CC
BABAR Experiment
• High-energy-physics experiment, started in July at SLAC
• The IN2P3 Computing Center is the “mirror” site for BABAR computing.
• We will receive a copy of all BABAR data (well, almost).
• We will also produce simulated data, which will be stored as well as sent to SLAC.
• Estimated data rate is on the order of 350 TB per year
• SLAC has chosen HPSS to store this data; the CCIN2P3 is following their example.
• Our initial goal is to do the same thing as SLAC for BABAR.
• Files >~ 2 GB
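As a rough check on what 350 TB per year implies for the storage system, the short calculation below converts the estimate into a sustained average rate. It is only a sketch: it assumes decimal units (1 TB = 10^12 bytes), a 365-day year and data arriving evenly, none of which is stated on the slide, and the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year, continuous running


def average_rate_mb_per_s(tb_per_year: float) -> float:
    """Convert a yearly volume in TB to a sustained rate in MB/s (decimal units)."""
    bytes_per_year = tb_per_year * 1e12
    return bytes_per_year / SECONDS_PER_YEAR / 1e6


if __name__ == "__main__":
    # Estimated BABAR volume from the slide: on the order of 350 TB per year.
    print(f"{average_rate_mb_per_s(350):.1f} MB/s")
```

Under these assumptions the estimate works out to about 11 MB/sec averaged over the year, before any allowance for peaks or downtime.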
How it works
[Diagram. Labels: Objectivity, ams, hpss file, file.lock, HPSS, ooss_Mig, ooss_Pur, ooss_Stage. Arrows marked C, L, M, P, R(1), R(2), R(3) (pftp), standing for Creation, Lecture (read), Migration, Purge, Recovery. Legend: data path, control path (pftp).]
HPSS Configuration
• For the moment, BABAR only ==> like SLAC
• One single Storage Class in one single COS
• Tape only = Storagetek Redwoods, 9840 and MAGSTARs under study
• No mirroring
• All access to data via pftp_client
• Additional tools from SLAC (Andy Hanushevsky)
Objectivity Configuration Summary
• 1 SUN E4500 (4 CPUs) + 2 SUN A3500, in total about 1.1 TB RAID 5, under Veritas VM/FS, with actual BaBar data
• 1 SUN E4500 + 2 SUN A3500 as above, no data yet
• 1 SUN E450 (4 CPUs) linked to IBM VSS disk space, about 400 GB RAID 5, with Veritas: tests starting next week
• Intention: to have different Objy servers for different types of data
Core Server
HPSS Core Server
• RS/6000 F50
• 4 CPUs, 1 GB memory
• 2 x 4.5 GB mirrored system disks
• 24 GB internal SSA disks for SFS (mirrored)
• AIX 4.3.2
• Ethernet (control network)
• DCE, Encina, SAMMI
• OMI driver for Redwoods
• Access to Storagetek ACL by ACSLS
Mover Stations
HPSS Movers
• Preliminary configuration, while waiting for choice of best machine to use with Gigabit Ethernet; also lacking BABAR usage profile
• (Historical problem: Changed from ATM to Hi-speed Ethernet just as HPSS was arriving)
• RS/6000 390, replacement under study (43P260?)
• 1 CPU, 256 MB memory
• 2 x 4.5 GB mirrored system disks
• AIX 4.3.2
• Ethernet control network, Fast Ethernet data network
Storagetek 4400 Silos (6)
Performance
• Reminder: Temporary mover/network configuration
• Performance limited by:
  – Fast Ethernet data path (100 Mbps ==> < 8 MB/sec).
  – Mover CPUs: ~50 % occupied.
• Individual transfer: ~ 5 MB/sec per tape
• Global rate slower because of cartridge mount and positioning time, ~ 3.5 MB/sec
• Global max transfer rate: > 16 MB/sec (write), ~ 3 MB/sec (read)
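The relation between the per-tape streaming rate and the slower global rate can be illustrated with a little arithmetic. The sketch below assumes one mount per file and a file size of 2000 MB (taken from the ~2 GB file size mentioned earlier in the talk); both are assumptions made for illustration, as are the function names. It also shows the raw wire ceiling of Fast Ethernet, of which the < 8 MB/sec quoted above is the practical part.

```python
def wire_limit_mb_per_s(link_mbps: float) -> float:
    """Raw link capacity in MB/s, ignoring all protocol overhead."""
    return link_mbps / 8.0


def implied_overhead_s(file_mb: float, stream_rate: float, global_rate: float) -> float:
    """Mount and positioning time per file implied by the two observed rates.

    global_rate = file_mb / (file_mb / stream_rate + overhead), solved for overhead.
    """
    return file_mb / global_rate - file_mb / stream_rate


if __name__ == "__main__":
    print(f"Fast Ethernet ceiling: {wire_limit_mb_per_s(100):.1f} MB/s")
    # Assumed: 2000 MB file, 5 MB/s streaming, 3.5 MB/s global (figures from the slide).
    print(f"Implied overhead per file: {implied_overhead_s(2000, 5.0, 3.5):.0f} s")
```

With these assumed inputs the implied mount and positioning overhead is about 171 seconds per file, so smaller files or more random access would pull the global rate down further.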
Errors during 2nd test (5 days)
[Figure: two bar charts. “Volume mounts”: per vol/day and per drive/day. “Total physical errors”: vol errors and drive errors. Series: HPSS, non-HPSS.]
Particular problem: Tape errors
• HPSS and Redwood cartridges, at least with our test usage pattern, do not seem to cohabit well, especially for random reading of ~ 2-GB files.
• Redwoods need regular maintenance (every 100 hours or less) ==> need to be scheduled. Need stats from controllers.
• Need effective maintenance from Storagetek.
• Need tools to monitor volume and drive errors.
• Need for HPSS to react automatically to volume and drive errors. (Example: unable to dismount cartridge ==> HPSS keeps trying indefinitely; drive errors during writing can turn drive into “black hole”)
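The dismount example suggests the kind of automatic reaction being asked for: give up after a bounded number of attempts and take the drive out of service rather than retrying forever. The sketch below illustrates that policy only. It is not HPSS code; `try_dismount` and `lock_drive` are hypothetical callbacks standing in for whatever commands a site would actually have available.

```python
import time
from typing import Callable


def dismount_with_limit(try_dismount: Callable[[], bool],
                        lock_drive: Callable[[], None],
                        max_attempts: int = 3,
                        delay_s: float = 0.0) -> bool:
    """Attempt a dismount at most max_attempts times; lock the drive if all fail."""
    for attempt in range(max_attempts):
        if try_dismount():          # hypothetical: returns True on success
            return True
        if attempt < max_attempts - 1:
            time.sleep(delay_s)     # optional pause between attempts
    lock_drive()                    # hypothetical: remove the drive from service
    return False


if __name__ == "__main__":
    locked = []
    # Simulate a cartridge that never dismounts.
    ok = dismount_with_limit(lambda: False, lambda: locked.append("drive0"))
    print(ok, locked)
```

Locking the drive after a fixed number of failures keeps one stuck cartridge from absorbing further requests, which is the “black hole” behaviour described above.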
The good(?) news
• Storagetek taking our problems seriously
• Adopted several measures to “minimize our dissatisfaction” (through end of 1999):
  – Maintenance presence > 1 hour/day
  – Check cartridges to see if any from known-bad batches
  – Problem “PINNACLE”, max severity, to handle problems
  – Procedure to follow up on all tapes and drives sent to Storagetek for analysis or repair
  – Permanent spare SD-3 at IN2P3 + replacement priority
  – Daily log analysis, to monitor errors and report them back to us
  – Goal: Anticipate bad vols or drives and replace before they break
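The daily log analysis and the goal of anticipating bad volumes or drives amount to counting errors per component and flagging those that cross a threshold. The sketch below shows one way this might look. The record format, the `_error` suffix convention and the threshold of 3 are all assumptions made for illustration; they do not describe Storagetek's or HPSS's actual logs.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record: (component, event), e.g. ("vol:A00012", "read_error").
Record = Tuple[str, str]


def flag_suspects(records: Iterable[Record], threshold: int = 3) -> List[str]:
    """Return components whose error count reaches the threshold, worst first."""
    errors = Counter(comp for comp, event in records if event.endswith("_error"))
    return [comp for comp, count in errors.most_common() if count >= threshold]


if __name__ == "__main__":
    day = [
        ("vol:A00012", "read_error"),
        ("drive:3", "mount"),
        ("vol:A00012", "read_error"),
        ("drive:3", "write_error"),
        ("vol:A00012", "position_error"),
    ]
    print(flag_suspects(day))
```

In this made-up sample only the volume with three errors is flagged; the drive with a single error stays below the threshold.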
Other problem: HPSS manageability
• SAMMI does not meet our needs.
• Need to receive a user-configurable subset of the “alarms and events” messages in a script, which can then take the appropriate actions.
• The “appropriate actions” require that appropriate commands be available in command-line form:
  – lock a volume or device;
  – forward a message via e-mail, Patrol, beeper or other means.
• Many messages are not sufficiently precise or information is lacking.
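The requirement above, a script that receives a configurable subset of messages and triggers actions, could take roughly the following shape. This is a sketch of the idea, not an existing HPSS interface: the `SEVERITY|SOURCE|TEXT` message layout, the rule patterns and the action names are invented for illustration, and the actions here only record what they would do.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

Rule = Tuple[str, str]  # (regular expression, action name); user-configurable


def dispatch(lines: Iterable[str],
             rules: List[Rule],
             actions: Dict[str, Callable[[str], None]]) -> List[Tuple[str, str]]:
    """Run the first matching rule's action for each message; return what fired."""
    fired = []
    for line in lines:
        for pattern, name in rules:
            if re.search(pattern, line):
                actions[name](line)
                fired.append((name, line))
                break  # first matching rule wins; unmatched messages are ignored
    return fired


if __name__ == "__main__":
    log: List[str] = []
    actions = {
        "lock_device": lambda msg: log.append(f"would lock: {msg}"),
        "notify": lambda msg: log.append(f"would send e-mail: {msg}"),
    }
    rules = [(r"drive \d+ .*error", "lock_device"), (r"^ALARM", "notify")]
    messages = [
        "ALARM|mover1|drive 3 write error",
        "EVENT|core|mount ok",
        "ALARM|core|volume A00012 read error",
    ]
    dispatch(messages, rules, actions)
    print("\n".join(log))
```

Keeping the rules as plain data is what makes the subset user-configurable; the real difficulty noted above is that the actions themselves need command-line equivalents to call.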
Summary
• Greatest current problem is due to errors from Redwood drives; we are studying this problem with Storagetek France. This problem is exacerbated by the next one.
• Greatest long-term problem is manageability, specifically, the lack of adequate non-graphic interfaces to HPSS to permit effective, automatic error detection, performance monitoring and alarm propagation.