Apr. 14, 2004 | HSM at the NCCS: from UniTree to SAM-QFS | NASA Goddard | IEEE MSST 2004
NASA Center for Computational Sciences
Hierarchical Storage Management at the NASA Center for Computational Sciences: From UniTree to SAM-QFS

[email protected]
Computing Branch, Earth and Space Science Computing Division
NASA Goddard Space Flight Center, Code 931
NCCS’s Mission and Customers
• NASA Center for Computational Sciences (NCCS) at NASA Goddard Space Flight Center
• Mission: Enable Earth and space sciences research (via data assimilation and computational modeling) by providing state-of-the-art facilities in
  – High Performance Computing (HPC)
  – Mass Storage
  – High-speed Networking
  – HPC Computational Science Expertise
• Earth and space science customers:
  – Seasonal-to-interannual climate and ocean prediction
  – Global weather and climate data sets incorporating data assimilated from numerous land-based and satellite-borne instruments
NCCS Resources
• High Performance Compute Engines
  – HP Compaq ES45 AlphaServer SC (1,392p)
  – SGI Origin 3800s (608p total)
  – ~4.5 TFLOPS peak total
• Mass Storage Systems and Servers
  – Mass Data Storage and Delivery System (MDSDS), was UniTree, now SAM-QFS, on Sun Fire 15K, 2 domains, Shared QFS “HA”
    • ~355 TiB*, ~12M files, DDN S2A 8000 disk arrays
  – SGI DMF, Origin 3800 server, to be converted to SAM-QFS via “DMS” (Data Management System, based on SRB)
    • ~350 TiB*, ~14M files, HDS 9960 and SGI TP 9x00 disk arrays
• Tape Libraries, Intra-Machine/Device Networks, Switches
  – Nine STK Powderhorn tape libraries (~51,000 slots)
    • STK 9840C, STK 9940B, STK 9840A tape drives
  – Gigabit Ethernet, Foundry BigIron 15K
  – 2-Gb Fibre Channel, Brocade SilkWorm 12K, 3900s

* Unique (does not include risk-mitigation duplicates)
NCCS Mass Storage Growth

[Figure: NCCS Mass Data Storage and Delivery System (MDSDS) Growth, fiscal years 1999 to 2003. Left axis: total data (TB), 0 to 700; right axis: millions of files, 0 to 14. Includes risk-mitigation duplicated data.]
NCCS Projected Requirements
• Earth and Space Science drivers: increasing
  – Model resolution
  – Number of assimilated observations
  – Number of concurrent model ensembles
• Total data held (including risk-mitigation duplicates):
  – Current: ~1.5 PiB
  – End of FY 2005: ~6 PiB
  – End of FY 2007: ~19 PiB
• Files (unique):
  – Current: ~25 million, growing ~33% per year
  – FY 2005: ~44 million
  – FY 2007: ~78 million
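The unique-file projections above are consistent with simple compound growth at ~33% per year. A minimal sketch, assuming the FY 2005 figure is two growth steps out and the FY 2007 figure four (an inference from the slide's numbers, not stated in the source):

```python
# Compound growth at ~33%/year reproduces the slide's file-count
# projections: ~25M now, ~44M by FY 2005, ~78M by FY 2007.
def projected_files(current_millions: float, rate: float, years: int) -> float:
    """Files (in millions) after `years` growth steps at `rate` per year."""
    return current_millions * (1 + rate) ** years

fy2005 = projected_files(25, 0.33, 2)  # ~44.2 million
fy2007 = projected_files(25, 0.33, 4)  # ~78.2 million
print(round(fy2005, 1), round(fy2007, 1))  # 44.2 78.2
```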
HSM Evaluation (Spring 2002)
• High-end HSM vendors’ responses to 60-some technical questions:
  – SGI’s DMF
  – Sun’s SAM-QFS
  – Legato’s DiskXtender (UniTree Central File Manager)
  – IBM’s HPSS
• NCCS and CSC team evaluated:
  – Performance
  – Integrity, high availability
  – Scalability, modularity, flexibility
  – Balance (avoiding bottlenecks)
  – Manageability
UniTree to SAM-QFS Migration
• Employed Sun’s SAM migration toolkit and migration libraries written by Instrumental, Inc.
  – Legacy UniTree directory and file “inode” info harvested, then inserted into SAM-QFS file systems
• Legacy UniTree: ~300 TB, ~11M files
• Only 5 days of downtime, including QC checks and server recabling (~11M files, ~300K directories)
  – Transparent user retrieval of legacy files: SAM sees UniTree files as “stranger” media, so it reads files via the migration library, then archives them to SAM tapes
    • Approach requires the UniTree system to read legacy files/media
  – Background migration: via a Perl script (D. Duffy et al.); UniTree files pre-staged tape by tape for efficiency
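The tape-by-tape pre-staging idea can be sketched as a planning step: group the pending legacy files by the cartridge (VSN) that holds them, then drain whole tapes in turn so each cartridge is mounted once. The record format, VSN names, and tape ordering below are illustrative assumptions, not the actual UniTree/SAM-QFS metadata layout or the Perl script's logic:

```python
from collections import defaultdict

def plan_migration(files):
    """files: iterable of (path, vsn, position) tuples.
    Returns an ordered plan: for each tape, its files sorted by
    on-tape position, so a single mount streams them in order."""
    by_tape = defaultdict(list)
    for path, vsn, position in files:
        by_tape[vsn].append((position, path))
    # Drain the fullest tapes first to free cartridges sooner
    # (one plausible policy; the source does not specify ordering).
    plan = []
    for vsn in sorted(by_tape, key=lambda v: -len(by_tape[v])):
        plan.append((vsn, [p for _, p in sorted(by_tape[vsn])]))
    return plan

# Hypothetical pending-file list for illustration.
pending = [
    ("/u/a/run1.nc", "VSN001", 40),
    ("/u/a/run2.nc", "VSN002", 10),
    ("/u/b/run3.nc", "VSN001", 5),
    ("/u/b/run4.nc", "VSN001", 12),
]
for vsn, paths in plan_migration(pending):
    print(vsn, paths)
```

Processing a tape's files in position order avoids repeated seeks and remounts, which is where the "pre-staged tape by tape for efficiency" win comes from.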
Strong Benefits of NCCS’s Current SAM-QFS Configuration
• Performance observed in daily use: over 10 TB/day archived while handling 2+ TB/day of user traffic
• Shared QFS works well to make the underlying cluster appear as a single entity
• Using an “HA flip-flop” for significant software upgrades has greatly reduced downtime
• A test cluster system has been invaluable
• Restoring files after accidental deletions is much simpler/faster than with the previous solution
Lessons Learned
• Complexities of clustered HSM systems make configuration of automated high-availability software challenging
• The “Release Currency Conundrum”:
  – A software release’s newest features will be the most immature
  – Keeping current on OS and HSM patches can help to avoid significant pitfalls
• Make “risk mitigation” duplicate tape copies
• Keep your expectations of vendors high
  – Great support/cooperation from Sun in getting “Traffic Manager” (a.k.a. MPxIO) to work with a 3rd-party Fibre Channel RAID array (DataDirect Networks S2A 8000)
Background Detail
The Large Team

Ellen Salmon, Adina Tarshish, Nancy Palm, Tom Schardt
NASA Center for Computational Sciences (NCCS)
NASA Goddard Space Flight Center, Code 931
Greenbelt, Maryland 20771
Email: [email protected]

Sanjay Patel, Marty Saletta, Ed Vanderlan, Mike Rouch, Lisa Burns, Dr. Daniel Duffy, Roman Turkevich
Computer Sciences Corporation, c/o NASA GSFC Code 931
Greenbelt, Maryland 20771
Email: [email protected]

Robert Caine, Randall Golay, Craig Flaskerud, Linda Radford, Matt Hatley
Sun Microsystems, Inc.
7900 Westpark Drive
McLean, VA 22102
Email: [email protected]

Jeff Paffel, Nathan Schumann
Instrumental, Inc.
2748 East 82nd Street
Bloomington, MN 55425
Email: [email protected]
NCCS MDSDS SAM-QFS User Transfer Traffic

[Figure: weekly user transfer traffic, 16 Sep. 2003 through 6 Apr. 2004; GiB retrieved and GiB stored per week, on a 0 to 2500 GiB scale.]

From 16 Sep. 2003 to 12 Apr. 2004:
• 119.5 TiB stored (2.6 Mfiles)
• 37.5 TiB retrieved (1.0 Mfiles)
• 157.1 TiB transferred total (3.7 Mfiles)
NCCS MDSDS SAM-QFS Daily Tape Write Activity

[Figure: daily tape write activity, 1 Nov. 2003 through 9 Apr. 2004; total TiB/day (0 to 9) and total Kfiles/day (0 to 450) written.]

From 1 Nov. 2003 to 12 Apr. 2004:
• 605.5 TiB written
• 16.4 million files written
(includes risk-mitigation duplicates)
The Future (1)
• Further optimize data placement on tape to favor data retrieval
  – Issue: adequately characterizing retrievals
• Explore SATA disk as the most nearline part of the HSM hierarchy
  – The NCCS data retrieval profile makes this somewhat problematic
  – But it becomes more attractive as time-to-first-data rises on growing-capacity tape
  – Not expected to replace tape any time soon
• National Lambda Rail participation: enable large-scale, long-distance science team collaboration
The Future (2): Data Management System
• Goal: help users manage their data
• Based on the San Diego Supercomputer Center’s Storage Resource Broker (SRB) middleware; system developed by Halcyon Systems, Inc.
• Replaces file system access
• Allows for extremely useful metadata and queries, for monitoring and management, e.g.:
  – File content and provenance
  – File expiration
• Allows for transparent (to the user) migration between underlying HSMs
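To make the metadata-and-queries point concrete, here is a minimal sketch of the kind of catalog such a system maintains: each archived file carries queryable attributes such as provenance and an expiration date. The schema, paths, and project names are illustrative assumptions, not the actual SRB/DMS design:

```python
import sqlite3

# Toy metadata catalog: one row per archived file, with provenance
# and expiration attributes that management queries can filter on.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE files (
    path    TEXT PRIMARY KEY,
    project TEXT,   -- provenance: which model run produced the file
    expires TEXT    -- ISO date after which the file may be purged
)""")
con.executemany("INSERT INTO files VALUES (?, ?, ?)", [
    ("/archive/clim/run1.nc", "seasonal-climate", "2004-12-31"),
    ("/archive/wx/obs1.nc",   "weather-assim",    "2004-03-01"),
])

# Management query: which files have expired as of a given date?
expired = [row[0] for row in con.execute(
    "SELECT path FROM files WHERE expires < ?", ("2004-04-14",))]
print(expired)  # ['/archive/wx/obs1.nc']
```

This is exactly the sort of question a plain HSM file system cannot answer without crawling every file, which is the motivation for layering a metadata broker like SRB on top.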
References
[1] http://nccs.nasa.gov.
[2] J. McGalliard and D. Glassbrook, “Performance Management at an Earth Science Supercomputer Center.”
[3] E. Salmon, “Storage and Network Bandwidth Requirements Through the Year 2000 for the NASA Center for Computational Sciences,” Proceedings of the Fifth Goddard Conference on Mass Storage Systems and Technologies (1996), pp. 273-286.
[4] A. Tarshish, E. Salmon, M. Macie, and M. Saletta, “Mass Storage System Upgrades at the NASA Center for Computational Sciences,” Proceedings of the Eighth NASA Goddard Conference on Mass Storage Systems and Technologies, Seventh IEEE Symposium on Mass Storage Systems (2000), pp. 325-334.
[5] J. Paffel, “UniTree to SAM-QFS Project Plan,” Instrumental, Inc., NCCS internal report.
[6] D. Duffy, “UniTree to SAM-QFS Migration Procedure,” Computer Sciences Corporation, NCCS internal report.
[7] Sun SAM-FS and Sun SAM-QFS Storage and Archive Management Guide, August 2002; Sun QFS, Sun SAM-FS, and Sun SAM-QFS File System Administrator’s Guide.
[8] http://www.npaci.edu/DICE/SRB/.
Standard Disclaimers and Legalese Eye Chart
• All trademarks, logos, or otherwise registered identification markers are owned by their respective parties.
• Disclaimer of Liability: With respect to this presentation, neither the United States Government nor any of its employees makes any warranty, express or implied, including the warranties of merchantability and fitness for a particular purpose, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights.
• Disclaimer of Endorsement: Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government. In addition, NASA does not endorse or sponsor any commercial product, service, or activity.
• The views and opinions of the author(s) expressed herein do not necessarily state or reflect those of the United States Government and shall not be used for advertising or product endorsement purposes.