Maui High Performance Computing Center
Open System Support
An AFRL, MHPCC and UH Collaboration
December 18, 2007
Mike McCraney, MHPCC Operations Director
Agenda
• MHPCC Background and History
• Open System Description
• Scheduled and Unscheduled Maintenance
• Application Process
• Additional Information Required
• Summary and Q/A
An AFRL Center

• An Air Force Research Laboratory Center
• Operational since 1993
• Managed by the University of Hawaii
  – Subcontractor partners: SAIC / Boeing
• A DoD High Performance Computing Modernization Program (HPCMP) Distributed Center
• Task Order Contract: maximum estimated ordering value of $181,000,000
  – Performance dependent: 10 years
  – 4-year base period with two 3-year term awards
A DoD HPCMP Distributed Center

[Organization chart: Director, Defense Research and Engineering → DUSD (Science and Technology) → High Performance Computing Modernization Program]

Major Shared Resource Centers:
• Aeronautical Systems Center (ASC)
• Army Research Laboratory (ARL)
• Engineer Research and Development Center (ERDC)
• Naval Oceanographic Office (NAVO)

Allocated Distributed Centers:
• Army High Performance Computing Research Center (AHPCRC)
• Arctic Region Supercomputing Center (ARSC)
• Maui High Performance Computing Center (MHPCC)
• Space and Missile Defense Command (SMDC)

Dedicated Distributed Centers:
• ATC, AFWA, AEDC, AFRL/IF, Eglin, FNMOC, JFCOM/J9, NAWC-AD, NAWC-CD, NUWC, RTTC, SIMAF, SSCSD, WSMR
MHPCC HPC History
[Chart: MHPCC HPC Growth, 1996–2007. Left axis: Memory/Processors (0–14,000); right axis: Disk (TB)/TFlops (0–350). Series: Processors, Memory, Disk, TFlops.]
1994 - IBM P2SC Typhoon Installed
1996 - 2000 IBM P2SC
2000 - IBM P3 Tempest Installed
2001 - IBM Netfinity Huinalu Installed
2002 - IBM P2SC Typhoon Retired
2002 - IBM P4 Tempest Installed
2004 - LNXi Evolocity II Koa Installed
2005 - Cray XD1 Hoku Installed
2006 - IBM P3 Tempest Retired
2007 - IBM P4 Tempest Reassigned
2007 - Dell Poweredge Jaws Installed
Hurricane Configuration Summary

Current Hurricane configuration:
• Eight 32-processor/32GB "nodes": IBM p690 POWER4
• Jobs may be scheduled across nodes for a total of 288p
• Shared-memory jobs can span up to 32p and 32GB
• 10TB shared disk available to all nodes
• LoadLeveler scheduling (see the job file sketch below)
• One job per node, in 32p chunks, so at most 8 simultaneous jobs

Issues:
• Old technology, reaching end of life, upgradability issues
• Cost prohibitive: power consumption is constant, roughly $400,000 in annual power cost
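For orientation, a LoadLeveler job on Hurricane is described in a job command file. The sketch below is minimal and hypothetical: the executable name and wall-clock limit are placeholders, not MHPCC defaults.

    #!/bin/sh
    # Minimal LoadLeveler job command file requesting one 32-processor
    # Hurricane node. Executable and limits are hypothetical.
    #@ job_type         = parallel
    #@ node             = 1
    #@ tasks_per_node   = 32
    #@ wall_clock_limit = 01:00:00
    #@ output           = myjob.$(jobid).out
    #@ error            = myjob.$(jobid).err
    #@ queue
    ./my_mpi_app

The file is submitted with llsubmit and monitored with llq.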
Dell Configuration Summary

Proposed Shark configuration:
• 40 4-processor/8GB "nodes": Intel 3.0GHz dual-core Woodcrest processors
• Jobs may be scheduled across nodes for a total of 160p
• Shared-memory jobs can span up to 8p and 16GB
• 10TB shared disk available to all nodes
• LSF scheduler (see the job script sketch below)
• One job per node, in 8p chunks, supporting up to 40 simultaneous jobs

Features/Issues:
• Shared use as an open system and TDS (test and development system)
• Much lower power cost: Intel power management
• System already maintained and in use
• System covered by 24x7 UPS and generator
• Possible short-notice downtime
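Under LSF, the equivalent request for a single 8p node chunk uses bsub directives. Again a minimal, hypothetical sketch; the executable and limits are placeholders.

    #!/bin/sh
    # Minimal LSF job script requesting one 8-core Shark node chunk.
    # Executable and wall-clock limit are hypothetical.
    #BSUB -n 8                 # request 8 slots (one node chunk)
    #BSUB -R "span[hosts=1]"   # keep all slots on a single node
    #BSUB -W 01:00             # one-hour wall-clock limit
    #BSUB -o myjob.%J.out      # stdout; %J expands to the job ID
    #BSUB -e myjob.%J.err      # stderr
    mpirun ./my_mpi_app

Submit with "bsub < myjob.lsf" and monitor with bjobs.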
Jaws Architecture

Head node for system administration:
• "Build" nodes
• Running parallel tools (pdsh, pdcp, etc.; a usage sketch follows below)

SSH communications between nodes:
• Localized InfiniBand network
• Private Ethernet

Dell Remote Access Controllers:
• Private Ethernet
• Remote power on/off
• Temperature reporting
• Operability status
• Alarms
• 10 blades per chassis

CFS Lustre filesystem:
• Shared access
• High performance
• Using the InfiniBand fabric
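As an illustration, day-to-day interaction with a Lustre client goes through the standard lfs utility; the directory paths below are hypothetical.

    lfs df -h                          # free space per Lustre target
    lfs getstripe /lustre/myproject    # show how a directory stripes files
    lfs setstripe -c 4 /lustre/mydata  # stripe new files across 4 OSTs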
[Architecture diagram: user webtops connect over the DREN network; a Cisco 6500 core ties together 10 Gig-E Ethernet, Fibre Channel, and Cisco copper InfiniBand fabrics; the head node, 3 interactive nodes (12 cores), and a simulation engine of 1280 batch nodes (5120 cores) sit on the InfiniBand fabric; 24 Lustre I/O nodes and 1 MDS serve 200 TB of DDN storage; compute nodes are Gig-E with 10 Gig-E uplinks, 40 nodes per uplink.]
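The parallel tools named above are driven from the head node's shell. A short sketch with hypothetical node names:

    pdsh -w n[001-010] uptime            # run a command on ten nodes at once
    pdcp -w n[001-010] app.conf /tmp/    # push a file to the same nodes
    pdsh -w n[001-010] date | dshbak -c  # coalesce identical output per host group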
Shark Software

Systems software:
• Red Hat Enterprise Linux v4 (2.6.9 kernel)
• InfiniBand: Cisco software stack
• MVAPICH: MPICH 1.2.7 over IB library (see the sketch below)
• GNU 3.4.6 C/C++/Fortran
• Intel 9.1 C/C++/Fortran
• Platform LSF HPC 6.2
• Platform Rocks
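In practice the MPI stack is reached through the MVAPICH compiler wrappers. A minimal, hypothetical sketch; the source file name is a placeholder, and the launcher name varies with the MVAPICH build (mpirun_rsh is common on older versions).

    mpicc -O2 -o hello hello.c    # MVAPICH wrapper around the C compiler
    mpirun -np 8 ./hello          # launch 8 MPI ranks over InfiniBand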
Maintenance Schedule

Current:
• 2:00pm – 4:00pm
• 2nd and 4th Thursdays (as necessary)
• Check the website (mhpcc.hpc.mil) for maintenance notices

New proposed schedule:
• 8:00am – 5:00pm
• 2nd and 4th Wednesdays (as necessary)
• Check the website for maintenance notices

Maintenance is taken only on scheduled systems; check on Mondays before submitting jobs.
Account Applications and Documentation

Contact the helpdesk or the website for application information.

Documentation needed:
• Account names, systems, special requirements
• Project title, nature of work, accessibility of code
• Nationality of applicant
• Collaborative relevance with AFRL

New requirements:
• "Case file" information
• For use in AFRL research collaboration
• Future AFRL applicability
• Intellectual property shared with AFRL

Annual account renewals:
• September 30 is the final day of the fiscal year
Summary

• Anticipated migration to Shark
• Should be more productive and able to support a wider range of jobs
• Cutting-edge technology
• Cost savings relative to Hurricane (roughly $400,000 annually)
• Stay tuned for the timeline: likely end of January or early February