
1

Maui High Performance Computing Center

Open System Support

An AFRL, MHPCC and UH Collaboration

December 18, 2007

Mike McCraney, MHPCC Operations Director

2

Agenda

MHPCC Background and History

Open System Description

Scheduled and Unscheduled Maintenance

Application Process

Additional Information Required

Summary and Q/A

3

An AFRL Center

An Air Force Research Laboratory Center

Operational since 1993

Managed by the University of Hawaii
• Subcontractor Partners – SAIC / Boeing

A DoD High Performance Computing Modernization Program (HPCMP) Distributed Center

Task Order Contract – Maximum Estimated Ordering Value = $181,000,000
• Performance Dependent – 10 Years
• 4-Year Base Period with 2, 3-Year Term Awards


4

A DoD HPCMP Distributed Center

Director, Defense Research and Engineering

DUSD (Science and Technology)

High Performance Computing Modernization Program

Major Shared Resource Centers
• Aeronautical Systems Center (ASC)
• Army Research Laboratory (ARL)
• Engineer Research and Development Center (ERDC)
• Naval Oceanographic Office (NAVO)

Allocated Distributed Centers
• Army High Performance Computing Research Center (AHPCRC)
• Arctic Region Supercomputing Center (ARSC)
• Maui High Performance Computing Center (MHPCC)
• Space and Missile Defense Command (SMDC)

Dedicated Distributed Centers
• ATC
• AFWA
• AEDC
• AFRL/IF
• Eglin
• FNMOC
• JFCOM/J9
• NAWC-AD
• NAWC-CD
• NUWC
• RTTC
• SIMAF
• SSCSD
• WSMR

5

MHPCC HPC History

MHPCC HPC Growth (chart, 1996–2007): processors and memory on the left axis, disk (TB) and TFlops on the right axis; series plotted are Processors, Memory, Disk, and TFlops.

1994 - IBM P2SC Typhoon Installed

1996 - 2000 IBM P2SC

2000 - IBM P3 Tempest Installed

2001 - IBM Netfinity Huinalu Installed

2002 - IBM P2SC Typhoon Retired

2002 - IBM P4 Tempest Installed

2004 - LNXi Evolocity II Koa Installed

2005 - Cray XD1 Hoku Installed

2006 - IBM P3 Tempest Retired

2007 - IBM P4 Tempest Reassigned

2007 - Dell PowerEdge Jaws Installed

6

Hurricane Configuration Summary

Current Hurricane Configuration:

Eight, 32 processor/32GB “nodes” – IBM P690 Power4

Jobs may be scheduled across nodes for a total of 288p

Shared memory jobs can span up to 32p and 32GB

10TB shared disk available to all nodes

LoadLeveler scheduling (see the sample job command file at the end of this slide)

One job per node – 32p chunks – can only support 8 simultaneous jobs

Issues:

Old technology, reaching end of life, upgradability issues

Cost prohibitive – power consumption is essentially constant, ~$400,000 annual power cost
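
For reference, whole-node jobs on Hurricane are requested through a LoadLeveler job command file along the lines of the sketch below; this is only an illustration – the job name, wall-clock limit, and my_app executable are placeholders, not site defaults.

    #!/bin/bash
    # Hypothetical LoadLeveler job command file (submit with: llsubmit job.cmd)
    # Hurricane schedules in whole-node 32p chunks, so request 1 node, 32 tasks.
    # @ job_name         = open_test
    # @ job_type         = parallel
    # @ node             = 1
    # @ tasks_per_node   = 32
    # @ wall_clock_limit = 01:00:00
    # @ output           = open_test.$(jobid).out
    # @ error            = open_test.$(jobid).err
    # @ queue

    # Launch the MPI executable under POE (IBM's parallel launcher on the P690s)
    poe ./my_app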

7

Dell Configuration Summary

Proposed Shark Configuration:

40, 4 processor/8GB “nodes” – Intel 3.0 GHz dual-core Woodcrest processors

Jobs may be scheduled across nodes for a total of 160p

Shared memory jobs can span up to 8p and 16GB

10TB shared disk available to all nodes

LSF scheduling (see the sample job script at the end of this slide)

One job per node – 8p chunks – can support up to 40 simultaneous jobs

Features/Issues:

Shared use as Open system and TDS (test and development system)

Much lower power cost – Intel power management

System already maintained and in use

System covered 24x7 – UPS, generator

Possible short-notice downtime
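
As a sketch only (job name, wall-clock limit, and the my_app executable are placeholders, and the MPI launcher depends on the local MVAPICH build), an LSF job script that requests one full 8p chunk on Shark could look like this:

    #!/bin/bash
    # Hypothetical LSF job script (submit with: bsub < job.lsf)
    # Shark schedules in 8p chunks; span[ptile=8] keeps all slots on one node.
    #BSUB -J open_test
    #BSUB -n 8
    #BSUB -R "span[ptile=8]"
    #BSUB -W 01:00
    #BSUB -o open_test.%J.out
    #BSUB -e open_test.%J.err

    # Launch an MPI executable built against the MVAPICH stack
    mpirun -np 8 ./my_app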

8

Jaws Architecture

Head Node for System Administration

• “Build” Nodes

• Running Parallel Tools

– (pdsh, pdcp, etc.; see the usage sketch after this list)

SSH Communications Between Nodes

• Localized Infiniband Network

• Private Ethernet

Dell Remote Access Controllers

• Private Ethernet

• Remote Power On/Off

• Temperature Reporting

• Operability Status

• Alarms

• 10 Blades Per Chassis

CFS Lustre Filesystem

• Shared Access

• High Performance

• Using Infiniband Fabric
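
The parallel tools noted above (pdsh/pdcp) are typically driven as in the sketch below; the node names n[001-040] are made-up examples, not actual Jaws hostnames.

    # Run a command on a range of compute nodes over SSH
    pdsh -w n[001-040] uptime

    # Push a file out to the same set of nodes
    pdcp -w n[001-040] /etc/ntp.conf /etc/ntp.conf

    # Collapse identical output from many nodes into one line for readability
    pdsh -w n[001-040] uname -r | dshbak -c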

System diagram: user webtop access over the DREN network; head node; 3 interactive nodes (12 cores); simulation engine of 1,280 batch nodes (5,120 cores); 24 Lustre I/O nodes plus 1 MDS in front of DDN storage (200 TB, Fibre Channel); Cisco InfiniBand (copper) fabric; Gig-E nodes with 10 Gig-E Ethernet uplinks (40 nodes per uplink) into a Cisco 6500 core.

9

Shark Software

Systems Software

• Red Hat Enterprise Linux v4
  – 2.6.9 kernel

• InfiniBand – Cisco software stack

• MVAPICH
  – MPICH 1.2.7 over IB library (see the build/run sketch below)

• GNU 3.4.6 C/C++/Fortran

• Intel 9.1 C/C++/Fortran

• Platform LSF HPC 6.2

• Platform Rocks
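
For orientation, building and launching an MPI code against this stack follows the usual MPICH 1.2.7 conventions, roughly as below; the source file and executable names are placeholders, and the exact launcher (mpirun vs. mpirun_rsh) depends on the local MVAPICH install.

    # Compile with the MPI wrapper scripts, which invoke the underlying
    # GNU 3.4.6 or Intel 9.1 compilers depending on the configured environment
    mpicc  -O2 -o my_app my_app.c
    mpif90 -O2 -o my_sim my_sim.f90

    # Launch over the InfiniBand fabric
    mpirun -np 8 ./my_app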

10

Maintenance Schedule

New Proposed Schedule
• 8:00am – 5:00pm
• 2nd and 4th Wednesdays (as necessary)
• Check website for maintenance notices

Current
• 2:00pm – 4:00pm
• 2nd and 4th Thursdays (as necessary)
• Check website (mhpcc.hpc.mil) for maintenance notices

Only scheduled systems are taken down for maintenance
Check the website on Mondays before submitting jobs

11

Account Applications and Documentation

Contact the Helpdesk or website for application information

Documentation Needed:

• Account names, systems, special requirements

• Project title, nature of work, accessibility of code

• Nationality of applicant

• Collaborative relevance with AFRL

New Requirements

• “Case File” information

• For use in AFRL research collaboration

• Future AFRL applicability

• Intellectual property shared with AFRL

Annual Account Renewals

• September 30 is the final day of the fiscal year

12

Summary

Anticipated migration to Shark

Should be more productive and able to support a wide range of jobs

Cutting-edge technology

Cost savings relative to Hurricane (~$400,000 annual power cost)

Stay tuned for the timeline – likely late January or early February

13

Mahalo