Page 1: MURI Hardware Resources

MURI Hardware Resources

Ray Garcia

Erik Olson

Space Science and Engineering Center at the University of WI - Madison

Page 2: MURI Hardware Resources

Resources for Researchers

- CPU cycles
- Memory
- Storage space
- Network
- Software
  - Compilers
  - Models
  - Visualization programs

Page 3: MURI Hardware Resources

Original MURI hardware

- 16 Pentium III processors
- Storage server with 0.5 TB
- Gigabit networking
- Purpose:
  - Provide a working environment for collaborative development.
  - Enable running of the large multiprocessor MM5 model.
  - Gain experience working with clustered systems.

Page 4: MURI Hardware Resources

Capabilities and Limitations

- Successfully ran initial MM5 model runs, algorithm development (fast model), and modeling of GIFTS optics (FTS simulator).
- MM5 model runs for 140 by 140 domains; one 270 by 270 run with very limited time steps.
- OpenPBS scheduled hundreds of jobs, with idle CPU time given to FDTD raytracing (see the submission sketch below).
- Expanded to 28 processors using funding from B. Baum, IPO, and others.
- However, MM5 model runtime limited domain size, and storage space limited the number of output time steps.
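The slides do not show the batch scripts themselves, so the following is only a minimal sketch of how hundreds of independent raytracing cases might be queued on an OpenPBS system; the job names, resource requests, and the fdtd_raytrace command are illustrative placeholders, not the actual MURI scripts.

```python
# Minimal sketch: queue many independent FDTD raytracing cases on OpenPBS.
# Job names, resource requests, and the fdtd_raytrace command are
# illustrative placeholders, not the actual MURI job scripts.
import subprocess
import tempfile

PBS_TEMPLATE = """#!/bin/sh
#PBS -N fdtd_{case:04d}
#PBS -l nodes=1:ppn=2
#PBS -l walltime=04:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
./fdtd_raytrace --case {case}
"""

def submit_case(case: int) -> str:
    """Write a one-off PBS script, hand it to qsub, and return the job id."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(PBS_TEMPLATE.format(case=case))
        script = f.name
    result = subprocess.run(["qsub", script], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    job_ids = [submit_case(c) for c in range(200)]
    print(f"submitted {len(job_ids)} jobs")
```

Queued this way, the scheduler can place the small cases onto CPUs left idle by the larger MM5 runs, which is consistent with the slide's note that idle CPU time went to FDTD raytracing.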

Page 5: MURI Hardware Resources

CY2003 Upgrade

NASA provided funding for 11 dual Pentium 4 processor nodes:
- 4 GB DDR RAM
- 2.4 GHz CPUs

Expressly purposed for running large IHOP field program simulations (400 by 400 grid point domain).

Page 6: MURI Hardware Resources

Cluster “Mark 2”

Gains:
- Larger-scale model runs and instrument simulations as needed for IHOP
- Terabytes of experimental and simulation data online through NAS-hosted RAID arrays

Limitations to further work at even larger scale:
- Interconnect limitations slowed large model runs
- 32-bit memory limitation on huge model set-up jobs for MM5 and WRF
- Increasing number of small storage arrays

Page 7: MURI Hardware Resources

3 Years of Cluster Work

- Inexpensive: adding CPUs to the system
- Costly: adding users to the system; adding storage to the system
- Easily understood: Matlab
- Not so well understood: distributed system (computing, storage) capabilities

Page 8: MURI Hardware Resources

Along comes DURIP

H. L. Huang / R. Garcia DURIP proposal awarded May 2004.

Purpose: Provide hardware for next generation research and education programs.

Scope: Identify computing and storage systems to serve the need to expand simulation, algorithm research, data assimilation and limited operational product generation experiments.

Page 9: MURI Hardware Resources

Selecting Computing Hardware

Cluster options for numerical modeling were evaluated and found to require significant time investment.

Purchased an SGI Altix in fall 2004 after extensive test runs with WRF and MM5:
- 24 Itanium 2 processors running Linux
- 192 GB of RAM
- 5 TB of FC/SATA disk

Recently upgraded to 32 CPUs and 10 TB of storage.

Page 10: MURI Hardware Resources

SGI Altix Capabilities

- Large, contiguous RAM allows a 1600 by 1600 grid point domain (> CONUS area at 4 km resolution); the largest run so far is 1070 by 1070 (see the memory estimate below).
- NUMAlink interconnect provides fast turnaround for model runs.
- Presents itself as a single 32-CPU Linux machine.
- Intel compilers for ease of porting and optimizing Fortran/C on 32-bit and 64-bit hardware.
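Since the slides give only the horizontal grid sizes, the following back-of-envelope estimate assumes 40 vertical levels and roughly 30 full 3-D single-precision fields in the working set, and ignores I/O buffers; it is meant only to illustrate why a 1600 by 1600 domain wants a large shared-memory machine rather than a 32-bit node.

```python
# Back-of-envelope memory estimate for an MM5/WRF-style domain.
# The vertical level count and the number of 3-D fields held in memory
# are assumed values; the slides only give the horizontal grid sizes.
def domain_memory_gb(nx, ny, nz=40, n3d_fields=30, bytes_per_value=4):
    """Rough single-domain working-set size in GB."""
    return nx * ny * nz * n3d_fields * bytes_per_value / 2**30

for n in (400, 1070, 1600):
    print(f"{n} x {n}: ~{domain_memory_gb(n, n):.1f} GB")

# Under these assumptions a 1600 x 1600 domain needs on the order of 11 GB,
# beyond what any single 32-bit process can address but a small fraction of
# the Altix's 192 GB of shared RAM.
```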

Page 11: MURI Hardware Resources

Storage Class: Home Directory

- Small size, for source code (preferably also held under CVS control) and critical documents
- Nightly incremental backups (see the sketch below)
- Quota enforcement
- Current implementation:
  - Local disks on the cluster head
  - Backup by Technical Computing (TC)
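The actual backup tooling is run by TC and is not described in the slides; as an illustration of the nightly incremental idea, here is a hedged sketch using rsync hard-link snapshots, with hypothetical source and destination paths.

```python
# Sketch of a nightly incremental home-directory backup: each night gets a
# dated snapshot, with unchanged files hard-linked against the previous
# snapshot via rsync --link-dest.  Paths are hypothetical.
import datetime
import os
import subprocess

SRC = "/home/"                # assumed home-directory tree
DEST_ROOT = "/backups/home"   # assumed backup area

def nightly_backup():
    dest = os.path.join(DEST_ROOT, datetime.date.today().isoformat())
    latest = os.path.join(DEST_ROOT, "latest")
    cmd = ["rsync", "-a", "--delete"]
    if os.path.exists(latest):
        cmd.append("--link-dest=" + latest)  # reuse unchanged files
    cmd += [SRC, dest]
    subprocess.run(cmd, check=True)
    # Point "latest" at the new snapshot for tomorrow's run.
    if os.path.lexists(latest):
        os.remove(latest)
    os.symlink(dest, latest)

if __name__ == "__main__":
    nightly_backup()
```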

Page 12: MURI Hardware Resources

Storage Class: Workspace

- Optimized for speed
- Automatic flushing of unused files (see the sketch below)
- No insurance against disk failure; users are expected to move important results to long-term storage
- Current implementation: RAID5 or RAID0 drive arrays within the cluster systems
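The flushing policy itself is not spelled out in the slides; a minimal sketch of one plausible approach, removing files whose access time is older than an assumed 30-day window, might look like this (the /workspace path and the threshold are placeholders):

```python
# Sketch of an "automatic flushing" pass over workspace: report (or remove)
# files not accessed within an assumed 30-day window.  The /workspace mount
# point and the threshold are placeholders, not the actual site policy.
import os
import time

def flush_unused(root="/workspace", max_idle_days=30, dry_run=True):
    cutoff = time.time() - max_idle_days * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    print("flush:", path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it

if __name__ == "__main__":
    flush_unused(dry_run=True)  # report only; set dry_run=False to delete
```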

Page 13: MURI Hardware Resources

Storage Class: Long-term

- Large amount of space
- Redundant, preferably backed up to tape
- Managed directory system, preferably with metadata
- Current implementation:
  - Lots of project-owned NAS devices with partial redundancy (RAID5)
  - NFS spaghetti
  - Ad-hoc tape backup

Page 14: MURI Hardware Resources

DURIP phase 2: Storage

Long-term storage scaling and management goals:
- Reduce or eliminate NFS ‘spaghetti’
- Include a hardware phase-in / phase-out strategy in purchase decisions
- Acquire the hardware to seed a Storage Area Network (SAN) in the Data Center, improving uniformity and scalability
- Reduce overhead costs (principally human time)
- Work closely with the Technical Computing group on system setup and operations for a long-term facility

Page 15: MURI Hardware Resources

Immediate Options

- Red Hat GFS: size limitations and hardware/software mix-and-match; support costs offset the free source code.
- HP Lustre: more likely to be a candidate for workspace. Expensive.
- SDSC SRB (Storage Resource Broker): stability, documentation, and maturity at the time of testing were found to be inadequate.
- Apple Xsan: plays well with third-party storage hardware. Straightforward to configure and maintain. Affordable.

Page 16: MURI Hardware Resources

Dataset Storage Purchase Plan

- 64-bit storage servers and a metadata server
- QLogic Fibre Channel switch to move data between hosts and drive arrays
- SAN software to provide a distributed filesystem: focusing on Apple Xsan for a 1-3 year span, followed by a 1-year assessment with the option of re-competing
- Storage arrays: competing Apple XRAID and Western Scientific Tornado

Page 17: MURI Hardware Resources

Target System for 2006

- Scalable dataset storage accessible from clusters, workstations, and the supercomputer
- Backup strategy
- Update existing cluster nodes to ROCKS: simplifies management and improves uniformity; proven on other clusters deployed by SSEC
- Retire/repurpose slower cluster nodes
- Reduce bottlenecks to workspace disk
- Improve ease of use and understanding

Page 18: MURI Hardware Resources

Long-term Goals

- 64-bit shared memory system scaled to huge job requirements (Altix)
- Complementary compute farm migrating to x86-64 (Opteron) hardware
- Improved workspace performance
- Scalable storage with full metadata for long-term and published datasets
- Software development tools for multiprocessor algorithm development