SGE Training NASA LaRC ASDC Delivered May 5,6,7 2009 Chris Dwan Bioteam

SGE Training
NASA LaRC ASDC

Delivered May 5, 6, 7 2009
Chris Dwan

Bioteam

cdwan@bioteam.net

Bioteam Inc.

Independent consulting shop
Vendor/technology agnostic
Staffed by:
  Scientists forced to learn High Performance IT
  Many years of industry & academic experience
Our specialty: Bridging the gap between Science & IT

Session Goals

Interactive / Small Group Goals
  1 - 2 hours
  1 - 5 people
  Users log into systems.
  Users type examples, run jobs.
  If code is available, bring it.
  If specific use cases exist, bring them.

Selected ASDC Systems

Apple Cluster
  Online and in use at SCF since 2007
  ~40 dual processor OS X systems (80+ CPUs)
  Access through manila and corregidor

Magneto
  ~28 quad core Linux servers (100+ CPUs)
  Online and in production use since 2006

New Magneto (ORR May 15)
  Large, mixed purpose Linux cluster / file store
  176 CPUs dedicated to SCF
  576 CPUs dedicated to production
  Disk based archive: 1.1PB

Apple Cluster Access:

  LDAP account
  manila or corregidor

NASA LaRC Science Directorate

Picture taken 9/2/08
1.2PB usable space
Fibre connected (384+ fibre ports)
2,560 individual disk drives
  16 disks per chassis
  10 chassis per rack
  16 racks of disks

IBM Linux servers, mixed P6 and x86 CPUs to support legacy codes

Filesystem: IBM GPFS

Operational Readiness Review
Mid May 2009

Stay Tuned

Interactive hosts

Sun Grid Engine

Technical Introduction

Please do not copy, put online or redistribute info@bioteam.net

Most “grids” look like this on paper…

[Diagram labels: Private Network; Local Area Network; Portal node(s); Dedicated File services; Compute Nodes]

… and in reality:

Sun Grid Engine History
http://blogs.sun.com/templedf/entry/a_little_history_lesson

1996:
  Codine 4.02
  Grid Resource Director (GRD) 1.0
2000:
  SGE 5.2. Sun acquires Gridware Inc.
2001:
  SGE 5.3. Sun releases source code
  Last version called GRD
2004:
  SGE(EE) vs. SGE
  N1GE vs. SGE

Sun Grid Engine References

http://gridengine.sunsource.net/
  Generally, the user manuals are awful
http://gridengine.info/
  Very useful blog run by Chris Dagdigian
My slides / examples are going to be online in-house.
Deep in-house expertise.

Compute Farm Logical View

[Diagram labels: Cluster Network; User 1; User N; Distributed Resource Manager]

Grid Engine does the following:

  Accepts work requests (jobs) from users
  Puts jobs in a pending area
  Sends jobs from the pending area to the best available machine
  Manages the job while it runs
  Returns results and logs accounting data when the job is finished
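
The cycle above can be pictured with a toy model. This is only an illustrative sketch in Python, not how Grid Engine is implemented: the class names are made up, and "best available machine" is assumed here to mean "the host with the most free slots", which is a simplification of the real scheduler.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    free_slots: int


@dataclass
class ToyScheduler:
    """Toy model of the accept / pend / dispatch / account cycle (hypothetical)."""
    hosts: list
    pending: deque = field(default_factory=deque)
    running: dict = field(default_factory=dict)
    accounting: list = field(default_factory=list)
    next_id: int = 1

    def submit(self, command):
        # Accept a work request and park it in the pending area.
        job_id = self.next_id
        self.next_id += 1
        self.pending.append((job_id, command))
        return job_id

    def dispatch(self):
        # Send pending jobs to the "best" host. Assumption: best means
        # the host with the most free slots.
        while self.pending:
            best = max(self.hosts, key=lambda h: h.free_slots)
            if best.free_slots == 0:
                break  # cluster is full, the rest stay pending
            job_id, command = self.pending.popleft()
            best.free_slots -= 1
            self.running[job_id] = (command, best)

    def finish(self, job_id, exit_status=0):
        # Release the slot and log an accounting record for the job.
        command, host = self.running.pop(job_id)
        host.free_slots += 1
        self.accounting.append(
            {"job_id": job_id, "command": command,
             "host": host.name, "exit_status": exit_status}
        )


if __name__ == "__main__":
    sched = ToyScheduler([Host("node1", 1), Host("node2", 2)])
    for script in ["a.sh", "b.sh", "c.sh", "d.sh"]:
        sched.submit(script)
    sched.dispatch()
    print("running:", sorted(sched.running), "pending:", list(sched.pending))
```

With three slots and four jobs, one job waits in the pending area until a running job finishes and a later dispatch pass picks it up.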

Huh? What you need to know:

Don’t worry about queues or specific machines.
All you need to do when submitting a job is describe the resources your job will need to run successfully.
Grid Engine will take care of the rest.
The ‘default’ settings are good enough for most cases.
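
As a rough illustration of "describe the resources", the sketch below builds a qsub command line in Python. The helper name `qsub_command` and the script name `analyze.sh` are made up for this example. The flags `-cwd`, `-N` and `-l` are standard qsub options; `h_rt` (wall clock limit) and `mem_free` are commonly available resource names, but which resources can be requested depends on how the site has configured Grid Engine, so treat the values as assumptions.

```python
import shlex


def qsub_command(script, name=None, resources=None, use_cwd=True):
    """Build a qsub argument list that names resources, not queues or hosts."""
    argv = ["qsub"]
    if use_cwd:
        argv.append("-cwd")          # run the job from the current directory
    if name:
        argv += ["-N", name]         # job name shown in qstat
    if resources:
        # -l takes a comma-separated list of resource=value requests
        request = ",".join(f"{key}={value}" for key, value in resources.items())
        argv += ["-l", request]
    argv.append(script)
    return argv


if __name__ == "__main__":
    argv = qsub_command(
        "analyze.sh",                                   # hypothetical script
        name="demo",
        resources={"h_rt": "01:00:00", "mem_free": "2G"},  # assumed resource names
    )
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Note that no queue or host appears anywhere in the command; placement is left to Grid Engine.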

Most useful SGE commands

qsub / qdel
  Submit jobs & delete jobs
qstat & qhost
  Status info for queues, hosts and jobs
qacct
  Summary info and reports on completed jobs
qrsh
  Get an interactive shell on a cluster node
  Quickly run a command on a remote host
qmon
  Launch the X11 GUI interface
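
To show how these commands fit together over the life of one job, here is a small Python sketch that lists typical invocations. The function name, the job id 4242 and the user name are placeholders for illustration; the flags used (`qstat -u`, `qstat -j`, `qacct -j`) are standard, but the output each command prints varies with the Grid Engine version and site setup, so none is shown.

```python
def lifecycle_commands(job_id, user):
    """Return (purpose, argv) pairs for following a single job."""
    job = str(job_id)
    return [
        ("list this user's pending and running jobs", ["qstat", "-u", user]),
        ("show scheduler details for one job", ["qstat", "-j", job]),
        ("show the state of the execution hosts", ["qhost"]),
        ("delete the job", ["qdel", job]),
        ("report accounting data after the job ends", ["qacct", "-j", job]),
    ]


if __name__ == "__main__":
    # 4242 and "alice" are placeholder values, not real cluster data.
    for purpose, argv in lifecycle_commands(4242, "alice"):
        print(f"{' '.join(argv):<20} # {purpose}")
```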

Examples

Live Examples
  Single job
  Single job with resource requirements
  Job dependency
  Task array job
  Demand a whole compute node
  Consumable resources
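
The live examples were typed at the terminal and are not reproduced in the slides. As a stand-in, the Python sketch below builds the kind of qsub command each item usually involves. All script names, the parallel environment name `smp`, the slot count and the consumable name `licenses` are hypothetical: parallel environments and consumable resources are defined per site, and requesting every slot on a node through a parallel environment is only one common way to claim a whole node. The flags `-l`, `-hold_jid`, `-t` and `-pe` are standard qsub options.

```python
def qsub(script, *options):
    """Assemble a qsub argument list with the script last."""
    return ["qsub", *options, script]


def single_job(script):
    return qsub(script, "-cwd")


def with_resources(script, **resources):
    request = ",".join(f"{k}={v}" for k, v in resources.items())
    return qsub(script, "-cwd", "-l", request)


def after(script, *job_ids_or_names):
    # -hold_jid keeps the job pending until the listed jobs have finished
    return qsub(script, "-hold_jid", ",".join(str(j) for j in job_ids_or_names))


def task_array(script, first, last, step=1):
    # Each task sees its own index in the SGE_TASK_ID environment variable
    return qsub(script, "-t", f"{first}-{last}:{step}")


def whole_node(script, pe_name, slots):
    # Assumption: the site defines a parallel environment that keeps all
    # slots on one host, and `slots` equals the slots on one node.
    return qsub(script, "-pe", pe_name, str(slots))


def consumable(script, resource, amount):
    # Assumption: `resource` is a consumable defined by the administrators
    return qsub(script, "-l", f"{resource}={amount}")


if __name__ == "__main__":
    examples = [
        single_job("hello.sh"),
        with_resources("hello.sh", h_rt="00:10:00"),
        after("merge.sh", "split_job"),
        task_array("chunk.sh", 1, 10),
        whole_node("big.sh", "smp", 4),
        consumable("tool.sh", "licenses", 1),
    ]
    for argv in examples:
        print(" ".join(argv))
```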
