UT Grid Project

Jay Boisseau, Texas Advanced Computing Center

SURA Grid Application Planning & Implementations Workshop

December 7, 2005

Outline

• Overview
  – Vision
  – Strategy
  – Approach
• Current Project Status, Near-Term Goals
  – UT Grid Production Compute Resources
    • Roundup
    • Rodeo
  – Interfaces to production resources:
    • Grid User Portal
    • Grid User Node
  – Tools to support resources:
    • GridPort
    • GridShell
    • Metascheduling Prediction Services
• Future Work and Plans

UT Grid Vision: A Powerful, Flexible, and Simple Virtual Environment for Research & Education

The UT Grid project vision is to create a cyberinfrastructure for research and education in which people can develop and test ideas, collaborate, teach, and learn through applications that seamlessly harness the diverse campus compute, visualization, storage, data, and instruments as needed from their personal systems (PCs) and interfaces (web browsers, GUIs, etc.).

UT Grid: Develop and Provide a Unique, Comprehensive Cyberinfrastructure…

The strategy of the UT Grid project is to integrate…
  – common security/authentication
  – scheduling and provisioning
  – aggregation and coordination

diverse campus resources…
  – computational (PCs, servers, clusters)
  – storage (local HDs, NASes, SANs, archives)
  – visualization (PCs, workstations, displays, projection rooms)
  – data collections (sci/eng, social sciences, communications, etc.)
  – instruments & sensors (CT scanners, telescopes, etc.)

from ‘personal scale’ to terascale…
  – personal laptops and desktops
  – department servers and labs
  – institutional (and national) high-end facilities

…That Provides Maximum Opportunity & Capability for Impact in Research, Education

…into a campus cyberinfrastructure…
  – evaluate existing grid computing technologies
  – develop new grid technologies
  – deploy and support appropriate technologies for production use
  – continue evaluation, R&D on new technologies
  – share expertise, experiences, software & techniques

that provides simple access to all resources…
  – through web portals
  – from personal desktop/laptop PCs, via custom CLIs and GUIs

to the entire community for maximum impact on
  – computational research in applications domains
  – educational programs
  – grid computing R&D


Add Services Incrementally, Driven By User Requirements

Texas Two-Step: Hub & Spoke Approach

• Deploying a P2P campus grid requires overcoming two trust issues
  – grid software: reliability, security, and performance
  – each other: trusting others not to abuse one’s own resources
• An advanced computing center presents an opportunity to build a centrally managed grid as a step toward a P2P grid
  – already has trust relationships with users
  – so, when facing both issues, install grid software centrally first
    • create centrally managed services
    • create spokes from central hub
  – then, when grid software is trusted
    • show usage and capability data to demonstrate opportunity
    • show policies and procedures to ensure fairness
    • negotiate spokes among willing participants

UT Grid: Logical View

• Integrate a set of resources (clusters, storage systems, etc.) within TACC first (actually spread across two campuses)
• Next add other UT resources using same tools and procedures
• Finally negotiate connections between spokes for willing participants to develop a P2P grid

[Diagram, built up over five slides. Hub: TACC Compute, Vis, Storage, Data. Spokes added in turn: ACES (PCs, Cluster, Data), GEO (Cluster, Data), BIO (Data, Instrument), PGE (Cluster, Data, Instrument).]

Enhancing Grid Computing R&D and Deployment Expertise for UT and for IBM

• Benefits for IBM
  – Increased knowledge of diverse grid user and application requirements in universities
  – Access to new software technologies developed for UT Grid
  – Early awareness of new distributed & grid computing R&D opportunities
  – Exposure & expertise in a variety of grid technologies, open source & commercial, which can be shared internally
  – Experience to be gained from maintaining large distributed production grids
  – Collaboration with UT in conducting new distributed & grid computing R&D activities, including publications, proposals
  – Exposure among TACC’s collaborators and peers for expertise in grid deployment services, capabilities

Enhancing Grid Computing R&D and Deployment Expertise for UT and for IBM

• Benefits for UT Austin
  – greater access to all resources by entire community
  – more effective utilization of existing and future resources
  – unique capabilities presented by access, aggregation, coordination for research, education
  – enhanced collaborative capabilities among researchers, and among teachers & students
• Additional Benefits for TACC
  – increased expertise in grid deployment issues
  – early awareness of new distributed & grid computing R&D opportunities
  – platform for conducting new distributed & grid computing R&D activities

Enhancing Grid Computing R&D and Deployment Expertise for UT and for IBM

• Benefits for TACC Partners
  – UT Grid-supported technologies being integrated into TeraGrid: GridPort/user portal, GridShell/user node, etc.
  – Expertise being developed in scheduling will be used in TeraGrid
  – UT Grid developments will be used
    • in TIGRE and SURA Grid
    • by TACC partners in UT System, HiPCAT, U.S., Latin America
    • by TACC industrial partners
• Benefits for Community
  – UT Grid producing IBM DeveloperWorks articles
  – UT Grid R&D will produce professional papers in Year 2 (and proposals)

TACC Grid Technology & Deployment Activities Provide Synergy Through Tech Transfer

• UT Grid
  – creating new tools for integrating compute, vis, storage and data across campus, from ‘personal scale’ to terascale
  – will exchange tools, experiences with TeraGrid & TIGRE to advance both and be interoperable with each
• TeraGrid
  – will utilize & promote UT Grid user portal & user node technologies, and scheduling & workflow results
  – will provide grid visualization and data collection services to UT Grid, benefiting TACC and IBM
• TIGRE
  – will utilize, promote UT Grid results and expertise to other state institutions, including industry
  – will provide additional experiences with UT Grid technologies from users from across state, helping to refine technologies

UT Grid Compute Resources

• PCs and workstations
  – Roughly 1/2 are Windows on Intel/AMD and 1/3 are Macs
  – Most of rest are Linux on Intel/AMD
• Networks of PCs and Workstations
  – Roundup: United Devices-managed network of PCs
    • Non-dedicated, heterogeneous compute resources across campus
    • Some managed by TACC, ITS, or other departments; some individually managed
    • Windows, Linux & Mac desktop PCs
  – Rodeo: Condor-managed network of PCs
    • Dedicated & non-dedicated, heterogeneous compute resources
    • Some managed by TACC, ITS, or other departments; some individually managed
    • Linux, Windows & Mac PCs, plus some workstations
• Clusters
  – Lonestar: 1024-processor Linux cluster at TACC
  – Wrangler: 656-processor Linux cluster at TACC
  – Longhorn: 128 processors in 4-way IBM p655 nodes at TACC
  – Other smaller clusters at TACC
  – Various department/lab clusters from 4 to 128+ processors will be included
  – Resources have different resource managers (LSF, PBS, SGE)
• High-end Servers
  – Longhorn: IBM system w/32 Power4 processors, 128 GB memory
  – Maverick: Sun system w/64 dual-core UltraSPARC 4 procs, 512 GB mem

Interfaces and Tools for these Resources

• For a broad, diverse campus community, access must be easy and from local resources
  – Users access Grid User Portal with standard web browser
    • Grid User Portal submits to Rodeo via SOAP
      – UT-Grid Condor Web Services layer developed to facilitate
      – Condor portlet part of GridPort 4 release
    • Grid User Portal submits to Roundup via Hosted Applications
  – Users access Grid User Node with SSH
    • Grid User Node submits to Rodeo via GridShell
      – GridShell provides command line interface through shell façade
      – Abstracts user from underlying grid technology and complexity
      – Submits to specific resource or determines most appropriate resource using catalog services
    • Grid User Node submits to Roundup
      – Batch job submission supported via GridShell
      – CLI for submitting hosted application jobs
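To make the façade idea concrete, here is a minimal sketch in Python of one way a single entry point could hide which resource manager sits behind a resource. It is not GridShell code. The `submit_command` helper and the job file names are hypothetical, and real submissions would need site-specific options that are left out here.

```python
from typing import List

# Standard submit commands for the resource managers named in this deck.
SUBMIT_COMMANDS = {
    "LSF": "bsub",
    "PBS": "qsub",
    "SGE": "qsub",
    "Condor": "condor_submit",
}


def submit_command(resource_manager: str, job_file: str) -> List[str]:
    """Build the command line a facade might run (hypothetical helper)."""
    try:
        command = SUBMIT_COMMANDS[resource_manager]
    except KeyError:
        raise ValueError(f"unsupported resource manager: {resource_manager}") from None
    # Simplified: real sites usually add queue, CPU count, and wall-time options.
    return [command, job_file]


print(submit_command("PBS", "job.sh"))
print(submit_command("Condor", "job.sub"))
```

The caller names only the target resource manager and the job file; the lookup table is the single place that knows which tool each scheduler expects.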

Accessing UT Grid Compute Resources: Hosted User Nodes & Portals

[Diagram labels: User; Grid User Portal; Grid User Node (Windows, Mac, Linux); Compute Resources (PC Grid with Condor and United Devices, HPC); Storage; Visualization; links labeled GRAM/GridFTP.]

Current Status & Near Term Goals

Roundup

[Diagram labels: UT Grid User; Grid User Portal; Grid User Node (Windows, Mac, Linux); Roundup; Frio; United Devices Roundup Grid MP; PC groups in ITS, ENG (College of Engineering), CS (Computer Science), TACC, COC (College of Communication), and College of Fine Arts.]

Roundup: Current Status

• Roundup is a production UT Grid resource
  – Production system with over 1000 PCs distributed across campus
  – Automated account request and creation
  – Production level consulting
  – Comprehensive user guide
  – Training classes offered at TACC
• Client downloads available for Windows, Mac, Linux from UT Grid web site
• Hosted Applications Installed
  – HMMER, BLAST, POV-Ray, Coorset, etc.

Roundup: Next Steps

• Near-term goals (few months):
  – Support additional production users
  – GSI Integration
    • United Devices Grid MP has capability for multiple authentication schemes
    • Need to add support extension for GSI
  – Evaluate MP Insight data warehousing and report generation package
  – Test and evaluate screen saver feature and start development of UT-specific screen saver
  – Investigate possible solutions to enable sharing jobs across grids
    • Multi-grid agents or job forwarding

Rodeo

[Diagram labels: UT Grid User; Grid User Portal; Grid User Node (Windows, Mac, Linux); Rodeo; TACC Condor Pool; ICES Condor Pool; CS (Computer Science) Condor Pool; Collector/Negotiator (shown twice).]

Rodeo: Current Status

• Rodeo is a production UT Grid resource
  – Production system with over 500 PCs made up of dedicated clusters and PCs distributed on campus
  – Automated account request and creation
  – Production level consulting
  – Comprehensive user guide
  – Training classes offered at TACC
• Client downloads available for Windows, Mac, Linux from UT Grid web site

Rodeo: Current Status

• Currently the largest production users are:
  – UTCS (Department of Computer Sciences)
  – Graeme Henkelman (Chemistry)
  – Wolfgang Bangerth (Geosciences)

Rodeo: Next Steps

• Near-term goals (few months):
  – Continue supporting production users
  – Expand the number of CPUs available to users
  – Explore ‘hosted’ applications possibilities

UT Grid Interfaces

• UT Grid will provide two types of interfaces:
  – Web-based Grid User Portal (GUP) accessible via any web browser
  – Customized desktop environments for Linux, Windows and Macintosh PCs to act as Grid User Nodes (GUN)

• Users can access all UT Grid resources using either the GUP or GUNs managed by UT Grid.

• They will also be able to download the necessary software to build and host their own customized grid user portals or convert their personal desktop systems into grid user nodes.

Motivation for a Grid User Portal

• Lower the barrier to entry for novice users
• Provide a centralized grid account management interface
  – Easy access to multiple resources through a single interface
• Simple GUI interface to complex grid computing capabilities
  – Provide simple alternatives to CLI for advanced users

• Present a “Virtual Organization” view of the Grid as a whole

• Increase productivity of UT researchers – do more science!

Grid User Portal: Current Status

• Added Roundup and Rodeo as production resources on TACC User Portal

• Developed JSR-168 compliant portlets that can:
  – View information on resources within UT Grid, including status, load, jobs, queues, etc.
  – View network bandwidth and latency between systems, aggregate capabilities for all systems
  – Submit user jobs
  – Manage files across systems, and move/copy multiple files between resources with transfer time estimates
• These portlets contribute to GridPort 4 release
• TACC leading portal effort in TeraGrid
  – This will impact TACC User Portal and therefore UT Grid

Grid User Portal: Next Steps

• Near-term plans (few months):
  – Complete new TACC User Portal (TUP) based on GridPort 4 including UT Grid resources
    • UT Grid capabilities fully integrated into TUP
    • Ability to customize environment to only expose UT Grid resources
  – Migrate portlets to WebSphere to ensure compatibility(?)
  – Grid Account Management Portlets

Grid User Node

• The Linux GUN current capabilities:
  – Information queries about grid resources
  – Job submission
    • Parallel computing jobs (Dedicated Cluster Resources)
    • Serial computing jobs (Roundup, Rodeo)
  – Monitoring job status
  – Reviewing job results
  – Resource brokering based on ClassAd catalogs
  – GridFTP enabled GSIFTP

Grid User Node: Current Status

• Production Linux, development Windows and Mac GUNs
  – Need to decide whether to do GUI versions
• Submission to Roundup and Rodeo
• “On-Demand” glide-in of UD resources into Condor pool
• Integrated “real-life” applications
  – NAMD
  – SNOOP3D
  – HMMER
  – POV-Ray

Grid User Node: Next Steps

• Near-term goals:
  – Investigating distribution of GUN software stack using VDT
  – Prepare and present training class before the end of the year

GridPort: Current Status

• GridPort 4 developed and released this month

• Available to UT Grid and national users as a grid portal toolkit to download and create user and application portals

• Based on JSR-168 compliant portlets
• Leveraged technology and knowledge in UT Grid to create Condor and comprehensive file transfer portlets

GridPort: Next Steps

• Near-term goals
  – GridPort 4 will be part of the TeraGrid User Portal, to be in production in 1Q06
  – Preparing demonstration and lab at Grid Workshop in Venezuela in April 2006
  – Continue evolution of GridPort to include:
    • Advanced job submission functionality
    • Advanced user customization, and more
  – Investigating demo portal based in WebSphere

GridShell: Current Status

• GridShell developed and deployed on UT Grid (and TeraGrid)

• Available to UT Grid users in the GUN software stack.

• Able to submit jobs first to a Spoke (departmental cluster) and then to the Hub (TACC) if not enough resources are available at the Spoke.

• Collaborating with researchers at PSC and Caltech, we have extended GridShell to provide a single job submission interface (Condor) to the heterogeneous clusters on the TeraGrid.
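The spoke-then-hub fallback described above can be sketched in a few lines of Python. This is an illustration of the policy only, not GridShell's implementation. The `Cluster` type, the `choose_target` function, the cluster names, and the idea that free CPU counts are known up front are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    name: str       # hypothetical label for a spoke or the hub
    free_cpus: int  # idle CPUs, assumed to be known at submission time


def choose_target(cpus_needed: int, spoke: Cluster, hub: Cluster) -> Optional[str]:
    """Prefer the departmental spoke; fall back to the hub when it is too small."""
    if spoke.free_cpus >= cpus_needed:
        return spoke.name
    if hub.free_cpus >= cpus_needed:
        return hub.name
    return None  # neither site can start the job right now


spoke = Cluster("dept-spoke", free_cpus=16)
hub = Cluster("tacc-hub", free_cpus=512)

print(choose_target(8, spoke, hub))   # small job stays on the spoke
print(choose_target(64, spoke, hub))  # larger job falls back to the hub
```

Keeping the spoke first means departmental jobs use departmental hardware whenever they fit, and the hub only absorbs the overflow.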

GridShell: Next Steps

• Near-term goals
  – Create a public download site for GridShell 1.0 (current version available only to NSF TeraGrid and UT Grid users)
  – Continue evolution of GridShell to include:
    • Support for submitting jobs to clusters with firewalls
  – Need to hire an additional developer and develop partnerships with external developers (e.g. GridPort)

MPS: Current Status

• Goal is to reduce turnaround times of jobs by optimizing resource selection for data movements, queue wait times, and performance
• Components
  – Prediction Services
    • Execution times, queue wait times, file transfer times
  – Resource Brokering
    • Immediately select resources based on job requirements
      – Including predictions
  – Metascheduling
    • Schedule complex jobs such as workflows
    • Workload management
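As a minimal sketch of how a broker could combine the three predicted quantities, the Python below treats turnaround as file transfer time plus queue wait plus execution time and picks the resource with the smallest total. The additive model, the `Prediction` type, the `select_resource` function, and all numbers are assumptions for illustration; the slides do not specify how MPS combines its predictions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    resource: str        # hypothetical resource name
    transfer_s: float    # predicted file transfer time, seconds
    queue_wait_s: float  # predicted queue wait time, seconds
    execution_s: float   # predicted execution time, seconds

    @property
    def turnaround_s(self) -> float:
        # Assumed model: the three phases happen one after another.
        return self.transfer_s + self.queue_wait_s + self.execution_s


def select_resource(predictions: List[Prediction]) -> Prediction:
    """Return the prediction with the smallest total turnaround."""
    if not predictions:
        raise ValueError("no candidate resources")
    return min(predictions, key=lambda p: p.turnaround_s)


# Placeholder values, not measurements from UT Grid.
candidates = [
    Prediction("cluster-a", transfer_s=60, queue_wait_s=3600, execution_s=1800),
    Prediction("cluster-b", transfer_s=300, queue_wait_s=600, execution_s=2400),
]

best = select_resource(candidates)
print(best.resource, best.turnaround_s)
```

In this made-up case the slower machine with the shorter queue wins, which is the kind of trade-off that prediction-aware brokering is meant to expose.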

MPS: Next Steps

• Near-term goals:
  – Create prediction web services
    • Based on existing R&D
    • Predictions based on
      – Historical information
      – Learning algorithms
      – Scheduling simulations
  – Integrate with Condor-G
    • Provide additional information about clusters
      – The clusters themselves (e.g. number of CPUs)
      – The jobs submitted to the clusters
    • Add call outs so matchmaker can request predictions
    • User requests minimizing predicted response time as part of ranking
  – Demonstration with Graham Carey (ICES / UT Austin)
    • Selecting which cluster at TACC to use
    • Matchmaking capability using MPS to rank systems based on user request
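One simple reading of "predictions based on historical information" is to average past queue waits per cluster and rank clusters by that estimate. The Python sketch below does exactly that. The mean-based predictor stands in for whatever models MPS actually uses, and the cluster names and wait times are made-up placeholders.

```python
from statistics import mean
from typing import Dict, List

# Made-up queue waits in seconds for past jobs; placeholders, not real data.
history: Dict[str, List[float]] = {
    "cluster-a": [1200.0, 900.0, 1500.0],
    "cluster-b": [300.0, 600.0, 450.0],
}


def predict_wait(waits: List[float]) -> float:
    """Predict the next queue wait as the mean of past waits (simplest model)."""
    return mean(waits)


def rank_clusters(past_waits: Dict[str, List[float]]) -> List[str]:
    """Order cluster names from shortest to longest predicted wait."""
    return sorted(past_waits, key=lambda name: predict_wait(past_waits[name]))


print(rank_clusters(history))
```

A matchmaker call-out could request a ranking like this and use it as one input when the user asks to minimize predicted response time.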

Future Plans and Work

• Complete MPS work and integrate campus clusters with TACC clusters
  – First, just ‘upload’ larger jobs
  – Later, share jobs among spokes
• Integrate Maverick as remote visualization resource into UT Grid
  – Overlapping software stack with PCs
  – Remote vis software downloads (incl. file transfer)
  – Vis portal
• Integrate campus data collections into UT Grid
  – Hosted collections in DBs
  – WebSphere Information Integrator?
• Prepare NSF proposal?
