Grid Computing 7700, Fall 2005
Lecture 2: About Grid Computing
Gabrielle Allen ([email protected])
http://www.cct.lsu.edu/~gallen/Teaching
Quick Test
1. What reason does Foster (2002) give that the Web is not a Grid?
2. Advances in which area have changed the way we should think about collaboration: a) sensors, b) supercomputers, c) mass storage, d) networks, e) HDTV?
3. What is GGF an acronym for?
4. What speed do gravitational waves travel at? a) speed of sound, b) speed of light, c) infinite speed, d) 103,457 km/s, e) they do not move
Some History

1843 US Congress investigates telegraph technology
1866 Transatlantic telegraph cable laid
1901 Transatlantic radio transmission
1965 Multics developers envisage utility computing
1969 Unix is developed
1970 ARPANET: DoD experimental WAN, precursor to the internet
1972 C written by Ritchie
1975 Microsoft founded
1980s Parallel computing: algorithms, programs and architectures
1980s "Grand Challenge" applications
1985 NSFNET: links supercomputing centers at 56 kbps
1988 Condor project starts (LAN based)
1989 "Metacomputing" term coined (CASA project)
1990 HTML developed by Tim Berners-Lee, first browsers
1991 Linus Torvalds works on Linux
1993 Mosaic browser released
1993 Legion project starts
1993 HPF specification released
1994 MPI-1 specification released
1994 Nimrod project starts (LAN based)
1994 First Beowulf cluster
1995 Dot-com era starts …
1995 Netscape goes public
1995 FAFNER: Factoring via Network-Enabled Recursion
1995 I-WAY (Information Wide Area Year) at SC95
1995 Globus project (ANL, UC, ISI) starts
1995 Java released by Sun
1997 Legion released
1997 UNICORE project starts
1997 Entropia founded
1998 Globus 1.0 released
1998 Legion commercialized via Applied Metacomputing (becomes Avaki in 2001)
1999 First Grid Forum
1999 SETI@home
1999 Napster: centralized file sharing
2000 Microsoft releases .NET
2000 Gnutella released: P2P file sharing
2001 "Anatomy of the Grid"
2001 NSF announces TeraGrid
2001 First Global Grid Forum
2001 Cactus, Globus, MPICH-G2 win Gordon Bell prize
2002 Earth Simulator: 40 TFlop NEC machine
2002 Globus 2.0 released
2002 "Physiology of the Grid"
2003 Globus 3.0 released
2003 10 Gbps transatlantic optical network demonstrated
2005 Globus 4.0 released
2005 TeraGrid awarded $150M
Fernando Corbato
- Designer of the Multics OS, a mainframe timesharing OS that led to UNIX
- In 1965 envisaged a computer facility "like a power company or water company"
J. C. R. Licklider
- Experimental psychologist
- Envisaged a "grid" for scientific research
- Contributed to the development of ARPANET
- In 1968 developed a vision of networked computers that would provide fast, automated support for human decision making
Len Kleinrock
Created the basic principles of packet switching, the technology underpinning the Internet, while a graduate student at MIT
His computer was the first node on the internet
Envisaged spread of computer utilities (1969)
“Grand Challenges”
Fundamental problems in science and engineering with broad economic and scientific impact. They are generally considered intractable without the use of state-of-the-art massively parallel computers
Used by funding agencies from the 80s onwards to motivate advances in science and high performance computing
Brought together distributed teams who started to collaborate around their machines, codes, data, etc
I-WAY: SC95
- High speed experimental distributed computing project
- Set up an ATM network connecting supercomputers, mass storage and advanced visualization devices at 17 US sites
- 30 software engineers, 60 applications, 10 networks (most OC-3c/155 Mbps)
- Application focused (remote visualization, metacomputing, collaboration)
- Single interface to schedule and start runs
- I-POP machines (17) coordinated I-WAY "virtual machines" and served as gateways to the I-WAY
- I-Soft software for management/programming
Aims of I-WAY
- Develop network-enabled tools and build collaborative environments on existing networks with differing protocols and properties
- Locating and accessing distributed resources
- Security and reliability
- Use of distributed resources for computation
- Uniform access to distributed data
- Coupling distributed resources
I-WAY Infrastructure
- I-POP: gateways to the I-WAY
  - Dedicated point-of-presence machines at each site
  - Uniformly configured with a standard software environment
  - Accessible from the internet, inside the firewall
  - ATM interface for monitoring/management of the ATM switch
- I-Soft: management and application programming environment
  - Ran on the I-POP machines
  - Provided uniform authentication, resource reservation, process creation and communication functions
  - CRB: Computational Resource Broker (central scheduler)
  - Security: Telnet client amended with Kerberos authentication and encryption
  - File system: AFS for a shared repository
  - Communication: Nexus adapted (MPICH, CAVEcomm)
From Ian Taylor
I-WAY New Concepts
- Point-of-presence machines at each site
- Computational resource broker integrates different local schedulers
- Uniform authentication environment and trust relationships between sites
- Network-aware parallel programming tools to provide a uniform view and optimize communications
- Led to Globus from ISI/ANL
Globus Toolkit® History

(Figure: downloads per month from ftp.globus.org, 1997-2002, growing from near zero to roughly 30,000 per month. Does not include downloads from NMI, UK eScience, EU DataGrid, IBM, Platform, etc. From the Globus Team.)

Milestones over the same period:
- DARPA, NSF, and DOE begin funding Grid work
- "The Grid: Blueprint for a New Computing Infrastructure" published
- NASA initiates the Information Power Grid; DOE increases support
- Globus Project wins the Global Information Infrastructure Award
- MPICH-G released
- GT 1.0.0 released; early application successes reported
- GT 1.1.1, 1.1.2 and 1.1.3 released
- NSF and the European Commission initiate many new Grid projects
- GT 1.1.4 and MPICH-G2 released
- "Anatomy of the Grid" paper released
- First EuroGlobus conference held in Lecce
- Significant commercial interest in Grids
- NSF GRIDS Center initiated; DOE begins the SciDAC program
- GT 2.0 beta released; "Physiology of the Grid" paper released
- GT 2.0 released; GT 2.2 released
Some Application Areas
- Life sciences: computational biology, bioinformatics, genomics; accessing, collecting and mining data; imaging
- Engineering: aircraft design, modeling and monitoring
- Data: high energy physics, astronomy
- Physical sciences: numerical relativity, materials science, geoscience
- Collaborations: sharing, real-time interactivity, visualization, communication
- Commercial: gaming, idle workstations, climate prediction, disaster response, cyber security, portals
- Education and distance learning
Some Application Types
- Minimal communication (embarrassingly parallel)
- Staged/linked/workflow
- Access to resources
- Fast throughput
- Large scale
- Adaptive
- Real-time on demand
- Speculative
We will read about these and new application scenarios later.
What are Grids?
- Grids provide "coordinated resource sharing and problem solving in dynamic, multi-institutional, virtual organizations"
- Grids link together people, computers, data, sensors, experimental equipment, visualization systems and networks (virtual organizations)
- For example, they can provide:
  - Sharing of computer resources
  - Pooling of information
  - Access to specialized equipment
  - Increased efficiency and on-demand computing
  - Support for distributed collaborations
- Need to think about hardware, software, applications and policies.
Grid Checklist
A Grid:
- Coordinates resources that are not subject to centralized control
- Uses standard, open, general-purpose protocols and interfaces
- Delivers non-trivial qualities of service
Ian Foster, "What is the Grid? A Three Point Checklist", 2002
Grid Resources

Computers
- Any networked CPU: supercomputers and clusters, workstations, home PCs, PDAs, telephones, game machines
- Very different properties: clock speed, memory, cache, FPUs, memory bandwidth, OS, software

Devices
- Sensors, telescopes, gravitational wave detectors, microscopes, synchrotrons, medical scanners, etc.

Data
- Belonging to a single user or shared across a VO
- Global distributed databases (e.g. NVO, Genome)
- Storage devices
- Security and access considerations

Visualization
- Servers, renderers, Access Grid (e.g. the CCT Imaginarium)

Networks
- High speed optical networks (e.g. NLR)
- Academic networks: Internet2
- Commercial network providers
- Wireless, Bluetooth, 3G, etc.
Characteristics
- Different heterogeneous resources from different organizations
- Mutually distrustful organizations
- Differing security requirements and policies
- Dynamic quality of service (machines, networks, etc.)
- Heterogeneous networks
- Capabilities: dynamic, adaptive, autonomic, discovery
Who Will Use The Grid?
- Computational scientists and engineers
- Experimental scientists
- Collaborations
- Educators
- Enterprises
- Governments
- Health authorities
Use cases should be driving Grid developments, so it is important to understand needs and translate them into requirements.
Computational Scientists and Engineers
- Numerical simulation, access to more and larger computing resources
- Easier, more efficient access to supercomputers
- Real-time visualization
- Computational steering
- Network-enabled solvers
- New scenarios
Experimental Scientists
- Hook up supercomputers with instruments (telescopes, microscopes, …)
- Advanced visualization and GUI interfaces
- Remote control of instruments
- Access to remote data
- Management and use of large distributed data repositories
Governments
- Disaster response
- National defense
- Long-term research and planning
- Collective power of the nation's fastest computers, data archives and intellect to solve problems
- Strategic computing reserve (environmental disaster, earthquake, homeland security)
- National collaboratory: complex scientific and engineering problems such as global environmental change or space station design
Virtual Organizations
- "A number of mutually distrustful participants with varying degrees of prior relationship (perhaps none at all) who want to share resources in order to perform some task." ("The Anatomy of the Grid")
- Sharing involves direct access to remote software, computers, data and other resources.
- Sharing relationships can vary over time, in the resources involved, in the nature of allowed access, and in the participants who get access.
- VOs span small corporate departments to large groups of people from different organizations around the world.
- For example:
  - This class
  - The LSU numerical relativity group and its collaborators
  - The astronomical community with access to virtual observatories
Virtual Organizations
(Figure: three organizations and two VOs. From "The Anatomy of the Grid".)
Virtual Organizations
- Vary in purpose, scope, size, duration, structure, community and sociology
- Common requirements:
  - Highly flexible sharing relationships (both client-server and peer-to-peer)
  - Sophisticated and precise levels of control over sharing
  - Delegation
  - Application of local and global policies
- Address QoS, scheduling, co-allocation, accounting, …
How Will They Use It?
Distributed supercomputing
- Aggregate computational resources for problems which cannot be solved on a single machine (e.g. all workstations in a company, all supercomputers in the world)
- Large problems needing extreme memory, CPU, or other resources
- E.g. astrophysics/numerical relativity: accurate simulations need fine-scale detail
- Challenges: latency, coscheduling, scalability, algorithms, performance
How Will They Use It?
High-throughput computing
- Large numbers of loosely coupled or independent tasks (e.g. leveraging unused cycles)
On-demand computing
- Short-term requirements for jobs which cannot be effectively or conveniently run locally
- Often driven by cost-performance concerns
- Challenges: dynamic requirements, large numbers of users and resources, security, payment
How Will They Use It?
Data-intensive computing
- Focus on generating new information from data in geographically distributed repositories, digital libraries and databases
- E.g. high energy physics experiments generate terabytes of data per day for widely distributed collaborators; digital sky surveys
- Challenges: scheduling and configuration of complex, high-volume data flows
How Will They Use It?
Collaborative computing
- Enabling human-human interaction, e.g. with shared resources such as data archives and simulations
- Often in terms of a virtual shared space, e.g. a CAVE environment
- Challenges: real-time requirements
E-Science
Global collaborations for scientific research
“large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet”
UK E-Science Program http://www.rcuk.ac.uk/escience/
Cyberinfrastructure
- Software to support E-Science
- "An infrastructure based on grids and on application-specific software, tools, and data repositories that support research in a particular discipline." (Getting Up To Speed: The Future of Supercomputing, 2001)
- The GridChem project at CCT is building a cyberinfrastructure for computational chemists
- The UCOMS project at CCT is building a cyberinfrastructure for geoscientists
- The SCOOP project at CCT is building a cyberinfrastructure for coastal modellers
- Looking for generic tools and techniques, driving research
Physicist has a new idea!
(Figure: a dynamic Grid scenario. A Brill wave simulation looks for a horizon and calculates/outputs gravitational waves and invariants; when a horizon is found, excision is tried out. The job finds the best resources and free CPUs across NCSA, SDSC, RZG and LRZ, adds more resources, clones jobs with a steered parameter, moves to a new machine when its queue time is over, archives data at SDSC, and archives results to the LIGO public database.)
New scenarios enabling new science.
Examples
High Performance Computing
A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software: programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors.
Numerical Relativity
- Black holes, neutron stars, supernovae, gravitational waves
- Governed by Einstein's equations: very complex, need to solve numerically
- 10 coupled mixed elliptic-hyperbolic PDEs, thousands of terms
- High-fidelity solutions need more research in numerics/physics … but also larger computers and better infrastructure
- Physics currently limited by information technology!
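For reference, the equations being solved can be written compactly (in geometrized units, G = c = 1) as Einstein's field equations,

G_{\mu\nu} \equiv R_{\mu\nu} - \frac{1}{2} R\, g_{\mu\nu} = 8\pi T_{\mu\nu},

whose 10 independent components (both sides are symmetric 4x4 tensors) are the 10 coupled PDEs referred to above once expanded in a particular coordinate gauge.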
Numerical Relativity
Good motivating example for Grid computing:
- Large, varied, distributed collaborations
- Need lots of cycles and storage (currently using teraflops, terabytes)
- Need to share results, codes, parameter files, …
- Need advanced visualization, steering
Parallelisation
- Finite difference method with "stencil width" 1
(Figure: grid points owned by Proc 0)
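As an illustration (not taken from the slides), the standard second-order centered difference for a second derivative has stencil width 1, because updating point i only needs its immediate neighbours:

\partial_x^2 u \big|_i \approx \frac{u_{i+1} - 2u_i + u_{i-1}}{\Delta x^2}

This locality is what makes the domain decomposition described next work with only a one-point-wide ghost zone.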
Parallelisation
- Split the data to be worked on across the processors you have available
- Each processor can then work on a different piece of data at the same time
(Figure: the grid split between Proc 0 and Proc 1)
Parallelisation
But there is a downside:
- Data needs to be exchanged between processors on most iterations, e.g. to "synchronize" ghost zones, perform global reductions, or do output
- MPI (also PVM, OpenMP, …)
(Figure: ghost-zone exchange between Proc 0 and Proc 1)
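A minimal sketch of this pattern in C with MPI (an illustration under assumed sizes and initial data, not the lecture's actual code): each process owns a block of a 1D array plus one ghost point on each side, exchanges ghost points with its neighbours every iteration, applies a width-1 stencil, and finishes the step with a global reduction.

/* Hypothetical sketch: 1D domain decomposition with stencil width 1.
 * Compile with an MPI compiler wrapper, e.g.: mpicc halo.c -o halo */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 100   /* interior points owned by each rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[NLOCAL + 2];            /* +2 ghost zones, one on each side */
    for (int i = 0; i < NLOCAL + 2; i++)
        u[i] = rank;                 /* arbitrary initial data */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int iter = 0; iter < 10; iter++) {
        /* "Synchronize": exchange ghost points with neighbours */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Width-1 stencil update on interior points (simple 3-point average) */
        double unew[NLOCAL + 2];
        for (int i = 1; i <= NLOCAL; i++)
            unew[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
        for (int i = 1; i <= NLOCAL; i++)
            u[i] = unew[i];

        /* "Global reduction": sum over all points on all processors */
        double local = 0.0, global = 0.0;
        for (int i = 1; i <= NLOCAL; i++)
            local += u[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("iter %d: global sum = %f\n", iter, global);
    }

    MPI_Finalize();
    return 0;
}

Run with, for example, mpirun -np 4 ./halo; MPI_PROC_NULL turns the exchanges at the physical domain boundaries into no-ops.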
Parallel IO
- In this example we just want to output fields from 2 processors, but it could be 2000
- Each processor could write its own data to disk
- Then the data usually has to be moved to one place and "recombined" to produce a single coherent file
(Figure: Proc 0 and Proc 1 each writing their own file)
Parallel IO
- Alternatively, processor 0 can gather data from the other processors and write it all to disk
- Usually a combination of these works best … let every nth processor gather data and write to disk
(Figure: Proc 1 sending its data to Proc 0, which writes to disk)
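A minimal sketch (again illustrative; the file name and sizes are made up) of the "processor 0 gathers and writes" variant using MPI_Gather:

/* Hypothetical sketch: rank 0 gathers each rank's local field and writes one file.
 * Compile with: mpicc gather_io.c -o gather_io */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 100   /* points owned by each rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[NLOCAL];
    for (int i = 0; i < NLOCAL; i++)
        local[i] = rank + i * 0.001;     /* arbitrary field data */

    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)size * NLOCAL * sizeof(double));

    /* Gather every rank's block onto rank 0, in rank order */
    MPI_Gather(local, NLOCAL, MPI_DOUBLE,
               global, NLOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Write a single coherent file instead of one file per processor */
        FILE *f = fopen("field.dat", "wb");
        fwrite(global, sizeof(double), (size_t)size * NLOCAL, f);
        fclose(f);
        free(global);
    }

    MPI_Finalize();
    return 0;
}

For large processor counts this serializes all I/O through one rank, which is why the slides suggest letting every nth processor gather and write, or using parallel I/O libraries instead.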
Large Scale Computing
- PARALLEL: Typical runs now need 45 GB of memory:
  - 171 grid functions
  - 400x400x200 grid
- OPTIMIZE: A typical run makes 3000 iterations with 6000 flops per grid point: 600 TeraFlops!
- PARALLEL IO/VIZ/DATA: Output of just one grid function at just one time step is 256 MB (320 GB for 10 grid functions every 50 time steps)
- CHECKPOINTING: One simulation takes longer than queue times: need 10-50 hours
- STEERING/MONITORING: Computing time is expensive
  - One simulation: 2500 to 12500 SUs
  - Need to make each simulation count
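A rough check of these figures, assuming 8-byte double-precision values per grid point:

171 \times 400 \times 400 \times 200 \times 8 \text{ bytes} \approx 43.8 \text{ GB}, consistent with the quoted ~45 GB
400 \times 400 \times 200 \times 8 \text{ bytes} = 256 \text{ MB} for one grid function at one time step
3000 \times 6000 \times (400 \times 400 \times 200) \approx 5.8 \times 10^{14} \text{ flops}, i.e. roughly 600 TeraFlops of total work for the run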
Numerical Relativity
Good motivating example for Grid computing:
- Large, varied, distributed collaborations
- Need lots of cycles and storage (currently using teraflops, terabytes)
- Need to share results, codes, parameter files, …
- Need advanced visualization, data management, steering
- Connection to experimental equipment (the LIGO gravitational wave detector) and its data
Numerical Relativity
How do computational physicists work now?
- Accounts on different machines: LSU, NCSA, NERSC, PSC, SDSC, LRZ, RZG, …
- Learn how to use each machine: compilers, filesystem, scheduler, MPI, policies, …
- Ssh to the machine, copy source code, compile, determine e.g. how much output the file system can hold, how big a run should be and the best queue to submit to, then submit a batch script
- Wait till the run starts, keep logging in to check whether it is still running and what is happening …
- Copy all data back to a local machine for visualization and analysis
- Email colleagues and explain what they saw
- Lose data, forget what they ran
- Publish paper
Physicist has a new idea!
(Figure: the same scenario shown earlier. A Brill wave simulation looks for a horizon, calculates/outputs gravitational waves and invariants, and tries out excision once a horizon is found; it finds the best resources and free CPUs at NCSA, SDSC, RZG and LRZ, adds more resources, clones jobs with a steered parameter, moves to a new machine when its queue time is over, archives data at SDSC, and archives results to the LIGO public database.)
New scenarios.
TeraGrid
TeraGrid: teragrid.org
"Cyber-infrastructure" constructed through the NSF TeraScale initiative:
- 2000: TeraScale Computing System (TCS-1) at PSC, resulting in a 6 TFLOPS computational resource.
- 2001: $53M funding. Distributed Terascale Facility (DTF), a 15 TFLOPS computational Grid composed of major resources at ANL, Caltech, NCSA, and SDSC. Exploits homogeneity at the microprocessor level, using Intel Itanium architecture (Itanium2 and its successor) clusters to maximally leverage software and integration efforts. Homogeneity offers the user community an initial set of large-scale resources with a high degree of compatibility, reducing the effort required to move into the computational Grid environment.
- 2002: $35M funding and PSC joins. Extensible TeraScale Facility (ETF) combines the TCS-1 and DTF resources into a single, 21+ TFLOPS Grid environment and supports extensibility to additional sites and heterogeneity.
- 2003: $10M and four new sites: ORNL, Purdue, Indiana, TACC. 40 TFLOPS and 2 PB.
- 2005: $150M to enhance and operate TeraGrid: http://www.teragrid.org/news/news05/0817.html
NSF TeraGrid
TeraGrid
- Production system (now part of NSF computer time allocations)
- Each site has a speciality:
  - NCSA: compute-intensive codes
  - ANL: visualization
  - SDSC: data-oriented computing
  - Caltech: scientific collections
TeraGrid: Objectives
Provide an unprecedented increase in the computational capabilities available to the open research community, both in terms of capacity and functionality.
Deploy a distributed “system” using Grid technologies rather than a “distributed computer” with centralized control, allowing the user community to map applications across the computational, storage, visualization, and other resources as an integrated environment.
Create an “enabling cyberinfrastructure” for scientific research in such a way that additional resources (at additional sites) can be readily integrated as well as providing a model that can be reused to create additional Grid systems that may or may not interoperate with TeraGrid (but are technically interoperable nonetheless).
TeraGrid: Design
- Resources at different sites are autonomously managed
  - E.g. different software locations, user names
  - Rationale: more scalable and workable
- Consistent set of fundamental grid services (Globus based)
- Now building higher-level services
http://www.teragrid.org/about/TeraGrid-Primer-Sept-02.pdf
Basic Grid Concepts
Grid Architecture
- Covered in last weekend's reading, "The Anatomy of the Grid" …
- Based on interoperability and extensibility => common (or standard) protocols, which define the mechanisms by which VOs negotiate, establish, manage and use shared resources
- From the protocols, define standard services, APIs and SDKs
Grid Architecture
Layers (top to bottom): Application, Collective, Resource, Connectivity, Fabric
- Application: tools, applications, portals
- Collective: resource scheduling, information providing, data management, systems such as MPICH-G, task farming, community authorization, accounting
- Resource and Connectivity: secure access to resources and services (communication, data transfer, security)
- Fabric: diverse resources (including local resource-specific operations)
Infrastructure
- Communication services
  - Transport and routing
  - Un/reliable point-to-point communications, multicast, …
  - Bulk-data transport, streaming data, …
  - Parameters: latency, bandwidth, reliability, fault tolerance, jitter
- Information services
  - Location and type of services change dynamically
  - Mechanisms for registering and obtaining information about resources, services, status, applications, networks, …
Infrastructure
- Naming services
  - Names for computers, services, applications, data, job ids
  - Uniform namespace across the complete environment
  - E.g. X.500 naming scheme (directory services), Domain Name Service (DNS)
- Data management and replication
  - Access to files distributed across many servers (e.g. data mining)
  - A distributed filesystem must provide a uniform global namespace
  - Support a range of file I/O protocols
  - Allow performance optimizations (e.g. caching)
Infrastructure
- Security and authorization
  - Single sign-on
  - Confidentiality
  - Authentication (determines a user's or server's identity)
  - Authorization (what a user etc. is allowed to do)
  - Delegation/restricted delegation (a program can run on the user's behalf, possibly with less authorization)
  - Integration with diverse resources under different administrations/security solutions (e.g. Kerberos, Unix, …)
  - Trust relationships
  - Support for communication/data protection
- Monitoring resources and applications
Infrastructure
- Resource management and scheduling
  - Efficient scheduling and deployment of applications across distributed machines
  - Management of resources and the applications running on them
  - The user just wants application submission
  - Cost/efficiency/application constraints/throughput
  - Coscheduling, advanced reservation, network/data storage reservation
  - Accounting
- User and administrative GUIs
  - Interfaces should be intuitive, easy to use, and heterogeneous
  - Typically web based (accessible from anywhere)
Reading for Next Lecture
The following is expected to be read by the next lecture:
- "The Anatomy of the Grid"
Coursework 1
Due Monday August 29th. NEW: now due Wednesday August 31st.
Essay: "What is Grid Computing"
- 5 pages (+ cover page)
- Explain what Grid Computing is, and how it differs from distributed computing, internet technologies and high performance computing
- Explain how Grid Computing could support and advance scientific research
- Explain the potential economic benefit of Grid Computing to the US economy
CCT Eminent Lecture
"Managing Information on the Net: the Digital Object Architecture"
Dr. Kahn will discuss an architectural approach to managing information on the net. In particular, he will focus on applications where the information may need to persist over very long periods of time and where it may be moved many times from site to site and platform to platform over its lifetime. An open architecture approach to federated repositories will also be discussed, along with applications of the technology.
Robert E. Kahn is Chairman, CEO and President of the Corporation for National Research Initiatives (CNRI), which he founded in 1986 after a thirteen year term at the U.S. Defense Advanced Research Projects Agency (DARPA). Dr. Kahn earned M.A. and Ph.D. degrees from Princeton University in 1962 and 1964 respectively. He worked on the Technical Staff at Bell Laboratories and then became an Assistant Professor of Electrical Engineering at MIT. He was responsible for the system design of the Arpanet, the first packet-switched network. In 1972 he moved to DARPA and subsequently became Director of DARPA's Information Processing Techniques Office (IPTO). He is a co-inventor of the TCP/IP protocols.