Grid Computing 7700, Fall 2005
Lecture 2: About Grid Computing
Gabrielle Allen ([email protected])
http://www.cct.lsu.edu/~gallen/Teaching
Quick Test
1. What reason does Foster (2002) give that the Web is not a Grid?
2. Advances in which area have changed the way we should think about collaboration: a) sensors, b) supercomputers, c) mass storage, d) networks, e) HDTV?
3. What is GGF an acronym for?
4. What speed do gravitational waves travel at? a) speed of sound, b) speed of light, c) infinite speed, d) 103,457 km/s, e) they do not move
Some History

1843 US Congress investigates telegraph technology
1866 Transatlantic telegraph cable laid
1901 Transatlantic radio transmission
1965 Multics developers envisage utility computing
1969 Unix is developed
1970 ARPANET: DoD experimental WAN, precursor to the internet
1972 C written by Ritchie
1975 Microsoft founded
1980s Parallel computing: algorithms, programs and architectures
1980s "Grand Challenge" applications
1985 NSFNET: links supercomputing centers at 56 kbps
1988 Condor project starts (LAN based)
1989 "Metacomputing" term coined (CASA project)
1990 HTML developed by Tim Berners-Lee, first browsers
1991 Linus Torvalds works on Linux
1993 Mosaic browser released
1993 Legion project starts
1993 HPF specification released
1994 MPI-1 specification released
1994 Nimrod project starts (LAN based)
1994 First Beowulf cluster
1995 Dot-com era starts …
1995 Netscape goes public
1995 FAFNER: Factoring via Network-Enabled Recursion
1995 I-WAY (Information Wide Area Year) at SC95
1995 Globus project (ANL, UC, ISI) starts
1995 Java released by Sun
1997 Legion released
1997 UNICORE project starts
1997 Entropia founded
1998 Globus 1.0 released
1998 Legion commercialized via Applied Metacomputing (becomes Avaki in 2001)
1999 First Grid Forum
1999 SETI@home
1999 Napster: centralized file sharing
2000 Microsoft releases .NET
2000 Gnutella released: P2P file sharing
2001 "Anatomy of the Grid"
2001 NSF announces TeraGrid
2001 First Global Grid Forum
2001 Cactus, Globus, MPICH-G2 win Gordon Bell prize
2002 Earth Simulator: 40 TFlop NEC machine
2002 Globus 2.0 released
2002 "Physiology of the Grid"
2003 Globus 3.0 released
2003 10 Gbps transatlantic optical network demonstrated
2005 Globus 4.0 released
2005 TeraGrid awarded $150M
Fernando Corbato
- Designer of the Multics OS, a mainframe timesharing OS that led to UNIX
- In 1965 envisaged a computer facility "like a power company or water company"
J. C. R. Licklider
- Experimental psychologist
- Envisaged a "grid" for scientific research
- Contributed to the development of ARPANET
- In 1968 developed a vision of networked computers that would provide fast, automated support for human decision making
Len Kleinrock
Created the basic principles of packet switching, the technology underpinning the Internet, while a graduate student at MIT
His computer was the first node on the internet
Envisaged spread of computer utilities (1969)
“Grand Challenges”
Fundamental problems in science and engineering with broad economic and scientific impact. They are generally considered intractable without the use of state-of-the-art massively parallel computers
Used by funding agencies from the 80s onwards to motivate advances in science and high performance computing
Brought together distributed teams who started to collaborate around their machines, codes, data, etc
I-WAY: SC95
- High speed experimental distributed computing project
- Set up an ATM network connecting supercomputers, mass storage and advanced visualization devices at 17 US sites
- 30 software engineers, 60 applications, 10 networks (most OC-3c/155 Mbps)
- Application focused (remote visualization, metacomputing, collaboration)
- Single interface to schedule and start runs
- I-POP machines (17) coordinated I-WAY "virtual machines" and served as gateways to the I-WAY
- I-Soft software for management/programming
Aims of I-WAY
- Develop network-enabled tools and build collaborative environments on existing networks with differing protocols and properties
- Locating and accessing distributed resources
- Security and reliability
- Use of distributed resources for computation
- Uniform access to distributed data
- Coupling distributed resources
I-WAY Infrastructure
- I-POP: gateways to the I-WAY
  - Dedicated point-of-presence machines at each site
  - Uniformly configured with a standard software environment
  - Accessible from the internet, inside the firewall
  - ATM interface for monitoring/management of the ATM switch
- I-Soft: management and application programming environment
  - Ran on the I-POP machines
  - Provided uniform authentication, resource reservation, process creation and communication functions
  - CRB: Computational Resource Broker (central scheduler)
  - Security: Telnet client amended with Kerberos authentication and encryption
  - File system: AFS for a shared repository
  - Communication: Nexus adapted (MPICH, CAVEcomm)
From Ian Taylor
I-WAY New Concepts
- Point-of-presence machines at each site
- Computational resource broker integrates different local schedulers
- Uniform authentication environment and trust relationships between sites
- Network-aware parallel programming tools to provide a uniform view and optimize communications
- Led to Globus from ISI/ANL
Globus Toolkit® History

(Figure: downloads per month from ftp.globus.org, 1997-2002, growing from near zero to roughly 30,000 per month. Does not include downloads from NMI, UK eScience, EU DataGrid, IBM, Platform, etc. From the Globus Team.)

Milestones over the same period:
- DARPA, NSF, and DOE begin funding Grid work
- "The Grid: Blueprint for a New Computing Infrastructure" published
- NASA initiates the Information Power Grid; DOE increases support
- Globus Project wins the Global Information Infrastructure Award
- MPICH-G released
- GT 1.0.0 released; early application successes reported
- GT 1.1.1, 1.1.2 and 1.1.3 released
- NSF and the European Commission initiate many new Grid projects
- GT 1.1.4 and MPICH-G2 released
- "Anatomy of the Grid" paper released
- First EuroGlobus conference held in Lecce
- Significant commercial interest in Grids
- NSF GRIDS Center initiated; DOE begins the SciDAC program
- GT 2.0 beta released; "Physiology of the Grid" paper released
- GT 2.0 released; GT 2.2 released
Some Application Areas
- Life sciences: computational biology, bioinformatics, genomics; accessing, collecting and mining data; imaging
- Engineering: aircraft design, modeling and monitoring
- Data: high energy physics, astronomy
- Physical sciences: numerical relativity, materials science, geoscience
- Collaborations: sharing, real-time interactivity, visualization, communication
- Commercial: gaming, idle workstations, climate prediction, disaster response, cyber security, portals
- Education and distance learning
Some Application Types
- Minimal communication (embarrassingly parallel)
- Staged/linked/workflow
- Access to resources
- Fast throughput
- Large scale
- Adaptive
- Real-time on demand
- Speculative
We will read about these and new application scenarios later.
What are Grids?
- Grids provide "coordinated resource sharing and problem solving in dynamic, multi-institutional, virtual organizations"
- Grids link together people, computers, data, sensors, experimental equipment, visualization systems and networks (virtual organizations)
- For example, they can provide:
  - Sharing of computer resources
  - Pooling of information
  - Access to specialized equipment
  - Increased efficiency and on-demand computing
  - Support for distributed collaborations
- Need to think about hardware, software, applications and policies.
Grid Checklist
A Grid:
- Coordinates resources that are not subject to centralized control
- Uses standard, open, general-purpose protocols and interfaces
- Delivers non-trivial qualities of service
Ian Foster, "What is the Grid? A Three Point Checklist", 2002
Grid Resources

Computers
- Any networked CPU: supercomputers and clusters, workstations, home PCs, PDAs, telephones, game machines
- Very different properties: clock speed, memory, cache, FPUs, memory bandwidth, OS, software

Devices
- Sensors, telescopes, gravitational wave detectors, microscopes, synchrotrons, medical scanners, etc.

Data
- Belonging to a single user or shared across a VO
- Global distributed databases (e.g. NVO, Genome)
- Storage devices
- Security and access considerations

Visualization
- Servers, renderers, Access Grid (e.g. the CCT Imaginarium)

Networks
- High speed optical networks (e.g. NLR)
- Academic networks: Internet2
- Commercial network providers
- Wireless, Bluetooth, 3G, etc.
Characteristics
- Different heterogeneous resources from different organizations
- Mutually distrustful organizations
- Differing security requirements and policies
- Dynamic quality of service (machines, networks, etc.)
- Heterogeneous networks
- Capabilities: dynamic, adaptive, autonomic, discovery
Who Will Use The Grid?
- Computational scientists and engineers
- Experimental scientists
- Collaborations
- Educators
- Enterprises
- Governments
- Health authorities
Use cases should be driving Grid developments, so it is important to understand needs and translate them into requirements.
Computational Scientists and Engineers
- Numerical simulation, access to more and larger computing resources
- Easier, more efficient access to supercomputers
- Real-time visualization
- Computational steering
- Network-enabled solvers
- New scenarios
Experimental Scientists
- Hook up supercomputers with instruments (telescopes, microscopes, …)
- Advanced visualization and GUI interfaces
- Remote control of instruments
- Access to remote data
- Management and use of large distributed data repositories
Governments
- Disaster response
- National defense
- Long-term research and planning
- Collective power of the nation's fastest computers, data archives and intellect to solve problems
- Strategic computing reserve (environmental disaster, earthquake, homeland security)
- National collaboratory: complex scientific and engineering problems such as global environmental change or space station design
Virtual Organizations
- "A number of mutually distrustful participants with varying degrees of prior relationship (perhaps none at all) who want to share resources in order to perform some task." ("The Anatomy of the Grid")
- Sharing involves direct access to remote software, computers, data and other resources.
- Sharing relationships can vary over time, in the resources involved, in the nature of allowed access, and in the participants who get access.
- VOs span small corporate departments to large groups of people from different organizations around the world.
- For example:
  - This class
  - The LSU numerical relativity group and its collaborators
  - The astronomical community with access to virtual observatories
Virtual Organizations
(Figure: three organizations and two VOs. From "The Anatomy of the Grid".)
Virtual Organizations
- Vary in purpose, scope, size, duration, structure, community and sociology
- Common requirements:
  - Highly flexible sharing relationships (both client-server and peer-to-peer)
  - Sophisticated and precise levels of control over sharing
  - Delegation
  - Application of local and global policies
- Address QoS, scheduling, co-allocation, accounting, …
How Will They Use It?
Distributed supercomputing
- Aggregate computational resources for problems which cannot be solved on a single machine (e.g. all workstations in a company, all supercomputers in the world)
- Large problems needing extreme memory, CPU, or other resources
- E.g. astrophysics/numerical relativity: accurate simulations need fine-scale detail
- Challenges: latency, coscheduling, scalability, algorithms, performance
How Will They Use It?
High-throughput computing
- Large numbers of loosely coupled or independent tasks (e.g. leveraging unused cycles)
On-demand computing
- Short-term requirements for jobs which cannot be effectively or conveniently run locally
- Often driven by cost-performance concerns
- Challenges: dynamic requirements, large numbers of users and resources, security, payment
How Will They Use It?
Data-intensive computing
- Focus on generating new information from data in geographically distributed repositories, digital libraries and databases
- E.g. high energy physics experiments generate terabytes of data per day for widely distributed collaborators; digital sky surveys
- Challenges: scheduling and configuration of complex, high-volume data flows
How Will They Use It?
Collaborative computing
- Enabling human-human interaction, e.g. with shared resources such as data archives and simulations
- Often in terms of a virtual shared space, e.g. a CAVE environment
- Challenges: real-time requirements
E-Science
Global collaborations for scientific research
“large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet”
UK E-Science Program http://www.rcuk.ac.uk/escience/
Cyberinfrastructure
- Software to support E-Science
- "An infrastructure based on grids and on application-specific software, tools, and data repositories that support research in a particular discipline." (Getting Up To Speed: The Future of Supercomputing, 2001)
- The GridChem project at CCT is building a cyberinfrastructure for computational chemists
- The UCOMS project at CCT is building a cyberinfrastructure for geoscientists
- The SCOOP project at CCT is building a cyberinfrastructure for coastal modellers
- Looking for generic tools and techniques, driving research
Physicist has a new idea!
(Figure: a dynamic Grid scenario. A Brill wave simulation looks for a horizon and calculates/outputs gravitational waves and invariants; when a horizon is found, excision is tried out. The job finds the best resources and free CPUs across NCSA, SDSC, RZG and LRZ, adds more resources, clones jobs with a steered parameter, moves to a new machine when its queue time is over, archives data at SDSC, and archives results to the LIGO public database.)
New scenarios enabling new science.
Examples
High Performance Computing
A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software: programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors.
Numerical Relativity
- Black holes, neutron stars, supernovae, gravitational waves
- Governed by Einstein's equations: very complex, need to solve numerically
- 10 coupled mixed elliptic-hyperbolic PDEs, thousands of terms
- High-fidelity solutions need more research in numerics/physics … but also larger computers and better infrastructure
- Physics currently limited by information technology!
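For reference, the equations being solved can be written compactly (in geometrized units, G = c = 1) as Einstein's field equations,

G_{\mu\nu} \equiv R_{\mu\nu} - \frac{1}{2} R\, g_{\mu\nu} = 8\pi T_{\mu\nu},

whose 10 independent components (both sides are symmetric 4x4 tensors) are the 10 coupled PDEs referred to above once expanded in a particular coordinate gauge.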
Numerical Relativity
Good motivating example for Grid computing:
- Large, varied, distributed collaborations
- Need lots of cycles and storage (currently using teraflops, terabytes)
- Need to share results, codes, parameter files, …
- Need advanced visualization, steering
Parallelisation
- Finite difference method with "stencil width" 1
(Figure: grid points owned by Proc 0)
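As an illustration (not taken from the slides), the standard second-order centered difference for a second derivative has stencil width 1, because updating point i only needs its immediate neighbours:

\partial_x^2 u \big|_i \approx \frac{u_{i+1} - 2u_i + u_{i-1}}{\Delta x^2}

This locality is what makes the domain decomposition described next work with only a one-point-wide ghost zone.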
Parallelisation
- Split the data to be worked on across the processors you have available
- Each processor can then work on a different piece of data at the same time
(Figure: the grid split between Proc 0 and Proc 1)
Parallelisation
But there is a downside:
- Data needs to be exchanged between processors on most iterations, e.g. to "synchronize" ghost zones, perform global reductions, or do output
- MPI (also PVM, OpenMP, …)
(Figure: ghost-zone exchange between Proc 0 and Proc 1)
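A minimal sketch of this pattern in C with MPI (an illustration under assumed sizes and initial data, not the lecture's actual code): each process owns a block of a 1D array plus one ghost point on each side, exchanges ghost points with its neighbours every iteration, applies a width-1 stencil, and finishes the step with a global reduction.

/* Hypothetical sketch: 1D domain decomposition with stencil width 1.
 * Compile with an MPI compiler wrapper, e.g.: mpicc halo.c -o halo */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 100   /* interior points owned by each rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[NLOCAL + 2];            /* +2 ghost zones, one on each side */
    for (int i = 0; i < NLOCAL + 2; i++)
        u[i] = rank;                 /* arbitrary initial data */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int iter = 0; iter < 10; iter++) {
        /* "Synchronize": exchange ghost points with neighbours */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Width-1 stencil update on interior points (simple 3-point average) */
        double unew[NLOCAL + 2];
        for (int i = 1; i <= NLOCAL; i++)
            unew[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
        for (int i = 1; i <= NLOCAL; i++)
            u[i] = unew[i];

        /* "Global reduction": sum over all points on all processors */
        double local = 0.0, global = 0.0;
        for (int i = 1; i <= NLOCAL; i++)
            local += u[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("iter %d: global sum = %f\n", iter, global);
    }

    MPI_Finalize();
    return 0;
}

Run with, for example, mpirun -np 4 ./halo; MPI_PROC_NULL turns the exchanges at the physical domain boundaries into no-ops.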
Parallel IO
- In this example we just want to output fields from 2 processors, but it could be 2000
- Each processor could write its own data to disk
- Then the data usually has to be moved to one place and "recombined" to produce a single coherent file
(Figure: Proc 0 and Proc 1 each writing their own file)
Parallel IO
- Alternatively, processor 0 can gather data from the other processors and write it all to disk
- Usually a combination of these works best … let every nth processor gather data and write to disk
(Figure: Proc 1 sending its data to Proc 0, which writes to disk)
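A minimal sketch (again illustrative; the file name and sizes are made up) of the "processor 0 gathers and writes" variant using MPI_Gather:

/* Hypothetical sketch: rank 0 gathers each rank's local field and writes one file.
 * Compile with: mpicc gather_io.c -o gather_io */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 100   /* points owned by each rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[NLOCAL];
    for (int i = 0; i < NLOCAL; i++)
        local[i] = rank + i * 0.001;     /* arbitrary field data */

    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)size * NLOCAL * sizeof(double));

    /* Gather every rank's block onto rank 0, in rank order */
    MPI_Gather(local, NLOCAL, MPI_DOUBLE,
               global, NLOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Write a single coherent file instead of one file per processor */
        FILE *f = fopen("field.dat", "wb");
        fwrite(global, sizeof(double), (size_t)size * NLOCAL, f);
        fclose(f);
        free(global);
    }

    MPI_Finalize();
    return 0;
}

For large processor counts this serializes all I/O through one rank, which is why the slides suggest letting every nth processor gather and write, or using parallel I/O libraries instead.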
Large Scale Computing
- PARALLEL: Typical runs now need 45 GB of memory:
  - 171 grid functions
  - 400x400x200 grid
- OPTIMIZE: A typical run makes 3000 iterations with 6000 flops per grid point: 600 TeraFlops!
- PARALLEL IO/VIZ/DATA: Output of just one grid function at just one time step is 256 MB (320 GB for 10 grid functions every 50 time steps)
- CHECKPOINTING: One simulation takes longer than queue times: need 10-50 hours
- STEERING/MONITORING: Computing time is expensive
  - One simulation: 2500 to 12500 SUs
  - Need to make each simulation count
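A rough check of these figures, assuming 8-byte double-precision values per grid point:

171 \times 400 \times 400 \times 200 \times 8 \text{ bytes} \approx 43.8 \text{ GB}, consistent with the quoted ~45 GB
400 \times 400 \times 200 \times 8 \text{ bytes} = 256 \text{ MB} for one grid function at one time step
3000 \times 6000 \times (400 \times 400 \times 200) \approx 5.8 \times 10^{14} \text{ flops}, i.e. roughly 600 TeraFlops of total work for the run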
Numerical Relativity
Good motivating example for Grid computing:
- Large, varied, distributed collaborations
- Need lots of cycles and storage (currently using teraflops, terabytes)
- Need to share results, codes, parameter files, …
- Need advanced visualization, data management, steering
- Connection to experimental equipment (the LIGO gravitational wave detector) and its data
Numerical Relativity
How do computational physicists work now?
- Accounts on different machines: LSU, NCSA, NERSC, PSC, SDSC, LRZ, RZG, …
- Learn how to use each machine: compilers, filesystem, scheduler, MPI, policies, …
- Ssh to the machine, copy source code, compile, determine e.g. how much output the file system can hold, how big a run should be and the best queue to submit to, then submit a batch script
- Wait till the run starts, keep logging in to check whether it is still running and what is happening …
- Copy all data back to a local machine for visualization and analysis
- Email colleagues and explain what they saw
- Lose data, forget what they ran
- Publish paper
Physicist has a new idea!
(Figure: the same scenario shown earlier. A Brill wave simulation looks for a horizon, calculates/outputs gravitational waves and invariants, and tries out excision once a horizon is found; it finds the best resources and free CPUs at NCSA, SDSC, RZG and LRZ, adds more resources, clones jobs with a steered parameter, moves to a new machine when its queue time is over, archives data at SDSC, and archives results to the LIGO public database.)
New scenarios.
TeraGrid
TeraGrid: teragrid.org
"Cyber-infrastructure" constructed through the NSF TeraScale initiative:
- 2000: TeraScale Computing System (TCS-1) at PSC, resulting in a 6 TFLOPS computational resource.
- 2001: $53M funding. Distributed Terascale Facility (DTF), a 15 TFLOPS computational Grid composed of major resources at ANL, Caltech, NCSA, and SDSC. Exploits homogeneity at the microprocessor level, using Intel Itanium architecture (Itanium2 and its successor) clusters to maximally leverage software and integration efforts. Homogeneity offers the user community an initial set of large-scale resources with a high degree of compatibility, reducing the effort required to move into the computational Grid environment.
- 2002: $35M funding and PSC joins. Extensible TeraScale Facility (ETF) combines the TCS-1 and DTF resources into a single, 21+ TFLOPS Grid environment and supports extensibility to additional sites and heterogeneity.
- 2003: $10M and four new sites: ORNL, Purdue, Indiana, TACC. 40 TFLOPS and 2 PB.
- 2005: $150M to enhance and operate TeraGrid: http://www.teragrid.org/news/news05/0817.html
NSF TeraGrid
TeraGrid
- Production system (now part of NSF computer time allocations)
- Each site has a speciality:
  - NCSA: compute-intensive codes
  - ANL: visualization
  - SDSC: data-oriented computing
  - Caltech: scientific collections
TeraGrid: Objectives
Provide an unprecedented increase in the computational capabilities available to the open research community, both in terms of capacity and functionality.
Deploy a distributed “system” using Grid technologies rather than a “distributed computer” with centralized control, allowing the user community to map applications across the computational, storage, visualization, and other resources as an integrated environment.
Create an “enabling cyberinfrastructure” for scientific research in such a way that additional resources (at additional sites) can be readily integrated as well as providing a model that can be reused to create additional Grid systems that may or may not interoperate with TeraGrid (but are technically interoperable nonetheless).
TeraGrid: Design
- Resources at different sites are autonomously managed
  - E.g. different software locations, user names
  - Rationale: more scalable and workable
- Consistent set of fundamental grid services (Globus based)
- Now building higher-level services
http://www.teragrid.org/about/TeraGrid-Primer-Sept-02.pdf
Basic Grid Concepts
Grid Architecture
- Covered in last weekend's reading, "The Anatomy of the Grid" …
- Based on interoperability and extensibility => common (or standard) protocols, which define the mechanisms by which VOs negotiate, establish, manage and use shared resources
- From the protocols, define standard services, APIs and SDKs
Grid Architecture
Layers (top to bottom): Application, Collective, Resource, Connectivity, Fabric
- Application: tools, applications, portals
- Collective: resource scheduling, information providing, data management, systems such as MPICH-G, task farming, community authorization, accounting
- Resource and Connectivity: secure access to resources and services (communication, data transfer, security)
- Fabric: diverse resources (including local resource-specific operations)
Infrastructure
- Communication services
  - Transport and routing
  - Un/reliable point-to-point communications, multicast, …
  - Bulk-data transport, streaming data, …
  - Parameters: latency, bandwidth, reliability, fault tolerance, jitter
- Information services
  - Location and type of services change dynamically
  - Mechanisms for registering and obtaining information about resources, services, status, applications, networks, …
Infrastructure
- Naming services
  - Names for computers, services, applications, data, job ids
  - Uniform namespace across the complete environment
  - E.g. X.500 naming scheme (directory services), Domain Name Service (DNS)
- Data management and replication
  - Access to files distributed across many servers (e.g. data mining)
  - A distributed filesystem must provide a uniform global namespace
  - Support a range of file I/O protocols
  - Allow performance optimizations (e.g. caching)
Infrastructure
- Security and authorization
  - Single sign-on
  - Confidentiality
  - Authentication (determines a user's or server's identity)
  - Authorization (what a user etc. is allowed to do)
  - Delegation/restricted delegation (a program can run on the user's behalf, possibly with less authorization)
  - Integration with diverse resources under different administrations/security solutions (e.g. Kerberos, Unix, …)
  - Trust relationships
  - Support for communication/data protection
- Monitoring resources and applications
Infrastructure
- Resource management and scheduling
  - Efficient scheduling and deployment of applications across distributed machines
  - Management of resources and the applications running on them
  - The user just wants application submission
  - Cost/efficiency/application constraints/throughput
  - Coscheduling, advanced reservation, network/data storage reservation
  - Accounting
- User and administrative GUIs
  - Interfaces should be intuitive, easy to use, and heterogeneous
  - Typically web based (accessible from anywhere)
Reading for Next Lecture
The following is expected to be read by the next lecture:
- "The Anatomy of the Grid"
Coursework 1
Due Monday August 29th. NEW: now due Wednesday August 31st.
Essay: "What is Grid Computing"
- 5 pages (+ cover page)
- Explain what Grid Computing is, and how it differs from distributed computing, internet technologies and high performance computing
- Explain how Grid Computing could support and advance scientific research
- Explain the potential economic benefit of Grid Computing to the US economy
CCT Eminent Lecture
"Managing Information on the Net: the Digital Object Architecture"
Dr. Kahn will discuss an architectural approach to managing information on the net. In particular, he will focus on applications where the information may need to persist over very long periods of time and where it may be moved many times from site to site and platform to platform over its lifetime. An open architecture approach to federated repositories will also be discussed, along with applications of the technology.
Robert E. Kahn is Chairman, CEO and President of the Corporation for National Research Initiatives (CNRI), which he founded in 1986 after a thirteen year term at the U.S. Defense Advanced Research Projects Agency (DARPA). Dr. Kahn earned M.A. and Ph.D. degrees from Princeton University in 1962 and 1964 respectively. He worked on the Technical Staff at Bell Laboratories and then became an Assistant Professor of Electrical Engineering at MIT. He was responsible for the system design of the Arpanet, the first packet-switched network. In 1972 he moved to DARPA and subsequently became Director of DARPA's Information Processing Techniques Office (IPTO). He is a co-inventor of the TCP/IP protocols.