
Scientific Grid Computing: The First Generation



The scientific user’s computing demands are becoming increasingly complex: a scientific application, for example, can comprise simulations, database access, and interactive visualization for data analysis. Each task could require computing resources that are distributed geographically and come from several administrative domains. Compared to operating on a single machine, effectively marshalling these distributed systems introduces extra complications because users must pay attention to data transfer, job submission, resource allocation, and authorization across the federated resources. Grid computing middleware (such as the Globus Toolkit) simplifies this problem by providing frameworks that abstract a task’s technical details, but much of this middleware is still under development and often demands a significant investment of time and effort to learn the required programming models and techniques.

It will be some time before we can write high-end scientific applications and code from scratch using grid-programming models and frameworks—for example, a scalable, high-performance molecular dynamics (MD) code that can work on a single problem distributed over a grid. As it stands, potentially extensive code refactoring and interfacing is required to convert legacy programs into grid applications. Consequently, most scientific computation on grids represents first-generation grid applications—typically, existing program code minimally modified to exploit grid technologies. The degree to which we can modify the code varies, but essentially, applications retain the same programming paradigms. MPICH-G2,1 for example, is an attempt to generalize the message-passing interface (MPI) over Globus grid resources, retaining the MPI interface but implementing it in a grid environment.

The RealityGrid project (www.RealityGrid.org) has adopted a fast-track approach to scientific applications on grids. Although we understand the importance of developing careful, rigorous, and well-analyzed approaches to grid-application programming, we believe that it is at least as important to begin using the available grid infrastructure for scientific applications with pre-existing code to get meaningful scientific results—the hope is to create both a technology push and an applications pull.


The scientific user’s computing demands are becoming increasingly complex and can benefit from distributed resources, but effectively marshalling these distributed systems often introduces new challenges. The authors describe how researchers can exploit existing distributed grid infrastructure to get meaningful scientific results.

JONATHAN CHIN, MATTHEW J. HARVEY, SHANTENU JHA, AND PETER V. COVENEY

University College London

1521-9615/05/$20.00 © 2005 IEEE

Copublished by the IEEE CS and the AIP


As a first step, the RealityGrid project (which contains various subprojects, including the TeraGyroid and Free Energy calculations projects) has developed a software infrastructure designed to enable existing high-end scientific application use on grids.2,3 In particular, we intend for this infrastructure to facilitate computational steering of existing code—letting the user influence the otherwise sequential simulation and analysis phases by merging and interleaving them.

Framework for Grid-Based Computational Steering
The RealityGrid steering library, from the application’s perspective, is its point of contact with

the grid world. In designing and implementing the steering library, we aimed to enable existing programs (often written in Fortran 90 or C and designed for multiprocessor supercomputers) to be rendered steerable with minimum effort. The steering library is implemented in C; has Fortran 90, Java, Python, and Perl bindings; and permits any parallelization technique (such as MPI or OpenMP), with the proviso that the application developer assumes responsibility for communicating any changes resulting from the steering activity to all processes.

Computational Steering API

Because the ultimate aim of grid computing is to let users transparently perform tasks across a distributed computational resource without needing to be aware of how that resource is changing with time, there is a critical need to shield the complex and changing underlying grid infrastructure from the end user. One way is with stable, general-purpose APIs that provide a grid-programming mechanism independent of underlying details—the message-passing interface (MPI) specification provides such an API for message passing in parallel programming environments. The RealityGrid steering library and its API attempt to provide such an interface for a wide spectrum of grid applications taken from a range of application domains.

To give a flavor of the RealityGrid Steering API, let’s walk through its application-side routines. Some of these library calls are self-explanatory, such as the routines required for startup and shutdown (Steering_initialize and Steering_finalize). (Details of these functions and all others are available elsewhere.1) The application uses the routine Register_params to register the list of parameters, both monitored (a steering client can read them) and steerable (a steering client can send them). The call has a cumulative effect: every successful call to the Register_params function adds the specified parameters to the list. Corresponding client-side operations can discover any registered parameters.
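As an illustration of the registration step, here is a minimal C sketch. The routine names are those given above, but the argument lists and parameter labels are hypothetical simplifications for readability, not the actual signatures, which are defined in the steering API report (see reference 1 of this sidebar).

/* Minimal sketch only: routine names come from the text above, but these
   simplified declarations are hypothetical stand-ins for the real API. */
int Steering_initialize(const char *app_name);                       /* hypothetical */
int Register_params(const char *label, void *value, int steerable);  /* hypothetical */
int Steering_finalize(void);                                         /* hypothetical */

static int    timestep    = 0;      /* monitored: a steering client may read it   */
static double temperature = 300.0;  /* steerable: a steering client may change it */

void setup_steering(void)
{
    Steering_initialize("LB3D");    /* application name is illustrative */

    /* Each successful call adds to the cumulative list of registered
       parameters; client-side operations can then discover them. */
    Register_params("timestep",    &timestep,    0 /* monitored only */);
    Register_params("temperature", &temperature, 1 /* steerable */);
}

void finish_steering(void)
{
    Steering_finalize();
}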

The next set of routines provides the ability to input and output data from the simulations. Because an application might require several different types of input and output, each type must be registered by a call to Register_IOTypes, which associates a label with each type. The application’s data output is initiated by a call to Emit_start, data is transferred by one or more calls to Emit_data_slice, and the operation is terminated with Emit_stop. These functions are analogous to fopen, fwrite, and fclose. Likewise, data input is performed via a series of calls analogous to fopen, fread, and fclose—in this case, Consume_start, Consume_data_slice, and Consume_stop. Each call to Consume_data_slice is preceded by a call to Consume_data_slice_header, which sets the type and amount of data the application is about to receive. The application is unaware of the data’s source or destination and provides only the label to the library. It’s the library’s responsibility to determine the data’s source or destination and transport mechanism.
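The input and output calls can be sketched in the same spirit; once more, only the routine names are taken from the text, and the declarations below are hypothetical simplifications rather than the real signatures.

/* Sketch only: hypothetical, simplified declarations for illustration. */
int Register_IOTypes(const char *label, int direction, int *iotype);
int Emit_start(int iotype, int seq_num, int *io_handle);
int Emit_data_slice(int io_handle, int data_type, int count, const void *data);
int Emit_stop(int *io_handle);
int Consume_start(int iotype, int *io_handle);
int Consume_data_slice_header(int io_handle, int *data_type, int *count);
int Consume_data_slice(int io_handle, int data_type, int count, void *buffer);
int Consume_stop(int *io_handle);

/* Write one block of data under a previously registered label; the library,
   not the application, decides where the labeled stream goes and how. */
void emit_sample(int sample_iotype, int step, const float *field, int nsites)
{
    int io;
    Emit_start(sample_iotype, step, &io);               /* like fopen  */
    Emit_data_slice(io, 1 /* float */, nsites, field);  /* like fwrite */
    Emit_stop(&io);                                     /* like fclose */
}

/* Read incoming data of the same labeled type, sized by the header call. */
void consume_sample(int sample_iotype, float *buffer)
{
    int io, type, count;
    Consume_start(sample_iotype, &io);
    Consume_data_slice_header(io, &type, &count);
    Consume_data_slice(io, type, count, buffer);
    Consume_stop(&io);
}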

The library uses emit and consume semantics because the application shouldn’t be aware of the data’s destination or source. Windback means reverting to the state captured in a previous checkpoint without stopping the application. In the RealityGrid framework, it’s the application’s responsibility to create a checkpoint file.2

The application gives the library an opportunity to handle events by making a call to Steering_control. One of the first important decisions an application developer typically encounters is how to determine a logical breakpoint in the code where the application is in a coherent state—that is, the monitored parameters are up to date, changes to the values of steerable parameters are possible, and a consistent representation of the system state can be captured in a checkpoint file. For most scientific codes, this should be straightforward; for example, in molecular dynamics code, it’s appropriate for the time-step loop to begin with a call to Steering_control. The library automatically reports the values of monitored parameters to the steering client, updates the values of any steerable parameters that have changed, and notifies the application of any additional operations it should perform. The application is required to process the additional commands (such as emit a checkpoint, emit a sample of a certain type, or stop).
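The time-step loop of an MD code might then look like the following sketch. The command codes and the Steering_control argument list are again hypothetical; the point is simply that the call sits at the top of the loop, where the code is in a coherent state, and that the application itself carries out whatever commands come back.

/* Sketch only: hypothetical command codes and a simplified signature. */
enum { CMD_STOP = 1, CMD_CHECKPOINT = 2, CMD_EMIT_SAMPLE = 3 };
#define MAX_CMDS 16

int Steering_control(int step, int *num_cmds, int *cmds);  /* hypothetical */

void write_checkpoint(int step);       /* application-provided */
void emit_current_sample(int step);    /* application-provided */
void advance_one_timestep(void);       /* the existing, unmodified physics */

void md_main_loop(int n_steps)
{
    int cmds[MAX_CMDS], num_cmds, i, step;

    for (step = 0; step < n_steps; step++) {
        /* Coherent point: monitored parameters are current, steerable ones
           may safely change, and a consistent checkpoint could be written. */
        Steering_control(step, &num_cmds, cmds);

        /* Reverse communication: the library only reports what to do; the
           application performs the work itself. */
        for (i = 0; i < num_cmds; i++) {
            switch (cmds[i]) {
            case CMD_CHECKPOINT:  write_checkpoint(step);    break;
            case CMD_EMIT_SAMPLE: emit_current_sample(step); break;
            case CMD_STOP:        return;
            }
        }
        advance_one_timestep();
    }
}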

References
1. S. Pickles et al., The RealityGrid Computational Steering API Version 1.1, RealityGrid tech. report, 2005; www.sve.man.ac.uk/Research/AtoZ/RealityGrid/Steering/ReG_Steering_API_v1.2b.pdf.
2. K.R. Mayes et al., “Towards Performance Control on the Grid,” to be published in Philosophical Trans. Royal Soc. London A, vol. 363, no. 1833, 2005; www.pubs.royalsoc.ac.uk/philtransa.shtml.


The steering library requires the application to register monitored and steered parameters. Similarly, the user can instruct the application to save its entire state to a checkpoint file, or restart from an existing one. (See the “Computational Steering API” sidebar for more details.) We use a system of reverse communication with the application—that is, the library notifies the application when emit, consume, checkpoint, and windback operations occur, thus delegating the task’s execution to the application.

Figure 1 shows a schematic representation of our computational steering architecture for the case in which a visualization component is connected to a simulation component.4 A scientist might steer one or both of these components via the steering client. To create a distributed steered simulation, we launch the components independently and attach and detach them dynamically. The service-oriented middle tier solves numerous issues that are problematic for a two-tier architecture, particularly the problem of routing messages to components that reside on resources behind a firewall.

By portraying the knobs (control) and dials (status) of computational steering as Web service operations, we can easily document the computational steering protocol in the Web Services Description Language, make it accessible in several other languages, and permit independent

development of steering clients. Moreover, these clients can be stand-alone, customizable, Web-based, or embedded in a modular visualization environment (such as AVS, www.avs.com; Amira, www.amiravis.com; or Iris Explorer, www.nag.co.uk/welcome_iec.asp). The mediating steering grid service (SGS) acts as a whiteboard: the client pushes control commands to—and pulls status information from—the SGS, while the application performs the converse operations. The SGS is stateful and transient (with state and lifetime reflecting the component being steered), and is a natural application for the service data constructs as well as the factory and lifetime management patterns provided by the Open Grid Services Infrastructure (OGSI).5 (Because the OGSI is now defunct, work is underway to transfer the SGS to the Web Services Resource Framework [WSRF].6) Steering clients can either use the client-side part of the steering API to communicate with the SGS or directly drive operations on the SGS using standard Web service methods.

TeraGyroid
Problems involving fluid flow are conventionally treated with the Navier-Stokes equations, a set of partial differential equations describing the evolution of a fluid’s pressure and velocity. These equations are notoriously nonlinear and difficult to solve analytically, so they’re usually discretized and solved numerically.

[Figure 1. Schematic architecture of an archetypal RealityGrid steering configuration.4 The components communicate by exchanging messages through intermediate grid services. The dotted arrows indicate the optional use of a third-party file transfer mechanism and the option of the visualizer sending messages directly to the simulation. The diagram shows simulation and visualization components (each with a steering library), steering grid services and a registry, multiple steering clients (Qt/C++, .NET on PocketPC, GridSphere portlet), data transfer via Globus-IO, and remote visualization through SGI VizServer, Chromium, or streaming to AccessGrid.]


This is nonetheless a continuum approach: the fluid is treated as an arbitrarily finely divisible material, with no information about the behavior of its constituent particles. Hence, it is insufficiently detailed for problems in which fluid molecules exhibit behavior more complicated than bulk flow.

Unfortunately, the MD approach of integrating Newton’s equations of motion for each molecule would be far too expensive for systems in which bulk flow remains important: a single cubic centimeter of water contains approximately 10²² molecules, which is 16 orders of magnitude larger than the largest MD simulations currently possible.
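A rough check of that figure, assuming pure water at a density of 1 g/cm³ and a molar mass of 18 g/mol:

N \approx \frac{1\,\mathrm{g}}{18\,\mathrm{g\,mol^{-1}}} \times 6.022 \times 10^{23}\,\mathrm{mol^{-1}} \approx 3 \times 10^{22} \text{ molecules per cubic centimeter.}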

The lattice-Boltzmann (LB) method7 is a member of a family of techniques called mesoscale methods, operating at a kinetic level of description—more coarse-grained than MD, but more finely grained than Navier-Stokes. The LB method, which we can view as a discretized form of the Boltzmann transport equation of kinetic theory,8 is a cellular automaton technique that can contain information about the overall distribution of molecular speeds and orientations without describing each individual molecule. LB is a convenient formulation of fluid dynamics: the algorithm is extremely simple and fast to implement; unlike conventional fluid-dynamics techniques, it doesn’t require that we solve the Poisson equation at each time step, nor does it require any complicated meshing. (All the simulations mentioned in this article were performed on regular cubic grids.) As such, the LB method is now competing with traditional computational fluid dynamics (CFD) methods for turbulent flow modeling; in fact, a company called Exa (www.exa.com) is now selling an LB-based system for aerodynamic modeling to the automotive industry. Moreover, the LB method’s cellular automaton nature makes it extremely amenable to implementation on parallel computing hardware.
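The article doesn’t specify which collision operator LB3D uses, but the flavor of the algorithm is captured by the standard single-relaxation-time (BGK) update from the LB literature,7,8 in which a discrete set of velocity distributions f_i streams along lattice vectors c_i and relaxes toward a local equilibrium:

f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\rho, \mathbf{u})\right].

Each lattice site updates from purely local information and its immediate neighbors, which is why the method needs no global Poisson solve and parallelizes so naturally.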

Amphiphile mesophases are a good example of fluid systems that we can’t treat with a continuum model, yet in which flow effects are still important. An amphiphile is a molecule composed of a water-loving “head” part and an oil-loving “tail” part—soap molecules are amphiphiles and are attracted to oil–water interfaces (each part of the molecule can sit in its preferred environment). If amphiphiles are dissolved in pure water, there is no oil to attract the tail groups, so the amphiphile molecules cluster together with the tail groups shielded from the water by the head groups. The geometry of these clusters depends on a range of parameters, such as concentration, temperature, or pH level. They can range from simple spherical micelle clusters to linear or branching wormlike micelles, irregular sponge phases, or highly regular liquid-crystal phases, which can consist of interpenetrating oil and water channels separated by a monolayer of amphiphile, or water channels separated by an amphiphile bilayer. Modeling or otherwise trying to understand these phases’ structure and behavior by considering individual molecules is extremely difficult, given that approximately 2,000 to 3,000 atoms can exist in a single micelle. Researchers have had some success9 in modeling amphiphile mesophases with free-energy techniques, but these studies yield little dynamical information, whereas mesoscale fluid algorithms such as LB are ideally suited to such problems.

The TeraGyroid project, a collaboration between RealityGrid and many other groups, is built on RealityGrid’s earlier smaller-scale simulations of the spontaneous self-assembly of a structure called a gyroid, from a mixture of oil, water, and amphiphile. This structure has the curious property that the amphiphile molecules sit on a “minimal surface” with zero mean curvature at all points. The gyroid is also closely related to structures found in living cells, and it has much potential to create novel materials and photonic crystals. Previous simulations showed that it was possible to model the self-assembly process using an LB technique, but the simulations quickly became unphysical due to finite-size effects. Consequently, the simulations tended to contain a single homogeneous region of the mesophase, in contrast to the multiple-grain structure we might expect in a real-world system. We predicted that at larger system sizes and over longer simulation runtimes, we could simulate the self-assembly of several spatially distinct domains: differently oriented gyroid structures would assemble and, where they met, defect regions would exist, analogous to domain walls in ferromagnets. To investigate these defect regions, we required simulations on grids of up to 1,024³ lattice points. The aim of the TeraGyroid project was to use grid computing techniques and resources to perform these very large-scale calculations, which would otherwise have been impossible.

TeraGyroid Technical Details
The principal problem the TeraGyroid project faced was the very large system sizes (as well as time scales) we had to simulate to reach the physical regime of interest.


This challenge demanded raw computing power as well as visualization, storage, and management facilities.

Memory and CPU
The simulation program, LB3D, required approximately 1 Kbyte of memory per lattice site. Hence, a system with a lattice of 128³ sites required roughly 2.2 Gbytes of total memory to run, a single system checkpoint required 2.2 Gbytes of disk space, and each block of visualization data, containing a floating-point number for each lattice site, took approximately 8 Mbytes. The largest system, at 1,024³, produced checkpoint files approaching a terabyte and emitted visualization blocks of 4 Gbytes. The massive simulation data sizes meant that distributed-memory multiprocessing was the only option, but it caused significant headaches because restarting the simulation from an on-disk checkpoint could take more than an hour.
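These figures can be roughly reproduced from the 1-Kbyte-per-site estimate, together with the assumption that each visualization value is a 4-byte single-precision number:

128^3 \times 1\,\mathrm{KB} \approx 2.1\,\mathrm{GB}, \qquad 1{,}024^3 \times 1\,\mathrm{KB} \approx 1.1\,\mathrm{TB},

128^3 \times 4\,\mathrm{B} \approx 8\,\mathrm{MB}, \qquad 1{,}024^3 \times 4\,\mathrm{B} \approx 4.3\,\mathrm{GB}.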

Visualization
At the outset of the project, a typical workstation could render up to 64³ data sets using either isosurfacing or volume rendering. A machine with a GeForce3 Ti200 graphics processor and an AMD Athlon XP1700 CPU can render a 64³ gyroid isosurface containing 296,606 triangles at approximately 5 frames per second (fps). Larger data sets would take several minutes of preprocessing and then require several seconds to render each frame, resulting in a jerky display that makes user interaction and region-of-interest location rather difficult. Furthermore, data sets of 256³ sites or larger sometimes simply wouldn’t fit into the available graphics memory on a single workstation. The project therefore required parallel rendering techniques.

TeraGyroid was a collaborative project involving many institutions and people in the UK and the US. All these people needed to be able to see what was happening during the simulation runs and to discuss the running simulations with each other. In addition, the project team wanted to demonstrate the steered simulations to an audience at the Supercomputing 2003 conference and to a worldwide audience at the SC Global teleconference. These requirements necessitated a “federated grid,” joining separate computational grids across the Atlantic Ocean to form a single environment in which scientists would have enough computational power, visualization ability, storage capacity, and collaboration facilities to perform the TeraGyroid simulations.

Computational Steering
Because of the nature of the calculations and the resource constraints imposed on them, we expected that the TeraGyroid project would be an ideal candidate for computational steering. But due to a limited theoretical understanding of the problem—particularly of its dynamic nature—we didn’t know beforehand how long the simulations would have to run before reaching a gyroid domain regime, how long that regime would last, or even which regions of the simulation parameter space would permit gyroids to form. We hoped that computational steering would let us monitor simulations as they ran so that we could terminate those that crashed or entered uninteresting regimes early on and adjust those that showed particularly interesting results to give them higher-resolution data output.

The resources allocated to the project consisted largely of time slots (ranging from six hours to three days) during which all (or a large part) of certain machines would become completely available to the project. In a conventionally operated batch-queuing system, researchers don’t have to worry about trying to fill up a whole machine with simultaneous tasks—a sufficiently well-designed queuing system will simply let other people’s tasks take up unused resources. This wasn’t the case for TeraGyroid. The combination of the somewhat idiosyncratic resource allocation and the steered jobs meant that, at any time, calculations could run out of time on one machine while processors on another machine (potentially on another continent) were freed up by a terminating job. This led to the desire for migratable simulations, which could move (or at least, be moved) from machine to machine to make the best use of the available processors.

Organizational and Practical Details
When steering a simulation, it became common practice to create a checkpoint file immediately before performing the steering action. This meant that if the steering caused the simulation to crash, it could be “rewound” back to its last known valid state and the calculation could continue from there. Although this approach makes simulation crashes much less of a hindrance when performing steered parameter searches, it tends to produce a large number of checkpoint files—keeping track of these can involve a lot of administrative work.

The project’s research team dealt with this administrative overhead by using the Checkpoint Tree Grid Service, developed by Manchester Computing (www.mcc.ac.uk). This service keeps a record of each generated checkpoint, along with which simulation it was generated from and which steering actions led to its generation.


Effectively, this means that researchers can replay any simulation from a checkpoint earlier in its life, removing the administrative burden from the scientist performing the steering.

Ensuring that all the required software ran smoothly on the required platforms required a significant amount of effort. A major problem with using a heterogeneous grid is that the location and invocation of compilers and libraries differ widely, even between machines of the same architecture. Environmental parameters, such as the location of temporary and permanent filespace, file retention policies, or executable paths, also vary widely. During the project, we dealt with these issues via ad hoc shell scripts, but this isn’t a satisfactory solution because of the amount of work it requires.

We formed the TeraGyroid testbed network by federating separate grids in the UK and US: each grid had to recognize users and certificates from other grids. During the project, we dealt with this challenge by communicating with individuals directly and by posting the certificate IDs that had to be recognized on a Wiki. Again, this isn’t a scalable long-term solution, and ideally, the issue should be dealt with via a third-party certificate management system. Given that grid computing requires transparent operation across multiple administrative domains, it’s actually questionable whether we can even apply the term grid to the TeraGyroid project. Certificate management and a public key infrastructure (PKI) are still difficult technical problems10 for the project, although we’re investigating new approaches to both.11,12

We also made much use of dual-homed systems, with multiple IP addresses on multiple networks, but this caused problems due to the tendency of authentication systems such as SSL to confuse host identity with IP address, requiring ugly workarounds. Most networking software assumes a homogeneous network and delegates routing control to much lower levels. This delegation makes it difficult, for example, for a client process running on one host to move files between two other hosts using a specific network, especially when we constructed a high-bandwidth network specifically for the transfer.

We also encountered problems when the computing and visualization nodes weren’t directly connected to the Internet; instead, they communicated through firewalls, which is a common situation on large clusters. We used workarounds such as port-forwarding (forwarding network connections from a machine outside the firewall to a machine inside) and process pinning (ensuring that a particular part of the simulation program always runs on a particular physical machine) during the project, but again, these aren’t good long-term solutions.

The full TeraGyroid simulation pipeline requires certain resources—such as simulation, visualization, and storage facilities—and AccessGrid virtual venues to be available simultaneously. Systems administrators usually handled this requirement by manually reserving resources, but the ideal solution would involve automated advance reservation and coallocation procedures.

Accurate Free Energies in Biomolecular Systems
Computing the free energies (FEs) of biomolecular systems is one of the most computationally demanding problems in molecular biology—and arguably, one of the most important. The requirement to compute FEs is widespread—from a need to understand a candidate drug’s interaction with

its target to understanding fundamental transport processes in nature. We can use several numerical approaches to compute an FE, so in addition to the scientific basis of determining which approach to use, we must decide which approach to adapt so as to effectively exploit grid capabilities. Here, we discuss two instances (from other projects that are part of the RealityGrid project) of using computational grids to compute FE differences in biological systems. Both examples can be naturally partitioned into several independent simulations, each requiring tightly coupled parallel resources; hence, they are amenable to a grid-based approach. Because we can distribute each simulation around the grid, it’s easy to use more high-performance computing (HPC) resources than are available in a single administrative domain at any given instant. As a result, we can dramatically reduce the time it takes to reach a solution.

Our solutions to both these problems involve using a traditional parallel MD application that we’ve adapted to take advantage of grid execution and frameworks.



They adapt the legacy code for a grid framework and, as a consequence of using the grid, employ a different workflow. Thus, in addition to the scientific interest, problems involving FE computation are, among other things, good candidates for determining a scientific grid computation’s effectiveness.

Understanding Protein–Peptide Binding
Gabriel Waksman and his colleagues’ elucidation of the crystal structure of the Src SH2 domain13

initiated a significant research effort directed toward understanding how these proteins bind and recognize specific peptide sequences. Measuring FE differences (ΔG) and understanding the thermodynamics of SH2-mediated binding is vital to understanding fundamental physical processes at play at the molecular level. Thermodynamic integration (TI) provides a formalism to compute—

in principle, exactly—the FE difference between two molecules A and B (ΔΔG_AB) as they bind to a given SH2 protein domain. The key concept in TI14 is that of a thermodynamic cycle—varying the value of a coupling parameter λ from 0 (peptide A) to 1 (peptide B). The computation of ΔΔG_AB can be transformed into one of computing the difference between two single ΔG values, which in turn are individually evaluated over a set of intermediate values of λ.
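In the standard TI formulation (not specific to any grid implementation), each ΔG is obtained by integrating the ensemble-averaged derivative of the λ-coupled Hamiltonian, and the thermodynamic cycle yields the double difference from the bound and unbound legs:

\Delta G = \int_0^1 \left\langle \frac{\partial H(\lambda)}{\partial \lambda} \right\rangle_{\lambda} \mathrm{d}\lambda, \qquad \Delta\Delta G_{AB} = \Delta G_{\mathrm{bound}} - \Delta G_{\mathrm{unbound}}.

In practice, the integral is approximated by quadrature over the set of intermediate λ values at which individual simulations are run, which is what makes the calculation naturally decomposable.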

The amount of computational resources needed to compute a single TI is usually very large. Although this varies depending on the number of atoms involved in the alchemical mutation and the size of both the bound and unbound systems, we believe the computational cost has been the primary reason for the limited adoption of this technique.15

We use a grid infrastructure similar to the one we discussed in the TeraGyroid section—both in terms of hardware and software details—to calculate a difference in binding FE via TI on a grid. Let’s outline the grid workflow to highlight how it differs from the standard application of this technique. Initially, a single simulation, usually at a low value of λ, is launched by the scientist on an HPC

resource. He or she monitors the simulation and assesses when to spawn a second simulation with the next value of λ based on a suitably determined convergence criterion. Depending on the exact details of the problem, this could vary from tracking a simple parameter to more complex data analysis. When the scientist decides to spawn a new simulation, a checkpoint is created and the new simulation with the next value of λ is started on another (or possibly the same) HPC resource. The original parent simulation continues for a suitable duration, accumulating data, and the scientist monitors the newly spawned simulation in exactly the same manner to assess when to spawn subsequent λ simulations. The scientist monitors and controls each of the simulations, the number of which is constrained by the resources available within the grid. By comparison, a regular TI performed using MD employs a serial workflow;16 each λ simulation runs to completion before the next is launched, often on the same machine. The acceleration over the conventional approach stems from the twin abilities of computational grids to supply increased resources and to allow simple and uniform interaction with multiple simulations, wherever they’re running.
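The spawning logic can be pictured with the following schematic C sketch. It is not RealityGrid code: the helper functions stand in for actions the scientist performs through the steering client, and the fixed set of λ windows is purely illustrative.

/* Schematic sketch of the steered spawning workflow; all helpers are
   hypothetical placeholders, not part of the RealityGrid API. */
#include <stdio.h>

#define N_LAMBDA 11    /* illustrative: lambda = 0.0, 0.1, ..., 1.0 */

int  launch_simulation(double lambda, const char *restart_checkpoint);
int  spawn_criterion_met(int job_id);   /* convergence test on monitored data */
void request_checkpoint(int job_id, char *path, int path_len);

void steered_ti_workflow(void)
{
    char checkpoint[256] = "";  /* first window starts from the equilibrated system */
    int  job_ids[N_LAMBDA];

    for (int i = 0; i < N_LAMBDA; i++) {
        double lambda = (double)i / (N_LAMBDA - 1);
        job_ids[i] = launch_simulation(lambda, checkpoint);

        /* The parent keeps running and accumulating data; we wait only for
           the spawning criterion before starting the next lambda window,
           possibly on a different HPC resource. */
        while (!spawn_criterion_met(job_ids[i]))
            ;  /* in practice: periodic monitoring through the steering client */

        request_checkpoint(job_ids[i], checkpoint, (int)sizeof checkpoint);
        printf("lambda = %.2f: next window spawned from %s\n", lambda, checkpoint);
    }
    /* All windows end up running concurrently, limited only by available
       resources; a serial TI workflow would run each to completion first. */
}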

Although running many simulations concurrently in different administrative domains without the use of grid technology is possible in principle, in practice it requires an enormous effort to cope with the heterogeneity of the different computers used. The aim of grid middleware is to shield the user from these complexities, leaving him or her free to interact with the simulations as if they were running locally.

We can greatly enhance a computational solution’s effectiveness and impact by reducing the time it takes to achieve a result to the same as (or less than) the time it takes to do a physical experiment. We’ve previously shown17 that using a modified workflow and computational grids makes the time to compute a binding energy comparable to the experimental time scale of two to three days. Admittedly, we achieved this for small systems; we’re looking into how efficient our approach is for larger systems and into the effect of the spawning criteria on the net computational cost. Given the scalability of the code we used, we have every reason to think the turnaround times for large models would be similar, provided the required number of tightly coupled HPC resources is available.

Computing Free Energy Profiles
To understand the mechanism of biomolecular translocation through a pore, researchers need the FE profiles that biomolecules such as mRNA and DNA encounter when inside protein nanopores, which, unlike the binding FE difference, isn’t just a single number.



The time scale for DNA translocation is on the order of tens of microseconds, so simulating such long time scales for large systems (about 300,000 atoms) is impossible with today’s standard MD approaches.

Jarzynski’s equation provides a powerful means of computing the equilibrium FE difference by applying nonequilibrium forces.18 A steered MD (SMD) simulation can be used to apply an external force, and, consequently, the physical time scale we can simulate increases. SMD simulations thus provide a natural case for applying Jarzynski’s equation to a molecular system, and hence, we refer to the combined approach as the SMD-JE approach. A single, detailed, long-running simulation over physical time scales of a few microseconds would currently require months to years on one large supercomputer to reach a solution. SMD-JE lets us decompose the problem into a large number (an ensemble) of simulations over a coarse-grained physical time scale with a limited loss of detail. Thus, multiple nonequilibrium SMD-JE simulations of several million time steps—the equivalent of a several-nanosecond equilibrium simulation in terms of computational requirements—can help us study processes at the microsecond time scale. The combined SMD-JE thus represents a novel algorithmic advance in computing FEs.19
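Jarzynski’s equality itself18 relates the equilibrium FE difference ΔF between two states to an exponential average of the nonequilibrium work W done in repeated pulling runs, with β = 1/k_BT:

\left\langle e^{-\beta W} \right\rangle = e^{-\beta \Delta F}.

The average is taken over an ensemble of SMD trajectories, which is precisely what makes the method decomposable into many independent simulations that can be farmed out across a grid.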

In spite of algorithmic advances, the computational costs of such an approach remain prohibitive. To the best of our knowledge, researchers have never used the SMD-JE approach on problems of the size and complexity needed to address the transport of DNA in protein nanopores. Using a grid infrastructure for this problem, in addition to providing uniform access to multiple replicas, facilitates effective coupling of large-scale computational and novel analysis techniques. The grid approach to SMD-JE lets researchers decompose the problem both by nature and by design.

We’re currently working to compute the complete FE profile. SMD-JE is an approach based on novel physical insight that gives us the ability to decompose a large problem into smaller ones of shorter duration. This becomes particularly effective when used with a grid infrastructure, which provides an environment that enables uniform access to, as well as launching, monitoring, and steering of, numerous application instances over many geographically distributed resources. It’s unclear

whether the SMD-JE approach will let us compute the FE profile of DNA translocation through a nanopore, but our ability to even attempt to address this question with problems of such size and complexity is a direct result of the synergy between state-of-the-art high-performance platforms and advances in grid infrastructure and middleware.

At a certain level, we’ve encountered a demo-paradox: scientists want to get down to the business of doing science, but before they can use the grid for

routine computational science, a lot needs to be done to make the infrastructure stable. In our experience, demonstrations play a significant role in ironing out the wrinkles, but by their very nature, demonstrations tend to be transient working solutions and thus are a distraction from the main purpose of doing science. The grid is there, and people want to use it now! Science can’t wait for grid engineers and computer scientists to get every detail right.

It’s here that a two-pronged approach—fast versus deep—has helped us. At one level, we can work toward a rapid utilization of available infrastructure and technology; at another, we can feed our experiences and advice to resource providers like the US TeraGrid and UK NGS or to standards bodies like the Global Grid Forum (www.ggf.org) to further longer-term research goals for everyone. (See upcoming, related literature for more current details.20)

But if there is a single lesson we’ve learned in our endeavors over the past few years, it’s that grid middleware and associated underlying technologies can and will change. The availability of robust, extensible, and easy-to-use middleware is critical to scientific grid computing efforts in the long term. Well-designed middleware should help us introduce new technologies without requiring that we refactor the application code (or the application scientist). As application-driven switched-lightpath networks become more widespread, for example, the grid middleware should interact “intelligently” with the optical control plane to facilitate end-to-end optical connectivity, enabling it to become just another resource,21 like CPU and storage. The development of such middleware will hopefully receive a boost as more scientists, motivated by success stories, sense the advantage of attempting to use the grid for routine science.

Acknowledgments
This work is based on the efforts of many people over


several years. In particular, we thank Stephen Pickles and his group at the University of Manchester for their work on the computational steering system and Jens Harting and Philip Fowler for their work on the TeraGyroid and grid-based TI projects. We’re grateful to the Engineering and Physical Sciences Research Council (www.epsrc.ac.uk) for funding much of this research through the RealityGrid grant GR/R67699. Our work was partially supported by the US National Science Foundation under the National Resource Allocations Committee (NRAC) grant MCA04N014. We used computer resources at the Pittsburgh Supercomputer Center, the US National Computational Science Alliance, the TeraGrid, and the UK National Grid Service.

References
1. N.T. Karonis, B. Toonen, and I. Foster, “MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface,” J. Parallel Distributed Computing, vol. 63, no. 5, 2003, pp. 551–563.
2. J. Chin et al., “Steering in Computational Science: Mesoscale Modelling and Simulation,” Contemporary Physics, vol. 44, no. 5, 2003, pp. 417–432.
3. J.M. Brooke et al., “Computational Steering in RealityGrid,” Proc. UK E-Science All Hands Meeting, 2003; www.nesc.ac.uk/events/ahm2003/AHMCD/pdf/179.pdf.
4. S.M. Pickles et al., “A Practical Toolkit for Computational Steering,” to be published in Philosophical Trans. Royal Soc. London A, vol. 363, no. 1833, 2005; www.pubs.royalsoc.ac.uk/philtransa.shtml.
5. M. McKeown, “OGSI::Lite—An OGSI Implementation in Perl,” 2003; www.sve.man.ac.uk/Research/AtoZ/ILCT.
6. M. McKeown, “WSRF::Lite—A WSRF Implementation in Perl,” 2003; http://vermont.mvc.mcc.ac.uk/WSRF-Lite.pdf.
7. S. Succi, The Lattice Boltzmann Equation for Fluid Dynamics and Beyond, Oxford Univ. Press, 2001.
8. P. Lallemand and L.-S. Luo, “Theory of the Lattice Boltzmann Method: Dispersion, Dissipation, Isotropy, Galilean Invariance, and Stability,” Physical Rev. E, vol. 61, no. 6, 2000, pp. 6546–6562.
9. J.M. Seddon and R.H. Templer, Polymorphism of Lipid-Water Systems, Elsevier Science, 1995, pp. 97–160.
10. P. Gutmann, “PKI: It’s Not Dead, Just Resting,” Computer, vol. 35, no. 8, 2002, pp. 41–49.
11. B. Beckles, “Removing Digital Certificates from the End User’s Experience of Grid Environments,” Proc. UK E-Science All Hands Meeting, 2004.
12. P. Gutmann, “Plug-and-Play PKI: A PKI Your Mother Can Use,” Proc. 12th Usenix Security Symp., Usenix Assoc., 2003, pp. 45–58.
13. G. Waksman et al., “Crystal Structure of the Phosphotyrosine Recognition Domain SH2 of v-src Complexed with Tyrosine-Phosphorylated Peptides,” Nature, vol. 358, 1992, pp. 646–653.
14. A.R. Leach, Molecular Modelling: Principles and Applications, 2nd ed., Pearson Education, 2001.
15. C. Chipot and D.A. Pearlman, “Free Energy Calculations: The Long and Winding Gilded Road,” Molecular Simulation, vol. 28, 2002, pp. 1–12.
16. W. Yang, R. Bitetti-Putzer, and M. Karplus, “Free Energy Simulations: Use of Reverse Cumulative Averaging to Determine the Equilibrated Region and the Time Required for Convergence,” J. Chemical Physics, vol. 120, no. 6, 2004, pp. 2618–2628.
17. P. Fowler, S. Jha, and P. Coveney, “Grid-Based Steered Thermodynamic Integration Accelerates the Calculation of Binding Free Energies,” to be published in Philosophical Trans. Royal Soc. London A, vol. 363, no. 1833, 2005; www.pubs.royalsoc.ac.uk/philtransa.shtml.
18. C. Jarzynski, “Nonequilibrium Equality for Free Energy Differences,” Physical Rev. Letters, vol. 78, 1997, p. 2690.
19. S. Park et al., “Free Energy Calculation from Steered Molecular Dynamics Simulations Using Jarzynski’s Equality,” J. Chemical Physics, vol. 119, no. 6, 2003, p. 3559.
20. P.V. Coveney, ed., “Scientific Grid Computing,” to be published in Philosophical Trans. Royal Soc. London A, vol. 363, no. 1833, 2005; www.pubs.royalsoc.ac.uk/philtransa.shtml.
21. G. Karmous-Edwards, “Global E-Science Collaboration,” Computing in Science & Eng., vol. 7, no. 2, 2005, pp. 67–74.

Peter V. Coveney is a professor in physical chemistry and director of the Centre for Computational Science at University College London, where he is also an honorary professor of computer science. His research interests include theoretical and computational science; atomistic, mesoscale, and multiscale modeling; statistical mechanics; high-performance computing; and visualization. Coveney has a DPhil in chemistry from the University of Oxford. He is a fellow of the Royal Society of Chemistry and the Institute of Physics. Contact him at [email protected].

Jonathan Chin is an Engineering and Physical Sciences Research Council research fellow at the Centre for Computational Science at University College London. His research interests include mesoscale modeling, visualization, and grid and high-performance computing. Chin has an MSci in physics from the University of Oxford. Contact him at [email protected].

Shantenu Jha is a postdoctoral research fellow at the Centre for Computational Science at University College London. His research interests include grid computing and computational physics. Jha has degrees in computer science and physics from Syracuse University and the Indian Institute of Technology, Delhi. He is a member of the Global Grid Forum and the American Physical Society. Contact him at [email protected].

Matthew Harvey is an Engineering and Physical Sciences Research Council postdoctoral research fellow at the Centre for Computational Science at University College London. His research interests include grid computing and combinatorial chemistry. Harvey has degrees in astrophysics and information technology from University College London. Contact him at [email protected].