

Journal of Grid Computing (2005) 3: 39–51 © Springer 2005. DOI: 10.1007/s10723-005-5464-y

Scientific Data Management in a Grid Environment

H.A. James and K.A. Hawick
Institute of Information and Mathematical Sciences, Massey University, Albany, North Shore 102-904, Auckland, New Zealand
E-mail: {h.a.james,k.a.hawick}@massey.ac.nz

Key words: data management, data mining, Grid systems, metadata, parameter cross-products

Abstract

Managing scientific data is by no means a trivial task even in a single site environment with a small number of researchers involved. We discuss some issues concerned with posing well-specified experiments in terms of parameters or instrument settings and the metadata framework that arises from doing so. We are particularly interested in parallel computer simulation experiments, where very large quantities of warehouse-able data are involved, run in a multi-site Grid environment. We consider SQL databases and other framework technologies for manipulating experimental data. Our framework manages the outputs from parallel runs that arise from large cross-products of parameter combinations. Considerable useful experiment planning and analysis can be done with the sparse metadata without fully expanding the parameter cross-products. Extra value can be obtained from simulation output that can subsequently be data-mined. We have particular interests in running large scale Monte Carlo physics model simulations. Finding ourselves overwhelmed by the problems of managing data and compute resources, we have built a prototype tool using Java and MySQL that addresses these issues. We use this example to discuss type-space management and other fundamental ideas for implementing a laboratory information management system.

1. Introduction

A common modus operandi for computational scientists running numerical simulations is shown in Figure 1. A numerical model for the phenomena under study is constructed. The model is initialised and is spun-up into a realistic or at least representative state, whereupon measurements can be taken. Depending upon the model involved, measurements are made from static configurations, which may be stored separately, or measurements are made as part of the evolutionary process of taking the model configuration from one state to another. These configurations can usefully be warehoused for later mining, as discussed in, for example, [29].

Some important examples include numerical models for weather and climate study [6], where a set of model variables such as atmospheric temperature, pressure and wind velocity are time-evolved from one configuration to the next, to predict how real weather systems will develop. Climate study is similar except that the time scales simulated are much longer and the model granularity generally coarser. Other models in computational physics and engineering studies fall into this general pattern of operation. Some examples we consider in this paper (Section 3) are Monte Carlo lattice models [7]; stochastic network models [8]; and artificial life growth models [9].

The field of Laboratory Information Management Systems (LIMS) is quite mature in the chemical, pharmaceutical and life sciences. For example, there exist web sites such as LIMSource [13] that provide information on LIMS products such as BlazeLIMS [3], Sapphire [11] and STARLIMS [24] for scientists and IT managers. Aside from being targeted towards the “wet” sciences and not at academic simulations management, they do not seem to be “Grid-aware” applications.

In running models that have even a few separate parameters it is necessary to manage the range and combinations of parameters. Sometimes the (computational) cost of running a model is small and it is feasible to throw away the configuration outputs and just preserve the few measurements that are made during the run. Sometimes, however, it is either too expensive to justify re-running models with the same parameters, or it is important for legal or other operational reasons to keep all model output in an archive. Ideally the working computational scientist would like to afford the storage capacity to preserve the output from all past runs for possible future further analysis or for bootstrapping new model runs. There are tradeoff costs for this storage that must be weighed against the computational cost of regenerating data from runs. However, above the physical cost of storage media are “total cost of archiving” issues that need to be considered more deeply.

Figure 1. Flow (from left to right) of archive-able data from simulation runs in a common pattern for numerical experiments. Many long-running simulation systems are designed either to repeatedly perform a limited number of simulation steps and save the output before starting again, or to save their state in a checkpoint format that can be used to re-start a failed program.

We explore one important complexity contribution to the total cost of archiving. Managing data that is simulated from codes that are continually evolving in a way which is forward-compatible is nontrivial. The output formats are likely to differ slightly as the codes evolve. One step towards this is to consider the cross-products of all possible parameter values that could be used and to explore the implications of labelling experimental run outputs by these parameter values. Another step is to use a robust textual tagging of output values that offers some defence against changing output formats through the use of generalised data extraction utilities and scripts.

Consider a simulation with just two parameters, as shown in Figure 2. We can imagine that the computational scientist steering this simulation does not systematically explore all of the possible parameter space, but rather poses some preliminary experiments that explore combinations of parameter values spanning some area of parameter space that they expect to have “interesting” properties. Parameter one is represented by rows, and parameter two by columns. The scientist has carried out runs over the parts of parameter space represented by dots in the figure, and subsequently wishes to keep track of the “runs” obtained. Run data may be reused later and to have “archive value” it must be easily retrievable. An important point is that the number of parameter combinations explored in an experiment may be far fewer than the number of possible combinations that would be explored by a brute-force approach.

It has been our own experience that scientists often use some ad-hoc approach to keeping track of model output data, often involving file or directory names, perhaps with some README files to describe what experiments have been done. This is analogous to the online “lab notebook” – easy to use for very small sets of runs but rapidly becoming hard to manage for even just a three-parameter experiment. A common situation when running our stochastic simulations is that we rapidly accumulate new parameters as we develop the model. Two main ones accrue from the core model itself, a further one is the sample number if we are averaging over many different stochastic simulation sequences, and a fourth is the random number generator seed if we wish to keep track of separate repeatable stochastic sequences. It is common to rely on metadata [30] in the form of long file names or sometimes in terms of headers stored in the configuration files themselves. These can, however, be opaque to the browsing scientist planning a follow-on experiment.

Figure 2. Sparse cross-product of a two-parameter system. A simple two-parameter system where various experiments have been carried out yielding non-null entries in the database or matrix of possible values. The extreme values of the parameters and the discretisation scheme have set the bounds and size of the matrix table. The outer product need not (and likely will not) be fully populated. The dots represent those known (or explored) parameter combinations, which are likely to number far fewer than the potential number of possible parameter combinations. In this example P1 and P2 are just integers mapping directly to their indices.

In this paper we explore how relational database technology can be augmented with some simple conventions and easily produced tools to help manage more complex batches of runs and archives of data.

2. A Virtual Spreadsheet

Consider the parameters available in a multi-dimensional virtual spreadsheet. We have a hyper-space of parameter values that is sparse, as not all combinations of parameters are necessarily deemed worth running, and of course the actual values used over a range will be limited. We want the scientist to be able to pose the questions:
− What have I run already? and
− What can I now run to pose a new research question, making best use of my existing data sets?

We want the “housekeeping” operations for keeping track of the data to be as automated as possible, while still being compatible with the simulation programs already in use. We also want new programs to be able to easily access previous data.

One classic approach to these problems is to write simulation programs that log all sorts of extra information and measurements as well as the parameters actually used. The log files can then be scanned for relevant information and often new questions can be posed using old data that was generated before the question had been thought of. We have often adopted conventions for logging data to support this. For instance, a textual tag is invented to describe the numerical measurement in question and the output value is prefixed with this tag in the log file. Standard Unix tools such as grep, cut and paste, and other text manipulation programs or languages such as perl or python, can readily be combined into scripts to extract relevant values from an archive of log files, in a form suitable for plotting, for example. Another common technique is to encode some parameter values in the filenames or in the names of the directories. This can aid casual browsing and experiment planning up to a point, but rapidly becomes cumbersome when many values are involved. How can the scientist view and visualise the sparse hyper-space of parameters, and hence assess existing data availability and plan experiments?
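The tagging convention described above can be illustrated with a short Java sketch. The tag names (ENERGY, MAGNETISATION) and the "TAG value" line layout are assumptions made for this example only; the text does not specify the actual log format.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: pulls tagged numeric values out of log lines. */
final class TaggedLog {
    /**
     * Returns the values of every line of the assumed form "TAG value ...",
     * in the order they appear. Lines carrying other tags are skipped.
     */
    static List<Double> extract(List<String> lines, String tag) {
        List<Double> values = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            // Compare the whole first field so "ENERGY" does not match "ENERGY2".
            if (fields.length >= 2 && fields[0].equals(tag)) {
                values.add(Double.parseDouble(fields[1]));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "ENERGY -1.25",
            "MAGNETISATION 0.5",
            "ENERGY -1.5");
        System.out.println(extract(log, "ENERGY"));
    }
}
```

Because values are located by tag rather than by column position or line number, a newer program version that logs additional tagged quantities does not break this kind of extraction.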

Imagine a virtual spreadsheet that supports looking at any two axes from the hyper-dimensional data set. We would like the virtual spreadsheet tool to allow specification of these axes and to cope with what is a sparse set of data that may not be stored online or locally. The posed queries or high level commands can, in principle, be organised by the virtual spreadsheet tool into the necessary data retrieval requests to be scheduled and the resulting “plot” assembled. Ideally the tool would have enough information to at least estimate how long satisfying the request will take, if it is not in “interactive time”.

3. Application Examples

A recent real experiment of interest to us involved the simulation of the Ising model under Small-World conditions, as described in [7, 8]. Small-World systems employ spatial shortcuts in the lattice or graph of sites. They typically cause dramatic changes in the behaviour of a model. The Small-World Ising system we are investigating displays a marked shift in its critical temperature when shortcuts are introduced, and we are systematically measuring the dependence of this shift on the probability with which shortcuts are introduced. This whole experiment involves a careful and computationally demanding investigation over selected regions of the model parameter space. The model uses Markov-Chain Monte Carlo (MCMC) [15] sampling of a data space. There are four major parameters that we wish to track, as well as at least two extra parameters arising from the experiment that can be significant. These parameters are shown in Table 1. Some of the parameters combine to give other properties: for example, the total number of points in the system is the number of lattice points per dimension raised to the power of the dimensionality. This number can then be multiplied by the probability of Small-World shortcuts on the lattice to give the number of lattice points that need to be modified. Recording the random number seed ensures that any experiment we perform is reproducible, an important consideration in any scientific investigation.
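The two derived quantities mentioned above, the total number of sites N = L^d and the number of sites to modify p × N, can be computed as in the following sketch. The class and method names are hypothetical, and rounding p × N to the nearest integer is an assumption; the text does not say how a fractional result is handled.

```java
/** Hypothetical helper for quantities derived from the parameters d, L and p. */
final class LatticeSize {
    /** Total number of sites, N = L^d, using exact integer arithmetic. */
    static long totalSites(int pointsPerDimension, int dimensions) {
        long n = 1;
        for (int i = 0; i < dimensions; i++) {
            // multiplyExact throws ArithmeticException instead of silently overflowing.
            n = Math.multiplyExact(n, (long) pointsPerDimension);
        }
        return n;
    }

    /** Number of sites to modify, p * N, rounded to the nearest integer (an assumption). */
    static long sitesToModify(long totalSites, double shortcutProbability) {
        return Math.round(shortcutProbability * totalSites);
    }

    public static void main(String[] args) {
        long n = totalSites(32, 3); // a 32 x 32 x 32 lattice
        System.out.println(n + " sites, " + sitesToModify(n, 0.001) + " to modify");
    }
}
```

Even the largest case allowed by Table 1 (L = 1024, d = 5) gives 2^50 sites, which still fits in a Java long.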

Our simulation was originally developed by choosing an initial temperature T and shortcut probability p, and then refining the values as the simulation produced results. When we had identified a promising (or “interesting”) area of the T × p × L (number of lattice sites) parameter space we performed a production run using the local supercomputer cluster. Each simulation took approximately 28 hours to complete 11 million update steps. By the end of the study we had produced approximately 0.7 TB of data. The problem was organising it and searching through it in an efficient manner. Our first approach was to write a series of Unix shell scripts. We used long-named data files, including most of the relevant parameters either as part of the file name or at least as comment-style metadata within the file. The scripts were a good ad-hoc solution – they enabled us to do basic searching and sorting on the data, and allowed us to execute our custom-written analysis programs on the data files. However, they were not good at helping us identify “holes” in the data. It is particularly difficult to develop statistical analysis scripts that can cope with gaps in the data space.

Parallel supercomputers and clusters are often shared resources (Figure 3). Ours is no exception, being shared between computer scientists, computational chemists and computational biologists; it is not under our direct control. We observed that sometimes, due to queue failures or lack of scratch space, our simulations would either not start, crash upon startup, crash part-way through the simulation, or simply freeze and not make any progress. The latter condition was later identified as a transient hard-drive problem in a small number of the compute nodes. To compound matters, we were using the resulting configurations from previous runs as a starting point for subsequent runs in the same parameter space. When a “gap” was identified this meant that before we could start the next iteration of that configuration the gap would have to be filled. Because of the number of jobs that were being created as the T × p × L product, it was very difficult to spot a single job that had failed. A series of scripts written to help with this task proved to be large and quite unwieldy. An integrated approach allows us to harvest data from the distributed nodes directly into the database – or even to keep track of data that remains distributed amongst nodes’ local disks. We believe these sorts of operational problems are quite common amongst computational scientists.

Table 1. An enumeration of the relevant parameters in our Small-World Ising model simulations. The abbreviation used to refer to each parameter is given in parentheses after the parameter’s description. Some parameters are marked as keys, signifying they are crucial quantities to keep track of; in practice we use these parameters to define the data table’s primary key. For some parameters, such as the seed, the min, max and stride values have no real meaning, as the random number generator will use a random seed for each run.

Parameter description                    Key?   Min     Max        Stride
Dimensionality of study (d)              Yes    1       5          1
No. lattice points per dimension (L)     Yes    1       1024       1
Temperature of system (T)                Yes    4.000   5.000      10^-3
Prob. of Small-World short-cut (p)       Yes    0       1          10^-9
No. update steps in data file (steps)    No     1       99999999   1
Random number generator seed (s)         No     0       99999999   1
Update method (u)                        No     0       1          1

Figure 3. Screen capture showing our Small-World Ising program input and outputs. Note this example shows the program run with 500 steps between saving states; in the experiments we discuss in this paper values of 50,000 and 500,000 were used.

Our data analysis programs also require exploration of different parts of the parameter space we measured. For example, some require the first series of data collected for each of the different values of the parameters’ cross-product, while others require all the series for a given value of the parameters’ cross-product in a single sequence. It has been quite difficult maintaining the scripts necessary to extract all the required data in a portable manner and to cope with the exception handling routines for dealing with missing data.

We also run our own programs which simulate ad-hoc network structures [7, 8] and Artificial Life (ALife) [9] predator–prey models for studying species evolution. The ad-hoc network simulation involved the variation of four distinct parameters: the number of radio transmitter sites in the simulation, the radius of perception of each transmitter, the type and degree to which the network was perturbed by Small-World effects, and the random number seed for configuration. A parameterised study was performed using a wide range of parameter values; not all possible parameter values between the minimum and maximum were used. Searching the database of values means that it is not necessary to compute the complete parameter cross-product. Our aim was to steer the experiment while it was in progress, through the modification of parameters, to investigate interesting phenomena.

The ALife simulation uses six independent parameters to represent predator and prey birth rates, their longevity, evolutionary periods and a random number seed for configuration information. This model was particularly interesting as we had parallelised it. The simulation was implemented as a parallel program using an optimal number of 42 processors. Configurations had to be consistently stored to obtain meaningful statistics on the experimental runs. When analysing the results of this experiment we had to ensure that individual processors’ output logs were safely archived to prevent data being lost, particularly should subsequent manual intervention be required for exception handling.

In summary, these simulations require a mix of integer and floating-point parameters, and have potentially large parameter spaces to explore. Gaps in the run-sequences are difficult to cope with for statistical analysis purposes without a good management framework.

4. Framework Architecture

Our prototype scientific data management framework is based around the use of a MySQL database [19] and a Java driver program using the Java Database Connectivity (JDBC) package. We have defined the table structure in a way that we hope will lend itself to representing many different types of simulation systems. Each simulation system will have its own experimental parameters. The parameters that have been defined for our Ising experiment are shown in Table 1. The database not only records the metadata about the parameters such as their description and minimum and maximum values, but also the values of the parameters that have actually been used in real experiments.

Our control scripts have been modified to test for successful experiment completion. In the case of our Ising model a simple example of this is to ensure the number of lines in the output log file is precisely 10 million, and that there exists a final configuration file. After the experiment is deemed to be successful the file is moved to a known (standard) directory and the relevant data on the experiment run is inserted into the database. A typical output log from our Ising experiment is 50 MB in size, which needs to be subsequently analysed. Instead of copying the actual data into the database (perhaps as a Binary Large Object – BLOB), in the prototype we insert the absolute path name to the file in the database.
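A completion test of the kind described above might look like the following Java sketch. The class name, method name and argument order are hypothetical, and the log is assumed to be a UTF-8 text file; the control scripts themselves are not shown in the text.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical completion test for a single simulation run. */
final class RunCheck {
    /**
     * A run counts as complete when its log file has exactly expectedLines
     * lines and a final configuration file exists.
     */
    static boolean isComplete(Path log, Path finalConfig, long expectedLines)
            throws IOException {
        if (!Files.isRegularFile(log) || !Files.isRegularFile(finalConfig)) {
            return false;
        }
        // Stream the file so that a large log is never held in memory at once.
        try (Stream<String> lines = Files.lines(log)) {
            return lines.count() == expectedLines;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: RunCheck <log> <final-config> <expected-lines>");
            return;
        }
        System.out.println(isComplete(
            Path.of(args[0]), Path.of(args[1]), Long.parseLong(args[2])));
    }
}
```

Only when such a check passes would the run's file path and parameter values be recorded in the database, so that a failed or frozen job shows up as a gap, not as a misleading entry.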

Figure 4. Layered software stack diagram showing indirect access to the filesystem via the database.

We recognised very early that our scientific data archive will likely consist of experiments with a sparse cross-product of parameters as shown in Figure 2. Using a series of SELECT statements allows us to determine whether all the required data is in the database. It also allows us to select only sequences of data that we are interested in for our data analysis programs. While the ability for a tool to compute the necessary cross-product of a parameter set and initiate experiments is not new, cf. Abramson’s Nimrod toolkit [1], as we conceive of new data ranges and analysis techniques that we would like to investigate, we are able to play “what if” games and interrogate the database as to what data is already in the database, and what will need to be generated. In the case that more data may need to be generated, the database queries are able to output the specific values that are required to generate execution scripts and then schedule the jobs.
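The “what is already there, and what still needs to be generated” question can be sketched in memory as a set difference between the requested combinations and the completed ones. This is only an illustration of the logic: in the prototype the completed set would come from SELECT queries, and the class name, key format and restriction to two parameters (T and p) are all assumptions made here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical gap finder over a two-parameter (T, p) request. */
final class GapFinder {
    /** Builds the lookup key used for one (T, p) combination. */
    static String key(String temperature, String probability) {
        return "T=" + temperature + ",p=" + probability;
    }

    /**
     * Returns the requested combinations that are absent from completedKeys,
     * in row-major order (T outer, p inner).
     */
    static List<String> missing(List<String> temperatures,
                                List<String> probabilities,
                                Set<String> completedKeys) {
        List<String> gaps = new ArrayList<>();
        for (String t : temperatures) {
            for (String p : probabilities) {
                String k = key(t, p);
                if (!completedKeys.contains(k)) {
                    gaps.add(k);
                }
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        Set<String> done = Set.of(key("4.000", "0.1"), key("4.001", "0.2"));
        System.out.println(missing(
            List.of("4.000", "4.001"), List.of("0.1", "0.2"), done));
    }
}
```

The returned list is exactly what a job generator needs: each entry names one run whose parameter values are not yet in the archive.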

In future versions of the framework we hope to extend our graphical user interface to easily allow novice users to interrogate the system, using some of the ideas mentioned in Section 2. The underlying tools used by the first version of our prototype use a programmer interface and a library of interface routines that can be linked with our simulation programs written in C++ or Java.

Our management framework can be used with any program that allows parameters to be specified on the command line, and produces output files, either named explicitly or captured through the Unix stdout stream. Programs do not require re-compilation to use the framework.

We recognise that many application scientists are uncomfortable with having to remember complicated database access routines. To alleviate the need for users to remember the routines we introduce a new Library layer to shield the user. This library layer is shown in Figure 4. The library actually serves two purposes. The first is to shield the users from the database. The second is actually to protect the database from the users.

A significant problem could arise if the files that store the raw experimental data are moved, deleted, or renamed. We have considered a number of options, including creating a file system space that is only writable by programs using the database as an access control mechanism, but as yet this issue is unresolved. We can also write our own “safe” versions of the cp and mv programs that update the database as any data files are moved. These new programs would work in conjunction with the library layer to ensure the database’s consistency in the face of different file-system operations.

In an idea similar to that reported in [21] we are attempting to “fool” the application program into believing that separate data files exist, when they are in fact pseudo-files accessed through the underlying database and manipulated by our database access routines. The application and user write to files in the normal way, but the calls are intercepted by system-level library functions that access the database rather than the normal file-system.

5. Database Implementation

In our current prototype we have identified a number of issues that we have not yet been able to adequately resolve. These include: the difficulty in assigning data ranges; ensuring links to data files remain valid; and computing the cross-product of a variable number of parameters. These points are discussed below.

A naive implementation of parameter spaces, as used above, is useful when a simple range of parameter values is required. It works simply because most parameters to our modelling experiments are continuous variables that we can give a reasonable delta (or smallest change amount). A problem arises when we wish to represent a data range which is not sampled at even intervals. An example of this is when the scientist wishes to have their values evenly spaced on a log–log graph: the anti-log values are not evenly spaced. The only real solution to this is either to specify a formula for the calculation of values in the required range, or alternatively to enumerate the entire set of valid values. Our current solution to the problem of uneven parameter values has been to define a ListParameter class that simply maintains a list of valid values; for variables able to take every value in the range of [min, max] we use a RangeParameter class.
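The distinction between the two parameter classes can be sketched as follows. Only the class names ListParameter and RangeParameter come from the text; the common Parameter interface, the constructors and the use of BigDecimal are assumptions made for illustration and are not the prototype's actual code.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

/** Assumed common view of a parameter: the ordered values it may take. */
interface Parameter {
    List<BigDecimal> values();
}

/** Evenly spaced values min, min + stride, ... up to and including max. */
final class RangeParameter implements Parameter {
    private final BigDecimal min, max, stride;

    RangeParameter(String min, String max, String stride) {
        this.min = new BigDecimal(min);
        this.max = new BigDecimal(max);
        this.stride = new BigDecimal(stride);
        if (this.stride.signum() <= 0) {
            throw new IllegalArgumentException("stride must be positive");
        }
    }

    @Override
    public List<BigDecimal> values() {
        List<BigDecimal> out = new ArrayList<>();
        for (BigDecimal v = min; v.compareTo(max) <= 0; v = v.add(stride)) {
            out.add(v);
        }
        return out;
    }
}

/** Explicitly enumerated values, e.g. points evenly spaced on a log axis. */
final class ListParameter implements Parameter {
    private final List<BigDecimal> values;

    ListParameter(List<String> values) {
        List<BigDecimal> parsed = new ArrayList<>();
        for (String v : values) {
            parsed.add(new BigDecimal(v));
        }
        this.values = List.copyOf(parsed); // immutable snapshot
    }

    @Override
    public List<BigDecimal> values() {
        return values;
    }
}
```

For example, `new RangeParameter("4.000", "5.000", "0.001")` would describe the temperature axis of Table 1, while `new ListParameter(List.of("1e-9", "1e-6", "1e-3"))` describes log-spaced shortcut probabilities. Materialising a whole range as a list is done here only for brevity; a range with a very fine stride would be better generated on demand.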

At present we represent all numerical parameters as a fixed-length (Java) string and perform increment/decrement operations using the fixed length strings. This is done for two reasons: firstly, when our data analysis programs iterate over the raw data files they have a consistent representation of the data value, and secondly, we wish to specify a value’s precision unambiguously: we don’t want to have any rounding effects (due to the floating point representation) creeping into the system. Thus we will not end up with a situation where 2.00000000001 is stored instead of 2.0 due to rounding errors and machine epsilons.
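One way to obtain exact, fixed-width stepping of string-valued parameters in Java is to route the arithmetic through java.math.BigDecimal. This is a swapped-in illustration of the idea, not the prototype's own string arithmetic, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;

/** Hypothetical helper for stepping parameter values held as decimal strings. */
final class DecimalStep {
    /**
     * Adds stride to value exactly and renders the result with a fixed
     * number of fractional digits, so "1.999" + "0.001" gives "2.000".
     */
    static String increment(String value, String stride, int fractionDigits) {
        BigDecimal next = new BigDecimal(value).add(new BigDecimal(stride));
        // setScale without a rounding mode throws ArithmeticException
        // if any non-zero digit would have to be discarded.
        return next.setScale(fractionDigits).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(increment("1.999", "0.001", 3));
        // The same step in binary floating point is not exact:
        System.out.println(0.1 + 0.2 == 0.3);
    }
}
```

Because the textual form is canonical, the same string can serve both as a database key and as a token to match against values in the raw data files.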

We often wish to inspect the database to find out what data is present and what is missing. This involves a cross-product of the parameters used to record the data. A large problem is then efficiently and algorithmically computing cross-products of variable numbers of parameters. The number of elements in a cross-product grows multiplicatively with each new parameter added to the vector, and hence exponentially in the number of parameters. The traditional method for enumerating the cross-products is via nested loops. In the situation that the number of parameters is not known in advance, the enumeration is difficult to achieve other than by brute force.

We have defined a ParameterArray class that can be used to group together different ListParameters and RangeParameters. The array can be iterated over, producing each element of the cross-product in the parameter range, for an arbitrary number of parameters. The major benefit of this implementation is that it is not necessary to evaluate every member of the cross-product if every member is not required.
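A lazy cross-product over an arbitrary number of axes can be written without nested loops by keeping one index per axis and advancing the indices like the digits of an odometer. The sketch below illustrates that technique only; the class name CrossProduct and its use of plain string lists are assumptions, and it is not the ParameterArray implementation itself.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Lazily iterates over the cross-product of any number of value lists. */
final class CrossProduct implements Iterable<List<String>> {
    private final List<List<String>> axes;

    CrossProduct(List<List<String>> axes) {
        this.axes = axes;
    }

    @Override
    public Iterator<List<String>> iterator() {
        boolean anyEmpty = false;
        for (List<String> axis : axes) {
            if (axis.isEmpty()) {
                anyEmpty = true; // an empty axis makes the whole product empty
            }
        }
        final boolean startDone = anyEmpty;
        return new Iterator<List<String>>() {
            // One index per axis, advanced like the digits of an odometer.
            private final int[] index = new int[axes.size()];
            private boolean done = startDone;

            @Override
            public boolean hasNext() {
                return !done;
            }

            @Override
            public List<String> next() {
                if (done) {
                    throw new NoSuchElementException();
                }
                List<String> tuple = new ArrayList<>(index.length);
                for (int i = 0; i < index.length; i++) {
                    tuple.add(axes.get(i).get(index[i]));
                }
                advance();
                return tuple;
            }

            /** Increments the last axis, carrying into earlier axes on wrap-around. */
            private void advance() {
                for (int i = index.length - 1; i >= 0; i--) {
                    if (++index[i] < axes.get(i).size()) {
                        return;
                    }
                    index[i] = 0;
                }
                done = true; // every axis wrapped: the product is exhausted
            }
        };
    }

    public static void main(String[] args) {
        List<List<String>> axes = List.of(
            List.of("4.000", "4.001"), List.of("0.1", "0.2", "0.3"));
        for (List<String> tuple : new CrossProduct(axes)) {
            System.out.println(tuple);
        }
    }
}
```

Each tuple is built only when `next()` is called, so a caller that stops early never pays for the unexplored remainder of the cross-product.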

6. Tools

In this section we describe the virtual spreadsheet tool we have prototyped using the Java Swing [25] graphical user interface (GUI) library components and the JDBC package wrapping around a MySQL database. Our tool was built to meet our pragmatic need to manage large numbers of simulations that have been run on various cluster computer resources over a six-month time period.

The JDBC technology for interfacing to relational databases is well established and need not be described here. In summary, various Java classes and methods wrap around the database server in a client-server software model. The Java Swing JTable has provided the basis for our virtual spreadsheet GUI and deserves some comment. It provides standard GUI widget behaviours for a tableau of edit-able cells that has higher run-time performance than would a simple array of separate text-field widgets. Its basis is an interface specifying method signatures for accessing, editing and counting the cells in the tableau. We map a two-dimensional tableau to a cut through our hyper-brick of parameters. The power of being able to construct partial cross-products is that we need not expand them in full, and that we can manipulate entire swathes of metadata visually.

Figure 5. A view of a JTable showing edit-able cells and a statistical report generated from them.

Figure 5 shows some of the capabilities of our system. A view is being generated of some derived data from a simulation. It is suitably sorted and displayed as a sheet report, with some simple statistical measurements also shown. Generally for the sort of experiments we report here (such as the Ising simulations) statistics are not generated directly within the tool but are accumulated as output from separate analysis programs. It is not uncommon for specially optimised analysis programs to be written for studies of this sort, and the management tool needs to be flexible enough to accommodate these.

Our Ising experiment uses a cross-product of parameter values given by:

p × T × d × L × s × u × steps, (1)

where each parameter is described in Table 1. We envisage that an “experiment” will generally represent a body of work such as we describe: namely, one or more simulation programs that may have separate versions; a set of adjustable input parameters; and a resulting collection of output files that will typically occupy considerable disk space. The main aim of our tool is to support the design and operation of numerical experiments like this, so that better use can be made of compute resources and of previously-generated data.
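One way to work with a cross-product such as Equation (1) without expanding it in full is to store only the per-parameter value lists and to decode a single combination on demand. The sketch below assumes this representation. The class name `CrossProduct` and the value lists in `main` are illustrative and are not taken from Table 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a parameter cross-product held as per-axis value lists.
class CrossProduct {
    private final List<List<String>> axes;

    CrossProduct(List<List<String>> axes) { this.axes = axes; }

    // Number of combinations: the product of the axis lengths.
    long size() {
        long n = 1;
        for (List<String> axis : axes) {
            n = Math.multiplyExact(n, (long) axis.size());
        }
        return n;
    }

    // Decode one combination from its index (mixed-radix; last axis varies fastest).
    List<String> get(long index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        String[] combo = new String[axes.size()];
        for (int i = axes.size() - 1; i >= 0; i--) {
            List<String> axis = axes.get(i);
            combo[i] = axis.get((int) (index % axis.size()));
            index /= axis.size();
        }
        return Arrays.asList(combo);
    }

    public static void main(String[] args) {
        List<List<String>> axes = new ArrayList<>();
        axes.add(Arrays.asList("3"));                 // d
        axes.add(Arrays.asList("40", "44", "48"));    // L
        axes.add(Arrays.asList("4.500", "4.501"));    // T (illustrative subset)
        CrossProduct cp = new CrossProduct(axes);
        System.out.println(cp.size());
        System.out.println(cp.get(3));
    }
}
```

The storage needed is the sum of the axis lengths, while the number of addressable combinations is their product, so planning queries such as "how many runs would this experiment need" stay cheap.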

The classical modus operandi that we and many colleagues use is, having created a simulation program, to write various shell scripts that generate job runs and to use these to generate output files. Output is often organised rather simply either with parameters embedded in the filenames or sometimes in the names of subdirectories. Recognising this, our tool is designed to import metadata in the form of path information – file and directory names. Regular expression utilities such as are provided by the Java String class are useful for parsing prior filename metadata. It is perfectly satisfactory for the tool to manage the files in their existing naming scheme, providing this file and path information can be left fixed and stored as indirect addresses in the database.
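Importing metadata from an existing naming scheme can be sketched with `java.util.regex`, the package that `String.matches` and `String.split` delegate to. The directory layout `d<d>/L<L>/ising_T<T>_p<p>.dat` used here is purely hypothetical, as is the class name `PathMetadata`. A real import would need a pattern written for the scheme actually in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: recover parameter metadata from a file path.
class PathMetadata {
    // Assumed layout: .../d3/L40/ising_T4.500_p0.010.dat
    private static final Pattern LAYOUT = Pattern.compile(
        "(?:.*/)?d(?<d>\\d+)/L(?<L>\\d+)/ising_T(?<T>\\d+\\.\\d+)_p(?<p>\\d+\\.\\d+)\\.dat");

    // Returns parameter name -> value string, plus the untouched path as an
    // indirect address; returns an empty map if the path does not match.
    static Map<String, String> parse(String path) {
        Map<String, String> meta = new LinkedHashMap<>();
        Matcher m = LAYOUT.matcher(path);
        if (!m.matches()) {
            return meta;
        }
        for (String name : new String[] {"d", "L", "T", "p"}) {
            meta.put(name, m.group(name));
        }
        meta.put("path", path);
        return meta;
    }

    public static void main(String[] args) {
        System.out.println(parse("runs/d3/L40/ising_T4.500_p0.010.dat"));
        System.out.println(parse("README.txt").isEmpty());
    }
}
```

The values are kept as strings and the path is stored unchanged, so the files can stay where they are while the database holds only the parsed parameters and the address.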

Generally the post-run analysis of simulations such as we describe is much too computationally intensive and the data sets too large to load it all up into a conventional spreadsheet program. A tool such as the one we describe is needed to manage the subsequent analysis processing runs that will let loose a highly optimised statistical analysis sub-program across the bulk data files.


Figure 6. A view from our tool showing summary statistical information of the parameter ranges in our experiments. Note the statistics are derived only from metadata parameters, not from the millions of measured data values the metadata indirectly addresses.

In the case of our Ising runs described above, each output data set consists of 2 or 3 measured values at each of upwards of 11 million steps. Part of the experiment is to use different statistical analysis techniques to pass over the data constructing statistics and correlation functions, and other values derived from them. At present our data set consists of over 9 posed experiments, each with its own parameter cross-products and typically more than 150 output files per experiment, each file or set of files storing 11 million data triplets.
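As a rough order-of-magnitude check on these figures, the quoted numbers can be multiplied out. This treats them as lower bounds ("over 9", "more than 150", "upwards of 11 million") and assumes for simplicity that each output file holds one full run. The symbols N_files, N_steps and N_values are introduced here for illustration only.

```latex
\begin{align*}
N_{\text{files}}  &\gtrsim 9 \times 150 = 1350, \\
N_{\text{steps}}  &\gtrsim 1350 \times 1.1\times10^{7} \approx 1.5\times10^{10}, \\
N_{\text{values}} &\approx (2\text{ to }3) \times N_{\text{steps}}
                   \approx 3.0\times10^{10}\text{ to }4.5\times10^{10}.
\end{align*}
```

On these assumptions the metadata amounts to about a thousand file records, while the data they address runs to tens of billions of measured values. This is why planning and summary queries are best answered from the metadata alone.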

Having collected this data and used each experiment to attack a particular research question, it is proving valuable to subsequently mine the database for other trends and correlations. New research questions have arisen from close examination of the outcomes of the simpler experiments. The tool helps highlight gaps in the existing parameter space coverage of the problem and helps plan subsequent runs. Working on a simple proportional depreciation value model for our supercomputer cluster, the existing data cost approximately NZ$50k to generate. It is therefore very worthwhile to mine it for maximal research value and also to make optimal use of further compute resource committed to the project.

Figures 6 and 7 show screen-dumps of our experiment planning tool. Figure 6 shows the summary of statistical information computed from the database’s parameter records. It is supremely useful for us to be able to define parameters, such as T (temperature of the model), with large ranges – shown as 4.000 to 5.000 – but only perform experiments on subsets of those ranges. In this illustration, the actual range of values of T for which data is stored in this database is 4.500 to 4.519 inclusive. This also means that when we create a cross-product of this parameter with other parameters in the model, we have the choice of being able to use either: (i) the complete range of the parameter; (ii) the range of actual values in the database for this parameter, or (iii) another range, which may overlap with the actual values in the database.
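The three range choices can be sketched as a small class that holds the declared range of a parameter alongside the values actually present in the metadata. The names `ParameterRange` and `Choice` are hypothetical. The check that a custom range lies inside the declared one is an assumption about sensible behaviour and is not something the text specifies.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: declared range versus values actually stored.
class ParameterRange {
    enum Choice { DECLARED, ACTUAL, CUSTOM }

    private final BigDecimal declaredMin;
    private final BigDecimal declaredMax;
    // Distinct stored values, ordered numerically (metadata only).
    private final TreeSet<BigDecimal> actual = new TreeSet<>();

    ParameterRange(String declaredMin, String declaredMax, List<String> storedValues) {
        this.declaredMin = new BigDecimal(declaredMin);
        this.declaredMax = new BigDecimal(declaredMax);
        for (String v : storedValues) {
            actual.add(new BigDecimal(v));
        }
    }

    int distinctCount() { return actual.size(); }

    // Returns {low, high} for the chosen kind of range.
    BigDecimal[] bounds(Choice choice, String customLo, String customHi) {
        switch (choice) {
            case DECLARED:
                return new BigDecimal[] { declaredMin, declaredMax };
            case ACTUAL:
                if (actual.isEmpty()) {
                    throw new IllegalStateException("no stored values");
                }
                return new BigDecimal[] { actual.first(), actual.last() };
            default:
                BigDecimal lo = new BigDecimal(customLo);
                BigDecimal hi = new BigDecimal(customHi);
                // Assumption: a custom range must stay inside the declared range.
                if (lo.compareTo(hi) > 0 || lo.compareTo(declaredMin) < 0
                        || hi.compareTo(declaredMax) > 0) {
                    throw new IllegalArgumentException("custom range outside declared range");
                }
                return new BigDecimal[] { lo, hi };
        }
    }

    public static void main(String[] args) {
        ParameterRange t = new ParameterRange("4.000", "5.000",
            Arrays.asList("4.500", "4.519", "4.510", "4.500"));
        System.out.println(Arrays.toString(t.bounds(Choice.DECLARED, null, null)));
        System.out.println(Arrays.toString(t.bounds(Choice.ACTUAL, null, null)));
        System.out.println(t.distinctCount());
    }
}
```

`BigDecimal` is used so that values such as 4.500 keep their decimal form exactly. The summary is computed only from the parameter values and never touches the measured data files.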

The statistics shown in Figure 6 are generated from the metadata and do not represent an analysis of the measurements from the raw data files to which the metadata is only an indirect guide. In experiments like this one there are various phase transitions involved so it is nontrivial a priori to estimate sensible parameter values for T and p. Indeed finding the phase transition values in these two parameters is one of the experiment’s goals. The planning tool is therefore valuable to guide progress: having carried out a preliminary experiment to scan coarsely in T × p space, a finer-grained scan can be carried out subsequently. Management of the prior data means that it is not wasted and that new measurements can be progressively interleaved with old ones.
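Interleaving a finer scan with earlier coarse measurements amounts to generating the fine grid and dropping any value already recorded. The sketch below illustrates this for a single parameter. The class name `ScanPlanner`, the step size and the coarse values are assumptions chosen for the example.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: plan a fine scan that reuses earlier measurements.
class ScanPlanner {
    // Values lo, lo+step, ... up to and including hi, minus those already measured.
    static List<BigDecimal> newValues(String lo, String hi, String step, Set<String> measured) {
        BigDecimal stepSize = new BigDecimal(step);
        if (stepSize.signum() <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        // TreeSet compares numerically, so "4.5" and "4.500" count as the same value.
        Set<BigDecimal> done = new TreeSet<>();
        for (String m : measured) {
            done.add(new BigDecimal(m));
        }
        List<BigDecimal> todo = new ArrayList<>();
        BigDecimal end = new BigDecimal(hi);
        for (BigDecimal v = new BigDecimal(lo); v.compareTo(end) <= 0; v = v.add(stepSize)) {
            if (!done.contains(v)) {
                todo.add(v);
            }
        }
        return todo;
    }

    public static void main(String[] args) {
        // Assumed coarse scan of T already held in the database.
        Set<String> coarse = new HashSet<>(Arrays.asList("4.4", "4.5", "4.6"));
        System.out.println(newValues("4.500", "4.505", "0.001", coarse));
    }
}
```

Decimal arithmetic avoids the rounding drift that repeated floating-point addition of a step such as 0.001 would introduce, which matters when grid values are later matched against stored strings.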

Figure 7 shows the output of the program when the user chooses two parameters to display as horizontal and vertical axes. The major parameters of the experiment under consideration are d = {3} × L = {40, 44, 48} × T = {4.500 to 4.519} × p = {0.0 to 0.1 in log steps}. The user has selected to view T × p for this experiment. Each cell in the virtual spreadsheet represents a hyper-block of the remaining parameters’ values. The tool can be adjusted to show various summary information for each cell. In the screen-dump it shows a count of the number of records in the database corresponding to the particular value of T × p. Other options include: a colour highlight for missing data; an estimate of the resource time used so far/required to fill in “holes”; the number of elements in the remaining cross-product component.
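The per-cell record count can be sketched as a grouping of metadata records by the two displayed parameters. Here each record is assumed to be a map from parameter name to value string. The class name `CellCounts` and the sample records are illustrative, and in the database the same summary could presumably be obtained with a grouped count query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count metadata records per (row, column) cell.
class CellCounts {
    // Convenience constructor for an example record with parameters T, p and L.
    static Map<String, String> record(String t, String p, String l) {
        Map<String, String> r = new HashMap<>();
        r.put("T", t);
        r.put("p", p);
        r.put("L", l);
        return r;
    }

    // Result: row value -> (column value -> number of records in that cell).
    static Map<String, Map<String, Integer>> count(
            List<Map<String, String>> records, String rowParam, String colParam) {
        Map<String, Map<String, Integer>> table = new TreeMap<>();
        for (Map<String, String> r : records) {
            String row = r.get(rowParam);
            String col = r.get(colParam);
            if (row == null || col == null) {
                continue;   // record does not carry both displayed parameters
            }
            table.computeIfAbsent(row, k -> new TreeMap<String, Integer>())
                 .merge(col, 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = new ArrayList<>();
        records.add(record("4.500", "0.001", "40"));
        records.add(record("4.500", "0.001", "44"));
        records.add(record("4.500", "0.001", "48"));
        records.add(record("4.501", "0.001", "40"));   // two L values missing here
        System.out.println(count(records, "T", "p"));
    }
}
```

The `TreeMap` orders keys as strings, which coincides with numeric order only when values share a fixed-width format such as 4.500 and 4.501.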


Figure 7. A view from our tool showing a partial cross-product of parameters for the Ising model experiments. Rows are for parameter T and columns for parameter p. The value in each cell is the number of distinct records in the database corresponding to the particular values of T and p. Note that some cell values are lower than others, cf. the highlighted cell, meaning that there are fewer records pertaining to those parameters. This could be caused by fewer data points being investigated in that parameter space or failed simulation runs.

Figure 7 shows some cells with a much smaller count value than the majority. These represent holes in the data. This view is effectively an automatically generated version of the phenomenon shown in Figure 2. The available data is sparse over this parameter cross-product range – either deliberately or by accident – which can occur if supercomputer job runs fail. Our management tool therefore helps us extract value from what may be an imperfect, incomplete set of runs, without the difficulties of hand-editing analysis job scripts.
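Finding the holes in a planned sub-range reduces to a set difference between the combinations that were planned and those present in the metadata. The sketch below assumes the planned cut is small enough to enumerate. The class name `HoleFinder` and the example values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: list planned parameter combinations with no record.
class HoleFinder {
    static List<List<String>> missing(List<List<String>> axes, Set<List<String>> existing) {
        List<List<String>> holes = new ArrayList<>();
        walk(axes, 0, new ArrayList<String>(), existing, holes);
        return holes;
    }

    // Depth-first walk over the planned cut, one axis per recursion level.
    private static void walk(List<List<String>> axes, int depth, List<String> prefix,
                             Set<List<String>> existing, List<List<String>> holes) {
        if (depth == axes.size()) {
            if (!existing.contains(prefix)) {
                holes.add(new ArrayList<>(prefix));   // copy: prefix is reused
            }
            return;
        }
        for (String value : axes.get(depth)) {
            prefix.add(value);
            walk(axes, depth + 1, prefix, existing, holes);
            prefix.remove(prefix.size() - 1);
        }
    }

    public static void main(String[] args) {
        List<List<String>> axes = new ArrayList<>();
        axes.add(Arrays.asList("4.500", "4.501"));   // T
        axes.add(Arrays.asList("40", "44"));         // L
        Set<List<String>> existing = new HashSet<>();
        existing.add(Arrays.asList("4.500", "40"));
        existing.add(Arrays.asList("4.500", "44"));
        existing.add(Arrays.asList("4.501", "40"));
        System.out.println(missing(axes, existing));
    }
}
```

The resulting list could feed a job-generation script directly, which avoids hand-editing scripts to re-run only the failed or skipped combinations.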

7. Types and Associated Issues

Our tool as described in Section 6 was developed specifically to address our Ising experiments. There are some interesting issues concerned with generalising it for other experiments without having to recode it entirely. These are concerned with data typing and introspection issues.

We envisage the general case in which an experiment is designed with some number N_p of parameters. Each of these parameters P_i, i = 1, 2, ..., N_p may have specific type information. As we discuss in Section 5 it is convenient to use a fixed-length string as the storage container for manipulating both integer and floating point data in our database and inside the management tool. For most of the sort of numerical simulation work we envisage the two simple data types “double” and “int” are sufficient, but even the fact that we must distinguish between these two poses a problem. The problem of presenting a data model that is suitable for large-scale scientific data is also discussed in [27].

The JTable widget from the Java Swing library utilises a generalised object model for cells, and it is a matter for the application developer to pack and unpack objects into the actual data types used by the program. We are considering a flat list of well known (simple) data types that can be specified when the user designs an experiment. Objects which are ferried around as fixed-length strings are then effectively introspected and treated as their appropriate simple type. This model is sufficient for this sort of tool where we do not concern ourselves with compound types such as arrays, lists or collections of the simpler types.
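Packing values into fixed-length strings and unpacking them by declared simple type can be sketched as follows. The field width of 16 characters, the class name `TypedValue` and the enum `SimpleType` are assumptions made for illustration, since the text does not state the width used.

```java
// Hypothetical sketch: fixed-length string storage with a flat list of simple types.
class TypedValue {
    enum SimpleType { INT, DOUBLE }

    static final int WIDTH = 16;   // assumed field width, not from the text

    // Pack any simple value into a left-justified, space-padded field.
    static String pack(Object value) {
        String s = String.valueOf(value);
        if (s.length() > WIDTH) {
            throw new IllegalArgumentException("value too long for field: " + s);
        }
        return String.format("%-" + WIDTH + "s", s);
    }

    // Unpack a field according to the type declared when the experiment was designed.
    static Object unpack(String field, SimpleType type) {
        String s = field.trim();
        switch (type) {
            case INT:
                return Integer.valueOf(s);
            case DOUBLE:
                return Double.valueOf(s);
            default:
                throw new AssertionError(type);
        }
    }

    public static void main(String[] args) {
        String field = pack("4.500");
        System.out.println("[" + field + "]");
        System.out.println(unpack(field, SimpleType.DOUBLE));
        System.out.println(unpack(pack(40), SimpleType.INT));
    }
}
```

Unpacking "4.500" as an INT throws a `NumberFormatException`, which is one concrete form of the problem of having to distinguish the two types: the string container alone does not say how its contents should be read.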


Our design strategy is that the tool itself copes with compound types through its parameter cross-products’ data structures. It is not trivial to see how to tackle what would otherwise become a combinatorial explosion of possible (compound) data types which would need to be supported.

Imposing constraints on the parameters presents similar type-based issues. For example some of our Ising parameters are unconstrained doubles, some are constrained to be positive only. Some integers such as our “dimension” parameter are constrained to small positive values. For the sorts of experiment we envisage a short flat list of applicable constraints is manageable. These can be hard-coded and enabled appropriately. We are considering how a generalised “super-type” object could also contain constraint information. At present our tool copes with this issue by using what are effectively enumerated types in the form of explicit lists of allowable values for each parameter. This is feasible in the context of a particular experiment such as the Ising model where although we would ideally like to be able to sample a large range of double precision parameter values, in practice we are limited by computational and storage feasibility to relatively short lists (around 100 members at most).
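Treating a constraint as an explicit list of allowable values can be sketched with a set lookup. The class name `EnumeratedParameter` and the allowed values for the dimension parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a parameter constrained to an explicit list of values.
class EnumeratedParameter {
    private final String name;
    private final Set<String> allowed;   // insertion order preserved for display

    EnumeratedParameter(String name, List<String> allowedValues) {
        this.name = name;
        this.allowed = new LinkedHashSet<>(allowedValues);
    }

    boolean accepts(String value) {
        return allowed.contains(value);
    }

    // Returns the value unchanged if allowed, otherwise rejects it.
    String require(String value) {
        if (!accepts(value)) {
            throw new IllegalArgumentException(
                name + " must be one of " + allowed + " but was " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        // Assumed allowed values for a small positive integer parameter.
        EnumeratedParameter d = new EnumeratedParameter("d", Arrays.asList("2", "3", "4"));
        System.out.println(d.accepts("3"));
        System.out.println(d.accepts("-1"));
    }
}
```

Membership is tested on the stored strings, so "4.50" and "4.500" would be treated as different values. With lists of around 100 members this lookup is cheap, and the same list can double as the parameter's axis in a cross-product.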

8. Scientific Data on the Grid

Understandably there has been much interest in managing large data sets across Grid-sized clusters which may be spread across multiple administrative domains, for example in the application domains described in [20]. While we do not claim to have solved all the applicable issues in this paper, we have made some inroads into the desirable properties of such a scientific data management system and produced a simple prototype. Articles such as [2, 16] discuss the real need for such data management tools and frameworks across the Grid [5]. Reference [23] discusses the need for, and possible design paradigms of, storage resource managers on computational Grids and data Grids, but no tools for actually manipulating the large amounts of data are suggested.

Many individual Grid projects feature a module dedicated to the management of scientific data, for example the EU Data Grid [26] and the NERC DataGrid [12]. The Spitfire module in the EU Data Grid, based on Globus services, allows Grid-enabled access to any database, which can be used to search for data and metadata. A component of the NERC DataGrid’s Delivery Service allows users to select web pages via a web page interface.

The LOFAR/LOIS project [22] features a high-performance Grid database manager that utilises massively-replicated object-oriented databases across a Grid system to achieve fault tolerance and also high performance. It is designed to allow manipulation of large-scale radar and physics data across the Swedish Grid testbed. While the technologies described in that paper are of considerable importance, we note the lack of tools to allow users to navigate through the scientific data contained in the machines on the Grid.

The CLADE project [17] uses scientific content management services developed within the US Department of Energy called Scientific Annotation Middleware (SAM) [18], which provides the capability to store, retrieve and search data and associated metadata across a distributed environment. The CLADE project has been designed as a general framework to enable datasets from different applications to be manipulated using a standard set of web services. Like many projects, it does not provide the facilities for visualising large hyper-bricks of scientific information and steering the associated computations.

Other packages, for example SimTracker [14], allow simulations to be tracked from submission to completion, including visualisations of the computed data, but again do not allow large hyper-bricks of software to be manipulated en masse.

The closest project to the work reported here is the Virtual Instrument [4], which has been used for a computational biology application, MCell. The major difference between the Virtual Instrument work and our work is the ability for our prototype to make use of sparse data sets and parameter space matrices in order to reduce the possibly exponential number of different values in a parameter cross-product.

We wish to emphasise that the purpose of this work is not to completely re-invent a Grid framework that would be used to manage long-running simulation systems; we propose a layered approach that can be used in conjunction with existing Grid systems and off-the-shelf technologies such as scripting shells and database managers to automate and fail-safe the required tasks. For example, there are several existing and planned components of the Globus toolkit [28] that one could quite readily use to aid the management of complex numerical simulations, such as the Globus Resource Allocation and Management (GRAM) service and the Monitoring and Discovery System (MDS4).


9. Summary and Conclusions

We have identified a common operational pattern for scientific experiments, one that is particularly common for numerical simulation experiments. We have described some of the problems facing a computational scientist managing “runs”, their measurements, and the resulting configuration files. We have described how ad-hoc solutions can be augmented using commonly available public domain software tools, and how a “Computational Laboratory Information System” can be based around a database. Our prototype and the ideas arising from it can be usefully applied to situations where large amounts of data are generated and must be curated. We believe the tools we describe here can be used to construct a system that is capable of coping with quite large repositories, but which is also open enough that distributed components can be readily added to cope with collaborative Grid environments.

We have described the sparse data structure that arises from partial cross-products of parameters into a simulation, when the scientist does not want to explore the full parameter space. We have shown that this need not be an obstacle to a simulation management system, and that gaps in the data can be handled. We are presently extending our prototype to include some simple data-mining utilities that will be compatible with the data management system.

Some general issues have arisen from this work – specifically those concerning practical approaches to type space management and sub-type/enumerated type constraint management. We believe our approach and the technological solution we describe may be of use to other researchers trying to manage complex numerical simulations.

Acknowledgements

We thank Massey University and the Allan Wilson Centre for use of “Helix” supercomputer cluster time for the Ising simulation work reported in this paper.

References

1. D. Abramson, R. Sosic, J. Giddy and B. Hall, “Nimrod: A Tool for Performing Parametised Simulations Using Distributed Workstations”, in Proc. 4th IEEE Symposium on High Performance Distributed Computing, Virginia, August 1995.

2. G. Allen, E. Seidel and J. Shalf, “Scientific Computing on the Grid”, Byte Magazine, pp. 24–32, Spring 2002.

3. Blaze Systems Corporation, “BlazeLIMS Laboratory Information Management System”, available from http://www.blazesystems.com. Last visited November 2004.

4. H. Casanova, T. Bartol, F. Berman, A. Brinbaum, J. Dongarra, M. Ellisman, M. Faerman, E. Gockay, M. Miller, G. Obertelli, S. Pomerantz, S. Sejnowski, J. Stiles and R. Wolski, “The Virtual Instrument: Support for Grid-enabled Scientific Simulations”, Technical Report CS2002-0707, May 2002.

5. I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, 2nd edn. Morgan Kaufmann, 2003.

6. K.A. Hawick, P.D. Coddington and H.A. James, “Distributed Frameworks and Parallel Algorithms for Processing Large-Scale Geographic Data”, Parallel Comput., Vol. 10, p. 1297, 2003.

7. K.A. Hawick and H.A. James, “Ising Model Scaling Behaviour on Small-World Networks”, Technical Note CSTN-006, March 2004, available from http://www.massey.ac.nz/~kahawick/cstn

8. K.A. Hawick and H.A. James, “Small-World Effects in Wireless Sensor Networks”, Technical Note CSTN-001, March 2004, available from http://www.massey.ac.nz/~kahawick/cstn

9. H.A. James, C.J. Scogings and K.A. Hawick, “A Framework and Simulation Engine for Studying Artificial Life”, Res. Lett. in the Information and Mathematical Sciences, Vol. 6, May 2004.

10. Joint Astronomy Center, “Intelligent Agents and Robotic Telescopes to Help Astronomers Keep up with the Universe”, 14 October 2003, available from http://outreach.jach.hawaii.edu/pressroom/2003-estar/

11. LabVantage Solutions, Inc., “Sapphire Laboratory Information Management System”, available from http://www.labvantage.com. Last visited November 2004.

12. B.N. Lawrence, R. Cramer, M. Gutierrez, K. Kleese van Dam, S. Kondapalli, S. Latham, R. Lowry, K. O’Neill and A. Woolf, “The NERC DataGrid Prototype”, in S.J. Cox (ed.), Proc. U.K. e-Science All Hands Meeting, 2003.

13. LIMSource, “LIMSource: LIMS Resource on the Internet”, available from http://www.limsource.com. Last visited November 2004.

14. J. Long, P. Spencer and R. Springmeyer, “Simtracker – Using the Web to Track Computer Simulation Results”, in Proc. 1999 International Conference on Web-Based Modeling and Simulation, San Francisco, CA. Proceedings available as Simulation Series, Vol. 31, No. 3, from The Society for Computer Simulation.

15. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, “Equation of State Calculations by Fast Computing Machines”, J. Chem. Phys., Vol. 21, No. 6, pp. 1087–1092, June 1953.

16. R.W. Moore, Data Management Services, updated version of Data Management Systems for Scientific Applications, The Architecture of Scientific Software, Academic Publishers, 2001.

17. J.D. Myers, T.C. Allison, S. Bittner, B. Didier, M. Frenklach, W.H. Green, Jr., Y.-L. Ho, J. Hewson, W. Koegler, C. Lansing, D. Leahy, M. Lee, R. McCoy, M. Minkoff, S. Nijsure, G. von Laszewski, D. Montoya, C. Pancerella, R. Pinzon, W. Pitz, L.A. Rahn, B. Ruscic, K. Schuchardt, E. Stephan, A. Wagner, T. Windus and C. Yang, “A Collaborative Informatics Infrastructure for Multi-scale Science”, in Proc. Challenges of Large Applications in Distributed Environments (CLADE) Workshop, Honolulu, HI, 7 June 2004, pp. 24–33.


18. J.D. Myers, A. Chappell, M. Elder, A. Geist and J. Schwidder, “Re-Integrating the Research Record”, IEEE Computing in Science and Engineering, Vol. 5, No. 3, pp. 44–50, 2003.

19. MySQL, MySQL Database homepage, available from http://www.mysql.com. Last visited July 2004.

20. Particle Physics Data Grid (PPDG) Website, http://www.ppdg.net/, The Earth System Grid (ESG) Website, http://www.earthsystemgrid.org/, The National Fusion Grid Website, http://www.fusiongrid.org/projects/, The Collaboratory for Multi-scale Chemical Science Website, http://cmcs.org/. Last visited November 2004.

21. C.J. Patten, F.A. Vaughan, K.A. Hawick and A.L. Brown, “DWorFS: File System Support for Legacy Applications in DISCWorld”, in Proc. 5th IDEA Workshop, Fremantle, February 1998.

22. T. Risch, M. Koparanova and B. Thidé, “High-performance GRID Database Manager for Scientific Data”, in Proc. 4th Distributed Data and Structures, WDAS’02, Carleton Scientific: Paris, France, pp. 99–106, 2002.

23. A. Shoshani, A. Sim and J. Gu, “Storage Resource Managers: Middleware Components for Grid Storage”, in Proc. 19th IEEE Symposium on Mass Storage Systems (MSS’02), 2002.

24. STARLIMS Corporation, “STARLIMS Laboratory Information Management System”, available from http://www.starlims.com. Last visited November 2004.

25. Sun Microsystems, Inc., “Java Foundation Classes (JFC/Swing) Web Page”, available from http://java.sun.com/products/jfc/index.jsp. Last visited November 2004.

26. The European DataGrid Project Team, “The DataGrid Project”, available from http://www.eu-datagrid.org. Last visited November 2004.

27. L.A. Treinish, “Scientific Data Models for Large-Scale Applications”, IBM Technical Report, available from http://www.research.ibm.com/people/l/lloydt/dm/. Last visited November 2004.

28. University of Chicago, “Globus Toolkit”, available from http://www.globus.org. Last visited January 2005.

29. K.-Y. Whang and R. Krishnamurthy, “The Multilevel Grid File – A Dynamic Hierarchical Multidimensional File Structure”, in Proc. 2nd International Symposium on Database Systems for Advanced Applications, Advanced Database Research and Development Series, Vol. 2, pp. 449–459, 1991.

30. World Wide Web Consortium (W3C), “Metadata at W3C”, available from http://www.w3.org/Metadata/. Last visited July 2004.