Scientific Data Management in a Grid Environment



    Journal of Grid Computing (2005) 3: 39–51. Springer 2005. DOI: 10.1007/s10723-005-5464-y

    Scientific Data Management in a Grid Environment

    H.A. James and K.A. Hawick
    Institute of Information and Mathematical Sciences, Massey University, Albany, North Shore 102-904, Auckland, New Zealand
    E-mail: {h.a.james,k.a.hawick}

    Key words: data management, data mining, Grid systems, metadata, parameter cross-products


    Abstract

    Managing scientific data is by no means a trivial task even in a single-site environment with a small number of researchers involved. We discuss some issues concerned with posing well-specified experiments in terms of parameters or instrument settings and the metadata framework that arises from doing so. We are particularly interested in parallel computer simulation experiments, where very large quantities of warehouse-able data are involved, run in a multi-site Grid environment. We consider SQL databases and other framework technologies for manipulating experimental data. Our framework manages the outputs from parallel runs that arise from large cross-products of parameter combinations. Considerable useful experiment planning and analysis can be done with the sparse metadata without fully expanding the parameter cross-products. Extra value can be obtained from simulation output that can subsequently be data-mined. We have particular interests in running large-scale Monte Carlo physics model simulations. Finding ourselves overwhelmed by the problems of managing data and compute resources, we have built a prototype tool using Java and MySQL that addresses these issues. We use this example to discuss type-space management and other fundamental ideas for implementing a laboratory information management system.

    1. Introduction

    A common modus operandi for computational scientists running numerical simulations is shown in Figure 1. A numerical model for the phenomena under study is constructed. The model is initialised and is spun-up into a realistic, or at least representative, state, whereupon measurements can be taken. Depending upon the model involved, measurements are made from static configurations, which may be stored separately, or measurements are made as part of the evolutionary process of taking the model configuration from one state to another. These configurations can usefully be warehoused for later mining, as discussed in, for example, [29].

    Some important examples include numerical models for weather and climate study [6], where a set of model variables such as atmospheric temperature, pressure and wind velocity are time-evolved from one configuration to the next, to predict how real weather systems will develop. Climate study is similar except that the time scales simulated are much longer and the model granularity generally coarser. Other models in computational physics and engineering studies fall into this general pattern of operation. Some examples we consider in this paper (Section 3) are Monte Carlo lattice models [7]; stochastic network models [8]; and artificial life growth models [9].

    The field of Laboratory Information Management Systems (LIMS) is quite mature in the chemical, pharmaceutical and life sciences. For example, there exist web sites such as LIMSource [13] that provide information on LIMS products such as BlazeLIMS [3], Sapphire [11] and STARLIMS [24] for scientists and IT managers. Aside from being targeted towards the wet sciences and not at academic simulations management, these products do not seem to be Grid-aware applications.

    In running models that have even a few separate parameters it is necessary to manage the range and combinations of parameters. Sometimes the (computational) cost of running a model is small and it is


    Figure 1. Flow (from left to right) of archive-able data from simulation runs in a common pattern for numerical experiments. Many long-running simulation systems are designed either to repeatedly perform a limited number of simulation steps, saving the output before starting again, or to save their state in a checkpoint format that can be used to re-start a failed program.

    feasible to throw away the configuration outputs and just preserve the few measurements that are made during the run. Sometimes, however, it is either too expensive to justify re-running models with the same parameters, or in some cases it is important for legal or other operational reasons to keep all model output in an archive. Ideally the working computational scientist would like to afford the storage capacity to preserve the output from all past runs for possible future further analysis or for bootstrapping new model runs. There are tradeoff costs for this storage that must be weighed against the computational cost of regenerating data from runs. However, above the physical cost of storage media are total-cost-of-archiving issues that need to be considered more deeply.

    We explore one important complexity contribution to the total cost of archiving. Managing data generated by codes that are continually evolving, in a way which is forward-compatible, is nontrivial. The output formats are likely to differ slightly as the codes evolve. One step towards this is to consider the cross-products of all possible parameter values that could be used and to explore the implications of labelling experimental run outputs by these parameter values. Another step is to use a robust textual tagging of output values that offers some defence against changing output formats through the use of generalised data extraction utilities and scripts.
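    The cross-product and labelling ideas above can be sketched in a few lines. The parameter names and ranges below are illustrative assumptions for the sketch, not those of the actual experiments:

```python
# Enumerate every combination of two parameter ranges, then derive a
# label for a run output from its parameter combination. The ranges
# here are invented for illustration.
import itertools

temperatures = [4.0, 4.5, 5.0]       # hypothetical T values
shortcut_probs = [0.0, 0.01, 0.1]    # hypothetical p values

cross_product = list(itertools.product(temperatures, shortcut_probs))
print(len(cross_product))            # 3 * 3 = 9 combinations

# Label a run output by its parameter values:
T, p = cross_product[0]
label = f"T{T:.3f}_p{p:.3f}"
print(label)                         # -> T4.000_p0.000
```

    In practice only a sparse subset of such a cross-product is ever run, which is why the metadata can remain compact.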

    Consider a simulation with just two parameters, as shown in Figure 2. We can imagine that the computational scientist steering this simulation does not systematically explore all of the possible parameter space, but rather poses some preliminary experiments that explore combinations of parameter values spanning an area of parameter space he expects to have interesting properties. Parameter one is represented by rows, and parameter two by columns. As represented by dots in the figure, the scientist has carried out runs over the parts of parameter space shown and subsequently wishes to keep track of the runs obtained. Run data may be reused later, and to have archive value it must be easily retrievable. An important point is that the number of parameter combinations explored in an experiment may be far fewer than the number that would be explored by a brute-force approach.

    It has been our own experience that scientists often use some ad-hoc approach to keeping track of model output data, often involving file or directory names, perhaps with some README files to describe what experiments have been done. This is analogous to the online lab notebook: easy to use for very small sets of runs, but rapidly becoming hard to manage for even just a three-parameter experiment. A common situation when running our stochastic simulations is that we rapidly accumulate new parameters as we develop the model. Two main ones accrue from the core model


    Figure 2. Sparse cross-product of a two-parameter system. A simple two-parameter system where various experiments have been carried out, yielding non-null entries in the database or matrix of possible values. The extreme values of the parameters and the discretisation scheme have set the bounds and size of the matrix table. The outer product need not (and likely will not) be fully populated. The dots represent those known (or explored) parameter combinations, which are likely to number far fewer than the potential number of possible parameter combinations. In this example P1 and P2 are just integers mapping directly to their indices.

    itself, a further one is the sample number if we are averaging over many different stochastic simulation sequences, and a fourth is the random number generator seed if we wish to keep track of separate repeatable stochastic sequences. It is common to rely on metadata [30] in the form of long file names or sometimes in terms of headers stored in the configuration files themselves. These can, however, be opaque to the browsing scientist planning a follow-on experiment.
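    The long-file-name form of metadata can be made machine-readable with a small parser. The naming scheme sketched here (single-letter parameter codes joined by underscores) is an assumption for illustration, not the authors' actual convention:

```python
# Hypothetical sketch: recover a parameter dictionary from a long,
# metadata-bearing data file name such as those described above.

def parse_run_filename(name):
    """Turn e.g. 'ising_d2_L64_T4.350_p0.01_s12345.dat' into a dict."""
    stem = name.rsplit(".", 1)[0]    # strip the file extension
    parts = stem.split("_")[1:]      # drop the model-name prefix
    params = {}
    for part in parts:
        key = part[0]                # single-letter parameter code
        params[key] = part[1:]       # remainder is the value
    return params

params = parse_run_filename("ising_d2_L64_T4.350_p0.01_s12345.dat")
print(params)   # -> {'d': '2', 'L': '64', 'T': '4.350', 'p': '0.01', 's': '12345'}
```

    Even so, such names remain opaque to browse at scale, which motivates the relational approach below.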

    In this paper we explore how relational database technology can be augmented with some simple conventions and easily produced tools to help manage more complex batches of runs and archives of data.
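    One such convention is to make the "key" parameters of a run (cf. Table 1) the composite primary key of a run table, so each parameter combination can be stored at most once. The sketch below uses Python's built-in sqlite3 as a stand-in for the MySQL database the prototype actually uses; the table and column names are illustrative assumptions:

```python
# Minimal sketch of a run-tracking table keyed on the parameter tuple.
# sqlite3 substitutes for MySQL here purely for self-containment.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE runs (
        d INTEGER, L INTEGER, T REAL, p REAL,   -- key parameters
        steps INTEGER, seed INTEGER,            -- bookkeeping parameters
        datafile TEXT,                          -- where the output lives
        PRIMARY KEY (d, L, T, p)
    )
""")
db.execute("INSERT INTO runs VALUES (2, 64, 4.35, 0.01, 500, 12345, 'run0001.dat')")

# "What have I run already?" for a given dimensionality and lattice size:
rows = db.execute("SELECT T, p FROM runs WHERE d = 2 AND L = 64").fetchall()
print(rows)   # -> [(4.35, 0.01)]
```

    The primary-key constraint also guards against accidentally double-registering the same parameter combination.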

    2. A Virtual Spreadsheet

    Consider the parameters available in a multi-dimensional virtual spreadsheet. We have a hyper-space of parameter values that is sparse, as not all combinations of parameters are necessarily deemed worth running, and of course the actual values used over a range will be limited. We want the scientist to be able to pose the questions: "What have I run already?" and "What can I now run to pose a new research question, making best use of my existing data sets?" We want the housekeeping operations for keeping track of the data to be as automated as possible, while still being compatible with the simulation programs already in use. We also want new programs to be able to easily access previous data.

    One classic approach to these problems is to write simulation programs that log all sorts of extra information and measurements as well as the parameters actually used. The log files can then be scanned for relevant information, and often new questions can be posed using old data that was generated before the question had been thought of. We have often adopted conventions for logging data to support this. For instance, a textual tag is invented to describe the numerical measurement in question and the output value is prefixed with this tag in the log file. Standard Unix tools such as grep, cut and paste and other text manipulation programs or languages such as perl or python can readily be combined into scripts to extract relevant values from an archive of log files, in a form suitable for plotting, for example. Another common technique is to encode some parameter values in the filenames or in the names of the directories. This can aid casual browsing and experiment planning up to a point, but rapidly becomes cumbersome when many values are involved. How can the scientist view and visualise the sparse hyper-space of parameters, and hence assess existing data availability and plan experiments?
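    The tag-prefixed logging convention can be sketched as follows; the tag names and log content are invented for illustration, and the same extraction could equally be done with grep and cut:

```python
# Hypothetical sketch of the tagging convention: each measurement line
# begins with an invented textual tag, so a generic extractor keeps
# working even when other parts of the output format drift.

def extract_tagged(lines, tag):
    """Return all numeric values whose log line starts with the given tag."""
    values = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == tag:
            values.append(float(fields[1]))
    return values

log = [
    "# run: ising d=2 L=64",
    "ENERGY -1.4231",
    "MAG 0.8120",
    "ENERGY -1.4307",
]
print(extract_tagged(log, "ENERGY"))   # -> [-1.4231, -1.4307]
```

    Because the extractor keys only on the leading tag, additional measurements can be added to the logs later without breaking older scripts.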

    Imagine a virtual spreadsheet that supports looking at any two axes from the hyper-dimensional data set. We would like the virtual spreadsheet tool to allow specification of these axes and to cope with what is a sparse set of data that may not be stored online or locally. The posed queries or high-level commands can, in principle, be organised by the virtual spreadsheet tool into the necessary data retrieval requests to be scheduled and the resulting plot assembled. Ideally the tool would have enough information to at least estimate how long satisfying the request will take, if it is not in interactive time.
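    The core projection operation of such a virtual spreadsheet can be sketched simply: store runs sparsely, keyed by their full parameter tuple, and collapse the hyper-space onto any two chosen axes. The parameter names and data below are illustrative assumptions:

```python
# Hedged sketch of the virtual-spreadsheet projection. Sparse store:
# (d, L, T, p) -> output file; only combinations actually run appear.
runs = {
    (2, 64, 4.35, 0.00): "run0001.dat",
    (2, 64, 4.35, 0.01): "run0002.dat",
    (2, 128, 4.50, 0.01): "run0003.dat",
}

axes = ("d", "L", "T", "p")

def project(runs, axis_x, axis_y):
    """Collapse the sparse hyper-space onto two axes for display."""
    ix, iy = axes.index(axis_x), axes.index(axis_y)
    return {(key[ix], key[iy]) for key in runs}

print(sorted(project(runs, "T", "p")))
# -> [(4.35, 0.0), (4.35, 0.01), (4.5, 0.01)]
```

    Each non-empty cell of the projected view corresponds to at least one retrievable run, which is exactly what the scientist needs for planning follow-on experiments.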

    3. Application Examples

    A recent real experiment of interest to us involved the simulation of the Ising model under Small-World conditions, as described in [7, 8]. Small-World systems employ spatial shortcuts in the lattice or graph of sites. These typically cause dramatic changes in the behaviour of a model. The Small-World Ising system we are investigating displays a marked shift in its critical temperature when short-cuts are introduced, and we are systematically measuring the dependence of this shift on the short-cut introduction probability parameter. This whole experiment involves a careful and computationally demanding investigation over selected regions of the model parameter space. The model uses Markov-Chain Monte Carlo (MCMC) [15] sampling of a data space. There are four major parameters that we wish to track, as well as at least two extra parameters arising from the experiment that can be significant. These parameters are shown in Table 1. Some of the parameters combine to give other properties: for example, the total number of points in the system is calculated by the number of lattice points per dimension raised to the power of the dimensionality. This number can then be multiplied by the probability of Small-World shortcuts on the lattice to give the number of lattice points that need to be modified. Recording the random number seed ensures that any experiment we perform is reproducible, an important consideration in any scientific investigation.
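    The derived quantities described above amount to two lines of arithmetic; the particular values of L, d and p below are illustrative:

```python
# Worked example of the derived properties: the total number of sites N
# is the lattice points per dimension raised to the dimensionality, and
# N * p gives the expected number of Small-World shortcut sites.
L = 64        # lattice points per dimension (illustrative)
d = 2         # dimensionality (illustrative)
p = 0.01      # probability of a Small-World shortcut (illustrative)

N = L ** d
shortcuts = N * p
print(N, shortcuts)   # -> 4096 40.96
```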

    Our simulation was originally developed by choosing an initial temperature T and shortcut probability p, and then refining the values as the simulation produced results. When we had identified a promising (or interesting) area of the T × p × L (number of lattice sites) parameter space we performed a production run using the local supercomputer cluster. Each simulation took approximately 28 hours to complete 11 million update steps. By the end of the study we had produced approximately 0.7 TB of data. The problem was organising it and searching through it in an efficient manner. Our first approach was to write a series of Unix shell scripts. We used long-named data files, including most of the relevant parameters as either part of the file name or at least as comment-style metadata within the file. The scripts were a good ad-hoc solution: they enabled us to do basic searching and sorting on the data, and allowed us to execute our custom-written analysis programs on the data files. However, they were not good at helping us identify holes in the data. It is particularly difficult to develop statistical analysis scripts that can cope with gaps in the data space.
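    Finding such holes is straightforward once the planned and completed parameter combinations are both enumerable: the holes are simply the set difference. The parameter values below are illustrative assumptions:

```python
# Sketch of hole-finding: compare the planned parameter cross-product
# against the combinations actually completed.
import itertools

planned = set(itertools.product([4.0, 4.5, 5.0],   # temperatures T
                                [0.0, 0.01]))      # shortcut probs p
completed = {(4.0, 0.0), (4.5, 0.0), (4.5, 0.01), (5.0, 0.01)}

holes = sorted(planned - completed)
print(holes)   # -> [(4.0, 0.01), (5.0, 0.0)]
```

    With the ad-hoc shell-script approach this comparison is awkward, because the "planned" set exists only in the scientist's head; a database of intended runs makes it a single query.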

    Parallel supercomputers and clusters are often shared resources (Figure 3). Ours is no exception, being shared between computer scientists, computational chemists and computational biologists; it is not under our direct control. We observed that sometimes, due to queue failures or lack of scratch space, our simulations would either not start, crash upon startup, crash part-way through the simulation, or simply freeze and not make any progress. The latter condition was later identified as a transient hard-drive problem in a small

    Table 1. An enumeration of the relevant parameters in our Small-World Ising model simulations. The abbreviation used to refer to each parameter is given in parentheses after the parameter's description. Some parameters are marked as keys, signifying they are crucial quantities to keep track of; in practice we use these parameters to define the data table's primary key. For other parameters, such as the seed, the min, max and stride values have no real meaning, as the random number generator will use a random seed for each run.

    Parameter description                    Key?  Min    Max       Stride

    Dimensionality of study (d)              Yes   1      5         1
    No. lattice points per dimension (L)     Yes   1      1024      1
    Temperature of system (T)                Yes   4.000  5.000     10^-3
    Prob. of Small-World short-cut (p)       Yes   0      1         10^-9
    No. update steps in data file (steps)    No    1      99999999  1
    Random number generator seed (s)         No    0      99999999  1
    Update method (u)                        No    0      1         1


    Figure 3. Screen capture showing our Small-World Ising program input and outputs. Note this example shows the program run with 500 steps between saving states; ...

