


Grid resource management for data mining applications

Valentin Kravtsov a, Thomas Niessen b, Assaf Schuster a, Werner Dubitzky c,1, Vlado Stankovski d,∗

a Technion Israel Institute of Technology, Haifa, Israel
b Fraunhofer Institute for Intelligent Analysis and Information Systems, Bonn, Germany
c University of Ulster, Biomedical Sciences Research Institute, Coleraine, UK
d University of Ljubljana, Ljubljana, Slovenia

Abstract

Emerging data mining applications in science, engineering and other sectors increasingly exploit large and distributed data sources as well as computationally intensive algorithms. Adapting such applications to grid computing environments has implications for grid resource brokering. A grid resource broker supporting such applications needs to provide effective and efficient job scheduling, execution and monitoring. Furthermore, to be usable by domain-oriented end users and to be able to evolve gracefully with emerging grid technology, it should hide the underlying complexity of the grid from such users and be compliant with important grid standards and technology. The DataMiningGrid Resource Broker was designed to meet these requirements. This paper presents the DataMiningGrid Resource Broker and the results from evaluating it in a European-wide test bed.

Key words: Grid, resource broker, data mining, GridBus

∗ Corresponding author.
Email addresses: svali [email protected] (Valentin Kravtsov), [email protected] (Werner Dubitzky), [email protected] (Vlado Stankovski).
1 We acknowledge the cooperation of all DataMiningGrid partners and collaborators in the DataMiningGrid project. This work was supported largely by the European Commission FP6 grant DataMiningGrid (www.DataMiningGrid.org), Contract No. 004475.

Preprint submitted to Elsevier 17 November 2006


1 Introduction

Due to the increased computerization of many industrial, scientific, and public sectors, the growth of available digital data is proceeding at an unprecedented rate. The effective and efficient management and use of stored data, and in particular the transformation of these data into information and knowledge, is considered a key requirement for success in such domains. Data mining [1] (a.k.a. knowledge discovery in databases) is the de-facto technology addressing this information need. Until a few years ago, data mining was mainly concerned with small to moderately sized data sets within the context of largely homogeneous and localized computing environments. These assumptions are no longer met in modern scientific and industrial complex-problem solving environments, which rely more and more on the sharing of geographically dispersed computing resources. This shift to large-scale distributed computing has profound implications for the way data are analyzed. Future data mining applications will need to operate on massive data sets and against the backdrop of complex domain knowledge. The domain knowledge (computer-based and human-based), the data sets themselves, the programs for processing, analyzing, evaluating, and visualizing the data, and other relevant resources will increasingly reside at geographically distributed sites on heterogeneous infrastructures and platforms. Grid computing promises to become an essential technology capable of addressing the changing computing requirements of future distributed data mining environments.

As a result of these developments, grid-enabled and distributed data mining has become an active area of research and development in recent years [2–4]. This new area attempts to meet the requirements of emerging and future data mining applications by exploiting and sharing computational resources available in grid computing environments. Critical hardware and software resources to be shared in such applications include primary and secondary storage devices, processing units, data and data mining application software. Facilitating effective and efficient sharing of such resources across local and wide area computing networks in the context of modern data mining applications is not trivial. This is particularly the case in grid computing environments spanning multiple administrative domains and heterogeneous software and hardware platforms.

The EU-funded DataMiningGrid project [5] is a large-scale effort aimed at developing a generic system facilitating the development and deployment of grid-enabled data mining applications. Some key requirements of this effort include (1) end users should be able to use the system without needing to know details of the underlying grid technology, (2) developers should be able to grid-enable existing data mining applications with little or no intervention in existing application code, and (3) the system should adhere to existing and


emerging grid and grid-related standards. A critical element in such a system is a grid resource broker [6]. Essentially, a grid resource broker examines and keeps track of the resources and their capabilities in a grid computing environment, matches incoming application job requests and their requirements against those resources, assigns the jobs to available and suitable resources in the grid, and initiates and monitors the execution of the jobs. This work provides a detailed account of the DataMiningGrid Resource Broker that was developed for the DataMiningGrid system. The Resource Broker was designed with special attention to the requirements of data mining applications.

The remainder of the paper is organized as follows. Sections 2 and 3 provide the background of data mining and resource broker technology. Section 4 presents a detailed description of the requirements of the DataMiningGrid Resource Broker. This is followed by Section 5, which reviews related work. In Section 6 the DataMiningGrid system architecture is presented, and the Resource Broker's role and function within this architecture are explained together with details relating to the design and implementation of the Broker. The function of the Broker is further described and illustrated in Section 7. Section 8 presents some results obtained from evaluating the Resource Broker within the DataMiningGrid test bed. Finally, Section 9 concludes with a summary and some critical remarks.

2 Data mining

In its typical form, data mining can be viewed as the formulation, analysis, and implementation of an induction process (proceeding from the specific to the general) that facilitates the extraction of information (nontrivial, previously unknown, and potentially useful patterns) from data [1,7,9]. Data mining involves the use of software, sound methodology, and human creativity to achieve new insight through the exploration of data. The goal of data mining is to allow the end user (e.g., scientist, engineer, marketing, retail or finance expert) to improve his or her decision-making. Compared with classical statistical approaches, data mining is perhaps best seen as a process that encompasses a wider range of integrated methodologies and tools including databases, modelling, statistics, machine learning, knowledge-based techniques, uncertainty handling, and visualization. Data mining is a continuous and iterative process. Typically, such a process involves the broad steps of (1) problem and data understanding; (2) data pre-processing; (3) data analysis (this refers to algorithms that induce patterns and rules from a collection of observations); (4) result post-processing and interpretation; and (5) actions and decisions in the context of the domain questions to be addressed [8,9].


3 Grid resource brokering

In a grid, a resource broker is defined as an entity that provides the bridge between resource offers and resource requests (called jobs in grid terminology) [11]. Usually, resource brokers are developed to automate the following operations [11]:

• Matching resource offers to resource requests. Each job requires a set of resources, such as memory, disk space, etc., to be executed. In order to match these resource requirements of the job to available resources in a grid, resource offers and resource requests need to be expressed in a way that the resource broker can understand, so that it can carry out the matching.
• Job scheduling. When several resource requests and offers are available, different scheduling policies may be applied to match requests to offers. The scheduling policy is designed to maximize some predefined utility function. Such functions may, for example, minimize the job execution latency or maximize the benefits (e.g., monetary) of the resource providers.
• Staging-in of data and executables. Typically, prior to job execution, data and executables have to be staged in, i.e., made available on the execution machine(s) that have been selected by the resource broker. The resource broker orchestrates this process and normally needs to enforce an all-or-none stage-in policy. Once the stage-in is completed, the resource broker initiates job execution.
• Monitoring of the execution process. Normally, the status of a job changes during the course of its execution. A job might have reached a state in which it is pending, active, completed, failed, and so on. These changes need to be conveyed to the user so as to allow him or her to take appropriate actions (e.g., retry, abort).
• Results delivery (stage-out) to the user. In grid computing environments it is common that the execution machine is located outside of the administrative domain of the job submitter. Therefore, it is important that after the completion of the execution, the results are transferred to some persistent storage and that a clean-up on the execution machines is performed (including the removal of all staged-in files).

Data mining applications differ from applications performing other information management and processing tasks. They typically process large amounts of data while the code of the actual data mining algorithms is relatively small. This has led to the bundling of many stand-alone data mining applications into application suites, which share a common underlying internal representation of the data and a graphical user interface. Usually, the individual applications are executable in batch mode and can be parameterized at the start of execution [12]. However, the majority of specialized data mining algorithms developed by data mining researchers are not part of integrated application suites. The


grid-enabling of such programs poses a much wider range of varying requirements on a resource broker.

4 Requirements for a resource broker facilitating grid-enabled data mining

A data mining application consists of the application of data mining technology to data analytical tasks required within various application domains. Central elements of any data mining application are the data to be mined, the data mining algorithm(s) to be used, and a user who specifies and controls the data mining process. Grid-enabling data mining applications is motivated by the sharing of computational resources via local and wide area networks. Existing data mining applications may benefit from grid-enabled resource sharing with regard to improved effectiveness or efficiency or other benefits (e.g., novel use, wider access or better use of existing resources). Furthermore, grid-enabled resource sharing may actually facilitate novel data mining applications. Given the nature of data mining applications, four key computational resources to be shared can be identified:

1. Data. The data to be mined in the form of electronic databases, data files, documents, and so on;

2. Application programs. Data mining application programs providing the implementation of data mining algorithms used to mine the data;

3. Processors. Computing processor units providing the raw compute power for processing of the data; and

4. Storage. Data storage devices to physically store the input and output data of data mining applications.

A system whose main function is to facilitate the sharing of such resources within a grid environment supporting data mining applications should take into account the unique constraints and requirements of data mining applications with respect to data management, data mining software tools, and the users of these tools. To determine and specify the detailed requirements of the DataMiningGrid Resource Broker, we have defined a representative set of use cases. The use cases are based on real-world data mining applications from industry and science in the following areas: medicine, biology, bioinformatics, customer relationship management and car repair log diagnostics in the automotive industry, monitoring of network-based computer systems, and ecological modelling.

Following is a list of the identified requirements.


1. Fully automated resource aggregation. In order to execute data mining tasks efficiently, the resource broker needs to automatically discover all the available computational resources and filter out those that do not comply with the requirements associated with the data mining application.

2. Data-oriented scheduling. Data mining application programs process large volumes of data [13]. To facilitate efficient processing of these data, the DataMiningGrid Resource Broker needs to minimize the transfer of large amounts of data across the network. Additionally, the Resource Broker needs to support data mining applications where data cannot be moved for reasons other than large volume (e.g., security, privacy, legislative or regulatory reasons). This means that the Resource Broker should support a data mining process in which the data mining algorithm or program is shipped to the data.

3. Zero footprint on execution machines. The Resource Broker should be capable of handling the execution of any data mining algorithm that can be run from the command line. Therefore, the data mining program executables should not be required to be installed in advance on the execution machines. Instead, the Resource Broker should arrange for them to be transferred to the execution machine(s) together with the necessary libraries and data. Furthermore, at the end of the execution process the staged-in files need to be cleaned up, leaving the execution machine(s) with zero footprint after the execution.

4. Extensive monitoring. The Resource Broker should monitor and collect all job execution status information, from start of execution to completion (successful or unsuccessful), and allow the user to monitor this information while the execution is in process. This information should include items like error logs, execution machine address, and so on.

5. Parameter sweep support. Many data mining algorithms require a repeated execution of the same process with different parameters (controlling the behaviour of the algorithm) or different input data sets. This is typically required in optimization tasks or sensitivity analysis tasks. Therefore, the Resource Broker should provide mechanisms that support the automated instantiation of the variables given a set of instantiation rules.

6. Interoperability. Since a grid is defined as a collection of heterogeneous and distributed resources, it is mandatory that the DataMiningGrid Resource Broker provides a flexible framework facilitating execution on a wide range of execution machines. Furthermore, this functionality should be realized in such a way that the user or application developer is not required to provide any wrappers or other low-level constructs. If the executable's constraints allow it to run on a certain type of machine, the Resource Broker should be able to launch the executable on such machines.


7. Adherence to interoperability standards. Even in the fast-changing area of grid technology, adherence to interoperability standards and other standards used by large parts of the community is crucial for building future-proof systems. Relevant standards include the Open Grid Services Architecture (OGSA) [14] and the Web Services Resource Framework (WSRF) [15]. OGSA is a distributed interaction and computing architecture based on the concept of a grid computing service, assuring interoperability on heterogeneous systems so that different types of resources can communicate and share information. WSRF is aimed at defining a generic framework for modelling and accessing persistent resources using Web services so that the definition and implementation of a service and the integration and management of multiple services are made easier [38]. WSRF narrowed the meaning of grid services to those services that conform to the WSRF specification [39], although a broader, more natural meaning of a grid service is still in use.

8. Adherence to security standards. As the DataMiningGrid Resource Broker is designed to execute a wide variety of data mining applications, flexible security mechanisms need to be introduced to provide different levels of security for different types of users. Since data mining may be performed on data with severe security and privacy constraints, it is necessary that the Resource Broker facilitates the enforcement of secure and encrypted data transport and includes authentication and authorization mechanisms. In particular, it should follow the standards of the public key infrastructure based on X.509-compliant certificates [16], the WS-I Basic Security Profile 1.0 [17], WS-Security [18], and WS-SecureConversation [19].

9. User friendliness. Data mining application end users and data mining experts, while being experts in their own field, cannot be assumed to be experts in grid technology. Thus, as many grid-related details and details about the Resource Broker as possible should be hidden from these users.
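Requirement 5 above, parameter sweep support, amounts to expanding a set of instantiation rules into one concrete job per parameter combination. The sketch below illustrates this expansion; the rule format and all names are invented for illustration and are not the DataMiningGrid job description format.

```python
# Hypothetical sketch of parameter-sweep instantiation (requirement 5):
# expand a base job plus per-parameter value lists into concrete jobs.
from itertools import product

def expand_sweep(base_args, sweep_rules):
    """Instantiate one argument set per combination of sweep values."""
    names = sorted(sweep_rules)
    jobs = []
    for values in product(*(sweep_rules[n] for n in names)):
        args = dict(base_args)          # invariant part of the job
        args.update(zip(names, values)) # swept parameters for this run
        jobs.append(args)
    return jobs

# One clustering run per (k, seed) combination: 3 x 2 = 6 jobs.
jobs = expand_sweep({"algorithm": "k-means"},
                    {"k": [2, 4, 8], "seed": [1, 2]})
print(len(jobs))  # 6
```

Each resulting argument set would then be submitted to the broker as an ordinary job, so the sweep itself needs no special treatment downstream of this expansion step.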

5 Related work

Paying special attention to de-facto standards and grid middleware, we could not identify any existing off-the-shelf resource broker that meets all (or most) of the requirements specified for the DataMiningGrid Resource Broker. Both Globus Toolkit 4 (GT4) [20] and Condor [21] were previously selected to form the underlying grid middleware infrastructure in the DataMiningGrid system architecture. Below we summarize the findings of our research into existing relevant grid technology, including resource brokers.

The GridLab Resource Management System (GRMS) [22] is a job meta-scheduling and resource management framework that allows users to build


and deploy job and resource management systems for grid environments. It is based on dynamic resource discovery and selection, mapping and scheduling methodologies, and it is able to deal with resource management challenges. The GRMS manages the entire process of remote job submission and control over distributed systems. However, its strengths are fully expressed only in combination with the complete GridLab middleware. At the time of the design of the DataMiningGrid Resource Broker, the GRMS was not WSRF-compliant and did not support any interaction with GT4. To the best of our knowledge, the parameter sweep functionality required by several demonstrator applications is also not supported by GRMS.

Enabling Grids for E-science (EGEE) [23], a large EC-funded project aiming to provide a worldwide seamless grid infrastructure for e-Science, uses a resource broker which is installed on a central machine. The resource broker receives requests and then dispatches jobs according to system parameters. At the time of the DataMiningGrid Resource Broker design, the latest version of the EGEE resource broker was the LCG-2 [24] resource broker. This version is not service-based (no compliance with WSRF) and does not support automated parameter sweep functionality. At that time the plan was to replace LCG-2 with the gLite resource broker [25].

Cactus [26] is a numerical problem-solving environment for scientists. It supports data grid features using MPICH-G and the Globus Toolkit. However, applications in the Cactus environment have to be written in MPI, which implies that a legacy application cannot be easily adapted to run on a grid. Cactus is not WSRF-compliant and does not provide the needed data handling functionality.

Nimrod-G specializes in parameter-sweep computation. However, the scheduling approach within Nimrod-G aims at optimizing user-supplied parameters such as deadline and budget for computational jobs only [27]. It does not support methods for accessing remote data repositories or for optimizing data transfer. Also, Nimrod-G does not have any mechanisms for automated resource aggregation. To facilitate interoperability, Nimrod-G requires that job wrapper agents be provided for new applications.

The GridBus resource broker [28] extends the Nimrod-G computational grid resource broker model to distributed data-oriented grids. The GridBus broker also extends Nimrod-G's parametric modelling language by supporting dynamic parameters, i.e., parameters whose values are determined at runtime [29]. However, the original GridBus implementation does not support automated resource discovery, interoperability (allowing execution on Unix-based machines only), or the WSRF standard. It also lacks the data movement optimizations needed for data mining.


6 Design and implementation

Figure 1 depicts the DataMiningGrid system architecture, including the Resource Broker and the components related to it. The architecture consists of four distinct layers. The top three layers contain components designed or enhanced to facilitate data mining applications in highly dynamic and heterogeneous grid environments. The components developed by the DataMiningGrid consortium are highlighted in red in the diagram. Additional components providing and supporting typical data mining processes and functions are highlighted in grey. These may include support for managing provenance information, for access to heterogeneous database systems, and sophisticated tools permitting the visualization of results. These components do not directly interface with the Resource Broker, nor do they provide any relevant information to the job submission process. Since the focus of this study is on the Resource Broker, these components are not discussed here. To better understand the role and function of the DataMiningGrid Resource Broker in the system, the DataMiningGrid system architecture and some of its components are now described in some detail.

6.1 Client layer

The highest level of the DataMiningGrid system represents the different clients, which serve as user interfaces to the grid by interfacing with the WSRF-compliant services located in the high-level service layer. Depending on the level of expertise of the different user groups (in terms of grid technology, data mining and domain-specific technologies), the architecture supports different types of clients, including general-purpose workflow editors, Web portals with a minimal set of options, or applications that integrate access to the grid system into their native user interface. Typically, each of these clients provides a graphical user interface to access and control one or more of the following functions:

1. Searching the grid for available applications according to user-defined criteria such as application name, vendor, version, and type of data mining function provided by the application (e.g., feature selection, classification, clustering, association mining, text mining, and so on). For users to be able to access and use a data mining application, the application needs to reside on servers that are permanently connected to the grid.

2. Specification of parameter values and input data for the selected data mining application. Once the user has provided this information, an XML-coded job description document is created and stored locally for resubmission before it is passed to the Resource Broker for interpretation.

3. Initiation of the execution of the data mining application in the grid.

Upon this initiation, the Resource Broker reads and interprets the job description document and triggers the execution of the jobs.

While the first two functions may be omitted, for instance, when providing a client for executing invariant but re-occurring tasks, the third is compulsory for every type of client. For example, in the DataMiningGrid system, special units for the Triana workflow editor [30], implementing the full set of functions listed above, were developed, as well as a Web portal serving as a job submission interface accepting only pre-configured job descriptions. Also located in the client layer is a component for monitoring all jobs submitted to the Resource Broker by a client.

6.2 Grid middleware layer

The grid middleware layer contains all components included in GT4 (green). These provide basic functionality including security mechanisms, high-speed, file-based data transport (GridFTP/RFT), and the grid-wide registry Monitoring and Discovery System 4 (WS-MDS) [31]. The latter implements a distributed in-memory XML database that stores information about available resources contained in the fabric layer, jobs currently being processed by the system, and available data mining applications. This information can be retrieved using standard XPath queries [32].
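As a toy illustration of such XPath lookups against an XML resource registry, the sketch below queries an invented document with Python's standard library. Real MDS4 entries use a GLUE-schema layout and a fuller XPath engine; the element names here are assumptions made for the example only.

```python
# Toy illustration of querying an XML resource registry with XPath,
# in the spirit of the WS-MDS lookups described above. The document
# structure is invented; ElementTree supports a subset of XPath.
import xml.etree.ElementTree as ET

registry = ET.fromstring("""
<registry>
  <resource><host>node-a</host><os>Linux</os><arch>x86</arch></resource>
  <resource><host>node-b</host><os>Windows</os><arch>x86</arch></resource>
</registry>
""")

# XPath predicate: select only the resources whose <os> child is Linux.
linux = registry.findall("./resource[os='Linux']")
hosts = [r.findtext("host") for r in linux]
print(hosts)  # ['node-a']
```

A broker front-end would issue queries of this shape against the registry service to obtain the candidate machine set before matchmaking.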

The Grid Resource Allocation and Management Service (WS-GRAM) [33] (which is responsible for submitting, monitoring, and cancelling jobs) manages the execution of applications (i.e., jobs) on a particular computational resource through its adapter mechanism. These adapters may either execute jobs on the local machine where GT is installed (using a C-like fork command) or pass jobs to a local task scheduling system, which then schedules the jobs on the machines it controls. In its current version, the DataMiningGrid system uses Condor [21] as a local scheduler for managing clusters. However, the original GT4-Condor adapter lacks the capabilities for transferring complete directory structures and executing Java applications. The current Condor implementation (version 6.7) and, as a result, the standard GT4-Condor adapter restrict data movement to copying files only. While this problem needs to be addressed by the Condor development team, we work around it by compressing the recursive directory structures into a single archive (i.e., a single file), moving the archive to the execution machine, and extracting its content before the actual execution of the data mining application. For executing Java applications, we extended the original GT4-Condor adapter to handle parameters regarding the Java virtual machine and the class path. The original GT4


Fork adapter also lacks the ability to execute Java applications. The modifications made to it are very similar to the changes we made to the standard GT4-Condor adapter.
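The directory-transfer workaround described above, packing a directory tree into one archive, shipping that single file, and unpacking it on the execution machine, can be sketched as follows. The paths and helper names are illustrative; the real logic lives inside the extended GT4-Condor adapter.

```python
# Sketch of the stage-in workaround: a directory tree becomes a single
# archive file, which is the only artifact that crosses the network.
import os
import shutil
import tarfile
import tempfile

def pack(src_dir, archive_path):
    """Compress a whole directory tree into one gzip'd tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".")

def unpack(archive_path, dest_dir):
    """Restore the directory tree on the execution machine."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)

# Demonstrate a round trip through a single-file archive.
src = tempfile.mkdtemp()
os.makedirs(os.path.join(src, "lib"))
open(os.path.join(src, "lib", "miner.jar"), "w").close()

archive = os.path.join(tempfile.mkdtemp(), "stage-in.tar.gz")
pack(src, archive)                 # one file crosses the network
dest = tempfile.mkdtemp()
unpack(archive, dest)              # restored tree on the target machine

restored = os.path.exists(os.path.join(dest, "lib", "miner.jar"))
print(restored)  # True
shutil.rmtree(src); shutil.rmtree(dest); os.remove(archive)
```

Deleting the extracted tree after the run, as the last line hints, is also what preserves the zero-footprint requirement on the execution machines.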

As the grid system outlined here is based on GT4, it also offers the same security mechanisms, such as a public key infrastructure based on X.509-compliant certificates, SAML Authorization Decision support, and encryption of all network communication, including messages between Web services and data transfer.

6.3 Grid fabric layer

The lowest layer represents the grid resources such as data mining applications available in the grid, data, CPUs, storage, networks, and clusters (pink, blue). As discussed before, the latter are controlled by local schedulers such as Condor. These resources are accessed only by the grid middleware.

6.4 High-level services layer

The different clients and monitoring components interface with the Information Integrator service for application discovery and management, and with the Resource Broker for job execution and monitoring. The Resource Broker was not implemented completely from scratch, but is based on the GridBus Grid Service Broker and Scheduler version 2.4 [29,34]. The following considerations motivated this choice:

1. The GridBus resource broker is capable of submitting jobs to the execution subsystem (WS-GRAM) of GT4 (as well as to many other resources and grid systems, e.g., Alchemi, Unicore, XGrid, and others).

2. The GridBus broker’s architecture is clearly structured and well designedfrom a software engineering point of view.

3. Unlike many other resource brokers, GridBus does not require any particular information or security system. It is designed as stand-alone software, ready to be integrated with various existing components and systems.

While offering a solid basis for the DataMiningGrid Resource Broker, the GridBus broker in its original version does not meet some critical requirements specified for the DataMiningGrid Resource Broker:

1. GridBus v2.4 is not service-oriented, but needs to be installed on everyclient machine. Thus, it does not adhere to recent grid standards such as


WSRF and OGSA.
2. It does not provide mechanisms for automated resource aggregation, but requires the user to provide this function. Such tasks require extensive knowledge about the grid's topology and its internal representations. Typically, this task cannot be performed by users who are not experts in grid technology.

3. It supports job execution on Linux/Unix-based machines only. This contradicts some of the basic requirements of grid systems, which are intended to support and interoperate with heterogeneous hardware and software (including operating systems) computing environments.

The GridBus v2.4 implementation was modified to fit into the service-oriented architecture of the DataMiningGrid system by wrapping it as a WSRF-compliant service. The resulting Resource Broker service exposes its main features through simple public interfaces. It was further enhanced to query the MDS4 automatically in order to obtain the set of available resources. To match resource capabilities to job requirements, the Resource Broker requires a job description document passed to it by a client component from the upper layer (Figure 1) for each individual job. This XML-based document consists of two parts:

• A non-changeable application description containing all invariant attributes of the respective application (e.g., system architecture, location of the executable and libraries, programming language). These attributes cannot be altered by users of the system, but are typically specified by the application developer during the process of publishing the application in the grid.

• Modifiable values, which are provided by end users before or during runtime of the application (e.g., application parameter values, data input, additional requirements). These are entered using one of the graphical user interface clients from the client layer (Figure 1).
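The two-part structure described above might look as follows. This is a hypothetical sketch; the element names are illustrative and not taken from the actual DataMiningGrid schema:

```xml
<jobDescription>
  <!-- Invariant part, fixed by the application developer at publishing time -->
  <applicationDescription>
    <architecture>ia64</architecture>
    <operatingSystem>Linux</operatingSystem>
    <executable>gridftp://apps.example.org/j48/weka.jar</executable>
    <language>Java</language>
  </applicationDescription>
  <!-- Modifiable part, supplied by the end user via a GUI client -->
  <userSettings>
    <option flag="-C" value="0.25"/>
    <dataInput>gridftp://data.example.org/train.arff</dataInput>
    <minMemoryMB>512</minMemoryMB>
  </userSettings>
</jobDescription>
```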

From the resulting job description the Resource Broker evaluates various types of information for resource aggregation.

Static resource requirements regarding system architecture and operating system. Applications implemented in a hardware-dependent language (e.g., C) typically run only on the system architecture and operating system they have been compiled for (e.g., PowerPC or Intel Itanium running Linux). For this reason, the Resource Broker has to select execution machines that offer the same system architecture and operating system as required by the application.

Modifiable resource requirements: memory and disk space. While data mining applications may require a minimal amount of memory and disk space at start-up time, memory and disk space demands typically rise with the amount of data being processed and with the solution space being explored.

Therefore, end users are allowed to specify these requirements in accordance with the data volume to be processed and their knowledge of the application's behaviour. The Resource Broker will take these user-defined requirements into account and match them to those machines and resources that meet them.

Modifiable requirements: identity of machines. In some cases end users may wish to limit the list of possible execution machines based on personal preferences, for instance, when processing sensitive data. To support this requirement, it is possible for the user to specify the IPs of such machines in the job description. Such a list causes the Resource Broker to match only those resources and machines listed and to ignore all other machines independent of their capabilities.

The total number of jobs. Instead of specifying single values for each option and data input that the selected application requires, it is also possible to declare a list of distinct values (e.g., true, false) or a loop (e.g., from 0.50 to 10.00 with step 0.25). These represent rules for variable instantiations, which are translated into a number of jobs with different parameters by the Resource Broker. This is referred to as a multi-job. As a result, the Broker will prefer computational resources that are capable of executing the whole list of jobs at once in order to minimize data transfer. Typically, such resources are either clusters or high-performance machines offering many distinct processors. As an example, if the user specifies two input files (a.txt, b.txt) for the same data input and two loops running from 1 to 10 with step 1 as parameters for two options, the Resource Broker will translate this into two hundred (2 x 10 x 10) distinct jobs. If no single resource capable of executing them at once is available, the Broker will distribute these jobs over those resources that provide the highest capability.
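The rule expansion described above is a Cartesian product of the declared value lists and loops. A minimal Python sketch (the function names and the list/loop representation are hypothetical, not the broker's actual API):

```python
from itertools import product

def loop(start, stop, step):
    """A loop rule such as 'from 1 to 10 with step 1' (inclusive bounds)."""
    values, v = [], start
    while v <= stop:
        values.append(round(v, 10))  # guard against float drift for fractional steps
        v += step
    return values

def expand_multi_job(data_inputs, option_sweeps):
    """Translate variable-instantiation rules into distinct job parameter sets:
    the Cartesian product of the data inputs and every option sweep."""
    axes = [list(data_inputs)] + [list(sweep) for sweep in option_sweeps]
    return list(product(*axes))

# The example from the text: two input files and two loops from 1 to 10 with step 1
jobs = expand_multi_job(["a.txt", "b.txt"], [loop(1, 10, 1), loop(1, 10, 1)])
print(len(jobs))  # 200 distinct jobs (2 x 10 x 10)
```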

In addition, the job description includes further information that becomes important at the job submission stage. This information is briefly described below:

• Instructions on where the application executables are stored, including all required libraries, and how to start the selected applications. These are required for transferring applications to execution machines across the grid, which is part of the stage-in process discussed in more detail in the following section. By staging in applications together with the input data dynamically at run-time, the system is capable of executing these applications on any suitable machine in the grid without prior installation of the respective application.

• All data inputs and data outputs that have to be transferred prior to the execution.

• All option values (application parameters) that have to be passed to the application at start-up. As the Resource Broker is capable of scheduling applications that are started in batch mode from a command line, it passes all option values as flag-value pairs. Here, each flag is fixed and represents a single option. The values, however, may change for each call if a multi-job is specified.

Finally, we enabled the Resource Broker to use machines that are not operated under Linux/Unix. The original implementation from GridBus wraps all applications with a shell script before scheduling them for execution. Contradicting the basic philosophy of grid computing, this prevents, for example, execution of applications on Windows-based machines. In addition, this restriction also proved to be in violation of the requirements of the project partners in the DataMiningGrid project [5], who use several pools of machines running under MS Windows. This issue was resolved by simply removing the creation of this wrapper script and modifying the Broker accordingly.

7 Executing data mining applications

When a DataMiningGrid-enabled application is being executed, the Resource Broker takes on a central role in orchestrating the relevant processes.

7.1 Matching

The DataMiningGrid system is designed to be a generic system, capable of executing any batch algorithm in a grid computing environment. In order to orchestrate the execution process, the DataMiningGrid Resource Broker relies on two sets of information to match available resources to job execution requests from users. The first set is represented by the user request for resources, based on detailed information about the application to be executed. This includes the number of needed CPUs and their architecture, the type of operating system, free memory and disk space, and so on. This information is matched against the specification and status of the available resources in the grid environment. The resource managers of the available grid resources automatically register their resource offers in the central information system, which is based on the WS Globus Monitoring and Discovery System (WS-MDS). Ultimately, the application requirements and resource specification data are encoded as XML-formatted documents.

The matching process begins when the Resource Broker receives the request for resources (jobs) and their requirements. Upon reception of this information, the Resource Broker queries the information system for the available grid resources, and filters out those that do not meet the requirements. Possible user requirements specifying a restriction of the execution to one or more specific WS-GRAMs are also taken into account during this matching process. Such restrictions are useful for working with databases or data files whose content cannot be moved (or is too expensive to move). After matching is completed, the matching module transfers to the scheduler a detailed list of the resources that meet the requirements of the user's job execution request.
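The filtering step can be sketched as follows. This is a simplified illustration with hypothetical field names, not the actual WS-MDS resource schema:

```python
def match_resources(resources, req):
    """Return the resources that satisfy a job's requirements: exact match
    on static fields (architecture, operating system), minimum thresholds
    for memory and disk, and an optional restriction to named machines."""
    matched = []
    for r in resources:
        if "arch" in req and r["arch"] != req["arch"]:
            continue  # wrong system architecture
        if "os" in req and r["os"] != req["os"]:
            continue  # wrong operating system
        if r.get("free_memory_mb", 0) < req.get("min_memory_mb", 0):
            continue  # not enough free memory
        if r.get("free_disk_mb", 0) < req.get("min_disk_mb", 0):
            continue  # not enough free disk space
        if "allowed_hosts" in req and r["host"] not in req["allowed_hosts"]:
            continue  # user restricted execution to specific machines
        matched.append(r)
    return matched

grid = [
    {"host": "a.example.org", "arch": "ia64", "os": "Linux",
     "free_memory_mb": 2048, "free_disk_mb": 10000},
    {"host": "b.example.org", "arch": "x86", "os": "Windows",
     "free_memory_mb": 1024, "free_disk_mb": 5000},
]
hits = match_resources(grid, {"os": "Linux", "min_memory_mb": 1024})
print([r["host"] for r in hits])  # ['a.example.org']
```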

7.2 Scheduling

The scheduling policy component of the DataMiningGrid Resource Broker is implemented as a pluggable module, which can be changed on demand. The default policy is to prefer those WS-GRAM services with the largest number of computational resources. On average, this will minimize data transfers, as WS-GRAMs that have sufficiently large resources to execute all the jobs of a single multi-job submission will require only one stage-in process. The Resource Broker was designed on the basis of a number of assumptions regarding the execution environment and its purpose. The assumptions can be summarized as follows: (a) the preferred policy is to minimize the execution latency for each user (all the jobs must be completed as soon as possible); (b) failure of single jobs should not cancel the execution of the remaining jobs; (c) job execution time is, on average, longer than stage-in time; and (d) data movement is expensive. Based on these assumptions and rules, the scheduling algorithm implemented by the DataMiningGrid Resource Broker is shown in Algorithm 1.

The rationale for the scheduling algorithm is to address the assumptions and requirements discussed above. The Resource Broker prepares the collection of available WS-GRAMs and sorts the collection in descending order by the number of free CPUs. The WS-GRAM with the highest capacity is selected first in order to reduce, on average, the number of stage-in procedures. Stage-in, being an expensive procedure, is performed once per WS-GRAM and not once per job. If the selected WS-GRAM does not have sufficient capacity to execute all jobs, the WS-GRAM with the next highest capacity is selected, and so on. This behaviour is explained by assumption (c) above: it is preferable to start the execution on a new WS-GRAM instead of waiting for the previous WS-GRAM to finish execution of the submitted jobs. The Resource Broker tries to submit all jobs as soon as it can in order to reduce the execution latency of a user's multi-job submission. During the execution, all jobs are constantly monitored until the last job is completed.

7.3 Stage-in

The optimization of the stage-in process is based on the assumption that data movement is an expensive and time-consuming process. The minimization of data movement is achieved by performing the stage-in process once per WS-GRAM (as opposed to once per job); each job is then provided with the complete URI of the local data storage location. Usually, the WS-GRAM is responsible for at least one cluster of machines. The per-WS-GRAM approach has the advantage that the stage-in operation is performed far fewer times than would be necessary in a per-job approach. After the successful stage-in of all the executables, data and libraries, the executable is launched and the execution monitored until completion.

7.4 Job monitoring

As already mentioned, a single job description document may result in the execution of thousands of jobs, which all have different job IDs but the same scheduler ID assigned to them. This scheduler ID allows tracking of a set of jobs even if their execution is managed by different instances of the execution manager (i.e., different GT4 installations). Figure 2 depicts the client-side monitoring component, which, upon user request, is capable of displaying up-to-date information about the status of each job by querying the Resource Broker with the scheduler ID.

During job execution, the Resource Broker constantly monitors all changes in the status of each job and reports them to the user. Jobs that are detected as failed are automatically resubmitted to another WS-GRAM for several retries. In parallel, WS-GRAMs that are detected as failed are removed from the list of available WS-GRAMs and do not take part in the rescheduling of jobs.
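A client-side status summary of the kind shown in Figure 2 reduces to grouping jobs by their shared scheduler ID. A schematic sketch (job records in the real system are WSRF resource properties, not dictionaries, and the status names below are illustrative):

```python
from collections import Counter

def status_summary(jobs, scheduler_id):
    """Count job states for all jobs submitted under one scheduler ID,
    regardless of which GT4 installation executes them."""
    return Counter(j["status"] for j in jobs
                   if j["scheduler_id"] == scheduler_id)

jobs = [
    {"job_id": 1, "scheduler_id": "s-42", "status": "DONE"},
    {"job_id": 2, "scheduler_id": "s-42", "status": "RUNNING"},
    {"job_id": 3, "scheduler_id": "s-42", "status": "FAILED"},
    {"job_id": 4, "scheduler_id": "s-99", "status": "DONE"},
]
summary = status_summary(jobs, "s-42")
print(dict(summary))  # {'DONE': 1, 'RUNNING': 1, 'FAILED': 1}
```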

7.5 Stage-out

In the execution request, the user also specifies the location of the storage server on which the results are to be saved. At the end of the execution process, all results are shipped to the storage server, and the URI of that location is returned to the user for further data handling. The Resource Broker also takes care of deleting all data transferred or generated in the course of executing an individual job description document, including executables, libraries, input data and temporary data, thus leaving the execution machine in the same state as it was prior to execution.

8 Evaluation

The Resource Broker was tested in the DataMiningGrid test bed [5] spanning three European countries: the United Kingdom, Germany and Slovenia. The test bed consists of four servers with GT4 installations and three local computational clusters based on Condor with varying numbers of computational machines. The central GT4 server was running core GT4 services as well as the Resource Broker and Information Integrator services. We present below the average results of several executions on a test bed containing the execution machines presented in Table 1.

In order to evaluate the Resource Broker, several data mining applications were tested. Performance measures of two of these applications are presented here. The first application uses J48 [12], an implementation of the C4.5 decision tree algorithm [40] developed by Ross Quinlan, which is part of the open source data mining toolkit Weka [41]. The second application is used for re-engineering gene-regulatory networks from dynamic microarray data. This algorithm, the Co-dependence Algorithm, was developed by the University of Ulster and the Weihenstephan University of Applied Sciences [35]. It is based on an evolutionary algorithm technique. The Co-dependence Algorithm was chosen to represent a "long job" and the J48 Algorithm to represent a "short job". The Co-dependence Algorithm's fastest total serial run-time on the fastest machine in the test bed was measured at 2200 seconds, and the J48 Algorithm's fastest total serial run-time on the fastest machine in the test bed was measured at 250 seconds.

8.1 Speed-up

Speed-up quantifies the reduction in elapsed time obtained by processing a constant amount of work or load on a successively greater number of computers or processors. Speed-up is a typical reason why users want to grid-enable data mining applications. For pure and homogeneous parallel computer scenarios, speed-up is typically defined as the ratio of the serial execution time of the fastest known serial algorithm (requiring a certain number of basic operations to solve the problem) to the parallel execution time of the chosen algorithm [36]. However, in a grid environment heterogeneity is the norm. Therefore, the assumptions made for a homogeneous parallel computer set-up do not normally hold in a grid.

To estimate speed-up, we determined the fastest serial run-time, tS, of a single instance of the two algorithms on the fastest machine in the test bed and then measured the parallel run-times, tP(N), for an increasing number, N, of CPUs in the test bed based on running a fixed number, K, of instances (jobs) of the algorithm. From this we calculate speed-up as follows:

speed-up = K · tS / tP(N)    (1)

For our speed-up experiments, we used a fixed number of 100 jobs (K = 100) and the following numbers of processors: N = 10, N = 50 and N = 100. The raw run-time measurements and the derived speed-up measures of both algorithms are depicted in Table 2 and Table 3 below. The data shows that the speed-up of both the Co-dependence Algorithm and the J48 Algorithm increases approximately linearly:

• Co-dependence Algorithm: speed-up ∼ N/2

• J48 Algorithm: speed-up ∼ N/3

The gap between the optimum speed-up and the achieved speed-up can be explained by analyzing the machines that were utilized in the experiment. For instance, let us examine the experiments with N = 100. Since the termination time of the parallel execution depends on the termination time of the job on the slowest machine, the speed-up results would be fairly close to optimum for the long jobs (460,000 / 4867 ≈ 95) and still acceptable for the short jobs (57,000 / 831 ≈ 69) in a test bed consisting of homogeneous hardware. The worse performance of the shorter jobs originates in the system's overhead for each submitted job, which includes submission of the job to the local scheduler, propagation of state changes, etc. As the system's overhead is approximately constant for each job (since it does not include the stage-in/out time, which is incurred once per WS-GRAM), the experiments with long jobs, for which such overhead is negligible compared to their execution time, show results very close to the optimum speed-up.
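The ratios quoted above follow directly from Equation (1); as a quick check, using the totals of serial work (K · tS) and parallel run-times quoted in the text for N = 100:

```python
def speed_up(k, t_serial, t_parallel):
    # Equation (1): speed-up = K * tS / tP(N)
    return k * t_serial / t_parallel

# Hypothetical homogeneous-hardware estimates from the text for N = 100,
# expressed as total serial work divided by the parallel run-time:
print(round(460_000 / 4867))  # long jobs: close to the optimum of 100
print(round(57_000 / 831))    # short jobs: noticeably below the optimum
```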

8.2 Scale-up

In general, scalability of a distributed computing architecture is a measure of its capability to effectively utilize an increasing number of resources such as processing elements [36]. Another way of formulating this is that the scalability of a parallel (or distributed) system is a measure of its capacity to deliver linearly increasing speed-up with respect to the number of processors used [37]. Typically, good scalability of a distributed system is desired if the system needs to support more users, both humans and other computers. To avoid adverse impact on the response times of current users, the capacity of the system must be grown (or scaled up) in proportion to the additional load caused by more users. Therefore, scale-up may be defined as the ability of a greater number of processors to accommodate a proportionally greater workload in a more-or-less constant amount of time. For grid-enabled data mining systems, scale-up becomes critical as the number of users of such a system increases. Hence, good scale-up behaviour is a critical requirement for the DataMiningGrid Resource Broker.

To quantify the scale-up behaviour of the DataMiningGrid Resource Broker, we carried out a series of experiments with increasing system capacity (number, N, of CPUs) and system load (number, K, of short and long data mining jobs) with the algorithms and test bed described above. The data obtained from these experiments is depicted in Table 4 and Table 5 below. The data shows that the scale-up behaviour of the Resource Broker is excellent. In the long-job (Co-dependence Algorithm) scenario the response time increases by only 3.2% as N and K are increased from 10 to 50 and then to 100. In the case of the short-job experiments (J48 Algorithm) the response time increases by 19.5% as the load and capacity are stepped up. The data shows that when dealing with a relatively small number of jobs, the slowest machines have a critical negative impact on the achieved scale-up. When executing 100 or fewer jobs on 100 machines, the slowest machines are the ones that widen the gap between the actual scale-up and the maximum scale-up. However, when submitting a large set of jobs, the influence of the small number of slowest machines decreases, causing this gap to be steadily reduced.

9 Conclusions and future work

This study presents the Resource Broker developed within the DataMiningGrid project [5]. The design of the Resource Broker was driven by requirements arising from data mining applications in different sectors. A review of existing technology showed that no resource broker was available that addresses all requirements. The DataMiningGrid Resource Broker combines the following features (Section 4): (a) fully automated resource aggregation, (b) data-oriented scheduling, (c) zero footprint on execution machines, (d) extensive monitoring of resources, (e) parameter sweep support, (f) interoperability, (g) adherence to interoperability standards, (h) adherence to security standards, and (i) user friendliness. Extensive performance experiments we carried out indicate that the Resource Broker shows excellent speed-up and scale-up behaviour for jobs which run for more than half an hour. Both are critical to support modern data mining applications.

The DataMiningGrid Resource Broker has been developed as part of a comprehensive architecture for grid-enabling data mining applications. This architecture is designed to be highly flexible (software and hardware platforms, type of data mining applications and data mining technology), extensible (adding of system features, applications and resources), efficient (throughput and scalability), and user-friendly (graphical and Web user interfaces, support for user-definable workflow, hiding of underlying complexity from the user). The development of the DataMiningGrid system architecture and its Resource Broker involved an extensive study and evaluation of WSRF [15] and the Globus Toolkit [20]. Following service-oriented architecture principles, the DataMiningGrid Resource Broker supports a full, easy-to-use framework for data mining in grid computing.

Currently a large set of data mining applications is being evaluated on the DataMiningGrid system. This will help us to identify areas where additional development may be useful. In its present implementation, one Resource Broker is required for each virtual organization. Future work will consider a distributed implementation, which may help to further enhance availability and load balancing features.

References

[1] M.J. Berry, G. Linoff, Data Mining Techniques For Marketing, Sales and Customer Support, John Wiley & Sons, Inc., New York, 1997.

[2] A.K.T.P. Au, V. Curcin, M. Ghanem, et al., Why grid-based data mining matters? Fighting natural disasters on the grid: from SARS to landslides, in: S.J. Cox (ed.), UK e-Science All-Hands Meeting, AHM 2004, Nottingham, UK, September 2004, EPSRC, 2004, 121-126, ISBN 1-9044-2521-6. Also available at: http://www.allhands.org.uk/submissions/papers/81.pdf

[3] M. Cannataro, D. Talia, P. Trunfio, Distributed data mining on the grid, Future Generation Computer Systems, Vol. 18, 1101-1112, 2002.

[4] W.K. Cheung, X-F. Zhang, Z-W. Luo, F.C.H. Tong, Service-Oriented Distributed Data Mining, IEEE Internet Computing, 44-54, July/August 2006.

[5] The DataMiningGrid Consortium and Project, www.DataMiningGrid.org

[6] K. Krauter, R. Buyya, M. Maheswaran, A taxonomy and survey of grid resource management systems for distributed computing, Software: Practice and Experience, Vol. 32, No. 2, 135-164, 2001.

[7] D. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.

[8] C. Shearer, The CRISP-DM Model: The New Blueprint for Data Mining, Journal of Data Warehousing, 5(4): 13-22, 2000.

[9] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM, Vol. 39, No. 11, 27-34, 1996.

[10] I. Foster, C. Kesselman (eds.), The Grid 2: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, 2004.

[11] J. Nabrzyski, J.M. Schopf, J. Weglarz (eds.), Grid Resource Management: State of the Art and Future Trends, Kluwer Academic Publishers, Boston, Dordrecht, London, 2004.

[12] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementation, Morgan Kaufmann, San Francisco (US), 2000.

[13] T. Kosar, M. Livny, A framework for reliable and efficient data placement in distributed computing systems, Journal of Parallel and Distributed Computing, No. 65, 1146-1157, 2005.

[14] I. Foster, C. Kesselman, J. Nick, S. Tuecke, The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, Globus Project, 2002, www.globus.org/research/papers/ogsa.pdf

[15] I. Foster, K. Czajkowski, D.E. Ferguson, J. Frey, S. Graham, T. Maguire, D. Snelling, S. Tuecke, Modeling and managing state in distributed systems: the role of OGSI and WSRF, Proc. of the IEEE, Vol. 93, No. 3, 604-612, 2005.

[16] V. Welch, I. Foster, C. Kesselman, O. Mulmo, L. Pearlman, S. Tuecke, J. Gawor, S. Meder, F. Siebenlist, X.509 Proxy Certificates for Dynamic Delegation, Proc. of the 3rd Annual PKI R&D Workshop, 2004.

[17] A. Barbir, M. Gudgin, M. McIntosh, Basic Security Profile Version 1.0, Working Group Draft, online at http://www.ws-i.org/Profiles/BasicSecurityProfile-1.0-2004-05-12.html

[18] J. Rosenberg, D. Remy, Securing Web Services with WS-Security, Sams Publishing, Indianapolis, 2004.

[19] G. Della-Libera, B. Dixon, P. Garg, S. Hada, Web Services Secure Conversation (WS-SecureConversation), C. Kaler, A. Nadalin (eds.), Microsoft, IBM, VeriSign, RSA Security, 2002.

[20] The Globus Alliance, A Globus Primer: Or, Everything You Wanted to Know about Globus, but Were Afraid To Ask. Describing Globus Toolkit Version 4, www.globus.org/toolkit/docs/4.0/key/GT4 Primer 0.6.pdf

[21] M. Litzkow, M. Livny, Experience with the Condor distributed batch system, Proc. IEEE Workshop on Experimental Distributed Systems, 97-100, 1990.

[22] G. Allen, T. Goodale, T. Radke, M. Russell, E. Seidel, K. Davis, K.N. Dolkas, N.D. Doulamis, T. Kielmann, A. Merzky, J. Nabrzyski, J. Pukacki, J. Shalf, I. Taylor, Enabling Applications on the Grid: A GridLab Overview, International Journal of High Performance Computing Applications, Vol. 17, No. 4, 449-466, 2003.

[23] F. Gagliardi, B. Jones, F. Grey, M.-E. Bégin, M. Heikkurinen, Building an infrastructure for scientific Grid computing: status and goals of the EGEE project, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 363, No. 1833, 1729-1742, 2005.

[24] LCG-2 User Guide, http://egee.itep.ru/User Guide.html

[25] C. Munro, B. Koblitz, Performance comparison of the LCG2 and gLite file catalogues, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Vol. 559, No. 1, 48-52, 2006.

[26] G. Allen, W. Benger, T. Goodale, H. Hege, G. Lanfermann, A. Merzky, T. Radke, E. Seidel, J. Shalf, The Cactus Code: A Problem Solving Environment for the Grid, Proc. of the Ninth International Symposium on High Performance Distributed Computing (HPDC), Pittsburgh, USA, IEEE Computer Society Press, Los Alamitos, CA, USA, 2000.

[27] D. Abramson, J. Giddy, L. Kotler, High Performance Parametric Modeling with Nimrod/G: Killer Application for the Global Grid?, Proc. of the International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun, Mexico, IEEE Computer Society Press, Los Alamitos, CA, USA, 520-528, 2000.

[28] S. Venugopal, R. Buyya, L. Winton, A Grid Service Broker for Scheduling e-Science Applications on Global Data Grids, Concurrency and Computation: Practice and Experience, Wiley Press, USA, Vol. 18, No. 6, 685-699, 2005.

[29] K. Nadiminti, S. Venugopal, H. Gibbins, T. Ma, R. Buyya, The GridBus Grid Service Broker and Scheduler, http://www.Gridbus.org/broker/2.4/manualv2.4.pdf

[30] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, I. Wang, Programming scientific and distributed workflow with Triana services, Concurrency and Computation: Practice and Experience, Vol. 18, No. 10, 1021-1037, 2005.

[31] Globus Toolkit 4.0: Information Services, http://www.globus.org/toolkit/docs/4.0/info/

[32] XML Path Language (XPath) Version 1.0, W3C Recommendation, 16 November 1999, http://www.w3.org/TR/xpath

[33] GT 4.0: Execution Management, online at http://www.globus.org/toolkit/docs/4.0/execution/

[34] V. Kravtsov, T. Niessen, V. Stankovski, A. Schuster, Service-based Resource Brokering for Grid-based Data Mining, Proceedings of the 2006 International Conference on Grid Computing and Applications, Las Vegas, USA, 163-169, 2006.

[35] J. Mandel, N. Palfreyman, W. Dubitzky, Modelling codependence in biological systems, IEE Proc. Systems Biology, 153(5), 2006. In press.

[36] V. Kumar, A. Gupta, Analyzing Scalability of Parallel Algorithms and Architectures, Journal of Parallel and Distributed Computing (special issue on scalability), Vol. 22, No. 3, 379-391, 1994.

[37] A. Grama, A. Gupta, V. Kumar, Isoefficiency Function: A Scalability Metric for Parallel Algorithms and Architectures, IEEE Parallel and Distributed Technology, Special Issue on Parallel and Distributed Systems: From Theory to Practice, Vol. 1, No. 3, 12-21, 1993.

[38] T. Banks, Web Services Resource Framework Primer, OASIS Committee Draft 01, December 2005.

[39] K. Czajkowski, D.F. Ferguson, I. Foster, J. Frey, S. Graham, I. Sedukhin, D. Snelling, S. Tuecke, W. Vambenepe, The WS-Resource Framework, March 5, 2004.

[40] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco (US), 1993.

[41] I.H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, S.J. Cunningham, Weka: Practical machine learning tools and techniques with Java implementations, in: N. Kasabov, K. Ko (eds.), Proceedings of the ICONIP/ANZIIS/ANNES'99 Workshop on Emerging Knowledge Engineering and Connectionist-Based Information Systems, Dunedin, New Zealand, 192-196, 1999.

Fig. 1. The DataMiningGrid Resource Broker in the DataMiningGrid system architecture

Algorithm 1. Scheduling algorithm of the DataMiningGrid Resource Broker.

Function Schedule(jobs)

Obtain the set R of available grid resources (WS-GRAMs)

M = {R1, R2, . . . , Rm} ⊆ R {Filter out incompatible resources}

Sort M in descending order by |Ri| {|Ri| equals the number of idle CPUs in Ri}

unsubmitted jobs ← jobs; T ← {}

DO WHILE unsubmitted jobs ≠ {}

Rh ∈ M, ∀j ≠ h, |Rj| ≤ |Rh| {Select highest-capacity WS-GRAM}

Submit min(|Rh|, |unsubmitted jobs|) jobs to Rh

IF during the submission Rh was detected as failed THEN

M ←M\Rh {Remove failed WS-GRAM from M}

IF M = {} ∧ T = {} ∧ unsubmitted jobs ≠ {} THEN

Cancel all jobs in jobs\unsubmitted jobs

RETURN failure

END

Go to beginning of DO WHILE

END {if failed submission}

M ←M\Rh; T ← T ∪Rh {Move Rh from M to temporary list T}

Remove submitted jobs from list unsubmitted jobs

IF M = {} ∧ unsubmitted jobs ≠ {} THEN

Wait for pre-defined timeout

M ← T ;T ← {}

END {if submitted to all WS-GRAMs}

END {End of do while loop}

RETURN success

END {End of Function Schedule}
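The pseudocode above can be turned into a minimal runnable Python sketch. This is a simplified illustration, not the broker's Java implementation: the pre-defined timeout wait is elided to a comment, job monitoring is omitted, and `submit` stands in for the WS-GRAM submission call:

```python
def schedule(jobs, grams, submit):
    """Greedy scheduling per Algorithm 1: repeatedly pick the WS-GRAM with
    the most idle CPUs, submit as many unsubmitted jobs as it can hold, and
    recycle exhausted WS-GRAMs once all have been used but jobs remain.
    A failed submission drops the WS-GRAM permanently."""
    available = sorted(grams, key=lambda g: g["idle_cpus"], reverse=True)
    used, unsubmitted = [], list(jobs)
    while unsubmitted:
        if not available:
            if not used:
                return "failure"           # every WS-GRAM has failed
            available, used = used, []     # (real broker waits for a timeout here)
        gram = available.pop(0)            # highest remaining capacity
        batch = unsubmitted[:gram["idle_cpus"]]
        rest = unsubmitted[gram["idle_cpus"]:]
        if not submit(gram, batch):        # submission failed: keep jobs, drop WS-GRAM
            continue
        unsubmitted = rest
        used.append(gram)                  # move WS-GRAM to the temporary list
    return "success"

# Example: 7 jobs over two WS-GRAMs with 3 and 5 idle CPUs.
calls = []
grams = [{"name": "A", "idle_cpus": 3}, {"name": "B", "idle_cpus": 5}]
result = schedule(list(range(7)), grams,
                  lambda g, b: calls.append((g["name"], len(b))) or True)
print(result, calls)  # success [('B', 5), ('A', 2)]
```

As in the default policy, the 5-CPU WS-GRAM receives the larger batch first, so a multi-job that fits entirely on one WS-GRAM incurs only one stage-in.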

Fig. 2. The client-side job monitoring component

Table 1. Basic test bed configuration.

Location No. CPUs Typical CPU Memory

University of Ulster, UK 50-70 Itanium II 900 MHz 2 GB

University of Ulster, UK 4-10 Pentium, 2 to 3 GHz 1 GB

Fraunhofer Institute, Germany 4 Pentium 1.4 GHz 1 GB

University of Ljubljana, Slovenia 20-40 Pentium 1.8 GHz 1 GB

Table 2. Left: Raw run-time measurements of the "long-job" Co-dependence Algorithm. Right: Speed-up according to Equation (1).

Table 3. Left: Raw run-time measurements of the "short-job" J48 Algorithm. Right: Speed-up according to Equation (1).

Table 4. Scale-up experiment results for the "long-job" Co-dependence Algorithm.

Table 5. Scale-up experiment results for the "short-job" J48 Algorithm.
