
Noname manuscript No. (will be inserted by the editor)

A Virtual Mart for Knowledge Discovery in Databases

Claudia Diamantini · Domenico Potena · Emanuele Storti

DII, Università Politecnica delle Marche, Ancona, Italy
E-mail: c.diamantini,d.potena,[email protected]

Received: date / Accepted: date

Abstract The Web has profoundly reshaped our vision of information management and processing, highlighting the power of a collaborative model of information production and consumption. This new vision influences the Knowledge Discovery in Databases domain as well. In this paper we propose a service-oriented, semantic-supported approach to the development of a platform for the sharing and reuse of resources (data processing and mining techniques). The platform enables the management of different implementations of the same technique and is characterized by a community-centered attitude, with functionalities for both resource production and consumption, accommodating end-users with different skills as well as resource providers with different technical and domain-specific capabilities. We first describe the semantic framework underlying the approach; then we show how this framework is exploited to offer different functionalities to users through the presentation of the platform.

1 Introduction

The present paper proposes the Knowledge Discovery in Databases Virtual Mart (KDDVM), a framework and a platform for distributed Knowledge Discovery in Databases (KDD) experiments. As the name suggests, the aim is to provide a virtual environment where distributed and heterogeneous KDD resources (i.e. data processing and mining techniques) can be easily introduced, acquired and exploited. To achieve this goal, in five years of work on the subject we have identified the major issues and requirements of a distributed experimental environment, from which the technological aspects and basic components of a supporting architecture have been derived. We have also developed suitable technologies for the representation of knowledge about resources, and suitable services for its management and exploitation. All these achievements are systematically described in the paper.

The work is in the mainstream of the recent web-based revolution. The Web has profoundly reshaped our vision of information management and processing, software architectures and design, and delivery models. All of this can be at least in part explained by the more than linear growth of value in a network of elements, synthesized by Metcalfe's Law. The social aspect emerging from the success of the so-called Web 2.0 shows the strategic value of a collaborative model of information production and consumption (Pedrinaci and Domingue 2010). As a matter of fact, more and more organizations share their data in a way that allows people to process and exploit them. The Semantic Web, and its recent evolution into the Web of Data, aims at simplifying the discovery, exploitation and recombination of information through the use of machine-readable languages for the representation of knowledge and semantics.

Data processing in the Web era follows a similar trend. In particular, the Service Oriented Architecture (SOA) paradigm is a powerful principle supporting the collaborative production and consumption of computational capabilities, allowing new solutions to be built by reusing and recombining existing ones, and changing the software vision from product to service provision. Indeed, the use of libraries and modular programming has been in vogue since the very beginning of programming practice. What is different in the modern SOA paradigm is the extensive use of metadata, supporting "reuse-in-the-large", that is, supporting the development of


new complex distributed systems from heterogeneous, loosely coupled components. The management of purely syntactic metadata has been extended with the use of semantics, aiming at a uniform and rigorous specification of service capabilities that facilitates automatic service discovery and composition.

The underlying principles of the Semantic Web and Web 2.0 are present in our proposal. As a service-oriented, semantic-based framework, KDDVM enables the sharing and reuse of heterogeneous tools and workflows, giving advanced support for the semantic annotation of tools, the deployment of the tools as web services, and the discovery and use of such services. This leads to a collaborative, natively open environment where traditional as well as state-of-the-art techniques can be dynamically added and exploited, and where specific techniques can exceed the boundaries of the domain in which they were born and become available to end-users with diverse knowledge and skills.

Service-oriented, semantic-supported platforms for KDD are not new in the literature. A review of existing systems is given in the next section. What makes KDDVM original with respect to other proposals is that it is natively conceived for a distributed, collaborative environment, which led to a novel semantic framework where resources are represented at three different abstraction levels (algorithm, tool, service), with a one-to-many relationship among levels. This is the fundamental pillar on which the main features of the platform are built: (a) the capability to manage various kinds of heterogeneity, like different implementations with different characteristics of the same algorithm; (b) a community-centered attitude, with functionalities for both resource production and consumption, accommodating end-users with different skills as well as resource providers with different technical and domain-specific capabilities.

The rest of the paper is organized as follows: in Section 2 we review related work. Section 3 introduces the requirements seen as necessary for effectively supporting KDD experimentations, then discusses the framework proposed to deal with these requirements. Section 4 describes the system implementing such a framework. Finally, Section 5 reports some concluding remarks and presents possible extensions of the work.

2 Related Work

Knowledge Discovery in Databases is a complex process for extracting valuable knowledge from huge amounts of data. Hence, various systems have been proposed in the literature for supporting users in the design and management of KDD processes. These systems can be analyzed along their historical evolution: while 1st-generation KDD and DM systems provided support for single users in local settings, 2nd-generation systems are required to address the decentralization of users and tools, which produces more complexity both in managing distributed computation and in supporting cooperative work and knowledge sharing among distributed teams (Park and Kargupta 2003).

From a technical perspective, recent 2nd-generation systems are based on SOA and Grid technologies. The exploitation of the service-oriented paradigm for distributed KDD can be traced back to (Sarawagi and Nagaralu 2000), which introduced the idea of Data Mining models as services with the aim of facilitating the use of such techniques among novice users. This approach concerns single, ready-to-use models, which can be shared along with data, tools and skills; it does not address design issues. Many authors have proposed SOA-based Data Mining frameworks. In Anteater (Guedes et al 2006) the authors focused their efforts on architectural issues such as communication among services, as well as management and load balancing of parallel clustered servers, in order to achieve computationally intensive processing on large amounts of data. The platform requires tools to be converted into a filter-stream structure, which allows high scalability but limits extensibility. Furthermore, the platform provides the user with a simple interface that abstracts the algorithms' technical details. A change of perspective is proposed in (Kumar et al 2004), where the authors describe a SOA architecture for KDD in which services are sent to the nodes where datasets reside, instead of transferring data, with the aim of lowering bandwidth consumption and preserving privacy.

From a different perspective, some authors introduced user-oriented features both to provide support to data processing and to manage a knowledge discovery process. (Ali et al 2005) introduces a SOA architecture in which users can manually build a KDD process made of a set of pre-defined services from the Weka toolkit (Hall et al 2009) and from third parties. The generated process is a workflow that can be executed through Triana (Majithia et al 2004). A similar approach is adopted in (Tsai and Tsai 2005), where users are provided with an integrated interface for searching Data Mining services, building BPEL4WS processes and executing them.

Within the Grid community, many proposals deal with architectures allowing Data Mining to be deployed in a distributed environment, often with a great benefit for scalability. Among them, (Olejnik et al 2009) aims at solving computational issues by massive parallelism of Data Mining tasks, also introducing algorithms specifically suited to such an architecture. Similarly, in (Cheung et al 2006) the authors propose a Grid architecture for distributed Data Mining, although they mainly address the privacy issue by learning global models from local abstractions, to prevent sensitive data from being sent over the net.

In recent years, the overlap between the goals of Grid computing and of SOA based on Web services has become clear, and new solutions are emerging that combine the benefits of both, namely Service Oriented Grids, often compliant with open standards like the Open Grid Services Architecture (OGSA). Based on the OGSA standard, (Perez et al 2007) focuses on a vertical and generic Grid application which allows the execution of Data Mining workflows, mainly formed by Weka algorithms.

These 2nd-generation systems have focused on issues related to distributed execution, namely tool performance and privacy. Very little work has been done on the definition of KDD-specific support functionalities like choosing the algorithms to use, composing services and appropriately setting algorithm parameters. Hence, a 3rd generation of systems has been introduced with the aim of supporting users in the design of KDD processes in distributed and heterogeneous environments. Proposals of this generation are based on the semantic enrichment of resources.

Some research projects like Discovery Net, GridMiner and Knowledge Grid exploit resource metadata for designing distributed KDD processes over the Grid. Discovery Net (Alsairafi et al 2003) allows users to collaboratively manage remotely available data analysis software as well as data sources. In Discovery Net, processes can be described through an abstract language which relies on metadata associated to each tool. Once an abstract process is defined, it is translated into a concrete plan to be executed on specific servers, and also made available as a new service. A related recent project is GridMiner (Kickinger et al 2004), an OGSA-based system which provides a collection of KDD services, together with a set of modules for managing low-level and high-level functionalities on the Grid, such as a service broker, OGSA-DAI based data integration, orchestration and composition. Knowledge Grid, first introduced in (Cannataro and Talia 2003) and discussed in detail in (Congiusta et al 2008), represents the first attempt to build a domain-independent KDD environment on the grid. The main focus of the approach is on high-performance parallel distributed computing and on generic high-level services for knowledge management and discovery. While the high-level layer is used to create abstract execution plans, the core layer keeps metadata describing KDD objects (data and results) and manages the mapping between a workflow and the available resources for its execution over the grid. (Congiusta et al 2008) also describes Weka4WS, an implementation of the mentioned architecture in which standard Weka algorithms are available over the Grid as services, and many support services are extended to take grid peculiarities into account, for instance to allow a workflow with parallel execution of services.

Advanced support functionalities have been introduced in various proposals by means of KDD ontologies. In (Bernstein et al 2005), the ontology provides a conceptualization of DM tools' interfaces allowing the automatic composition of valid KDD processes. Since this ontology also contains information about tool performance (e.g. execution speed), the system is able to suggest more efficient workflows. Process composition is also treated in Orange4WS (Podpecan et al 2010), where a KDD ontology representing tools, datasets and models is introduced to guide a planning algorithm in the process design. Resulting workflows are written to run in the Orange platform (Demsar et al 2004), a Data Mining framework providing easy-to-use functionalities for the composition and execution of pre-defined data mining tools. A fully annotated Grid platform is proposed in (Comito et al 2006), where every resource is semantically enriched by referring both to ontologies describing the application domain and to the DAMON ontology, which is the first attempt at giving a formal representation of the DM domain. This ontology is mainly a taxonomy of DM tools and is exploited for tool retrieval. Another Grid-based system is proposed in (Yu-hua et al 2006), where the ontology is used to guide the selection of suitable processes; to this end, the ontology describes algorithm sequences and their relations with the specific application domain they are used for. A different perspective is introduced in (Panov et al 2008), where a general-purpose ontology (i.e. not conceived for achieving specific support functionalities) is proposed; hence, systems based on such an ontology can be used for different activities, but obtain less effective support in each of them.

Many other intelligent data analysis systems, which can be defined as general frameworks for workflow management, data manipulation and knowledge extraction, are described in the survey (Serban et al 2010).

Analyzing the literature, we find in the e-Science community systems similar to those described above; indeed, the KDD domain can be considered a branch of e-Science. Such systems aim at managing the enormous amounts of data produced during experiments (e.g., in biomedicine or particle physics), exploiting technological architectures like the Grid for achieving efficient and scalable computation, and with a strong focus on collaboration in distributed settings. Among the many projects, MyExperiment (De Roure et al 2009) is trying to combine computational efforts typical of the Grid with semantic technologies. In detail, it is a middleware for supporting personalized in-silico experiments, where scientists can collaboratively design, edit, annotate, execute and share workflows in the biology domain.

Although the discussed literature shares some aspects with our platform, in those systems the design of KDD services and processes is based on tools with homogeneous interfaces, which simplifies process composition and activation. In our proposal, we also deal with issues deriving from an autonomous distributed design: services to be integrated in the platform often run in various environments, typically more than one service implementing a specific algorithm has to be considered, the compatibility of service outputs has to be verified, and so forth. We introduce functionalities to build a platform supporting a KDD community, including not only analysts using services, but also developers as well as autonomous organizations hosting services on their machines.

3 The Framework

This section presents the KDDVM framework. In essence, it defines and organizes the bulk of (metadata and semantic) information needed to comply with the requirements of a KDD support system in a collaborative networked environment, and formalizes it by means of suitable technologies.

The design of the framework starts from the idea that there is no standard way to deal with a knowledge discovery problem. This fact leads to three major considerations: 1) novel tools and algorithms for data processing and analysis are continuously developed, many of them in specific domains like, e.g., life science; 2) analysts should be provided with the highest possible number of KDD tools in order to perform the work in the most effective way; and 3) analysts should be sufficiently experienced in the (combined) use of these tools. Hence, in this scenario, when a new tool is developed or a new algorithm is proposed, the organization has to invest resources in making the tool interoperable with its own analysis system and in training people in its effective use (or in hiring external consultants). Even when this kind of investment is possible, a KDD project typically involves several kinds of users, some of whom neither have a specific background in the data analysis field nor hold enough technical expertise to manage a whole project on their own. Among them, for instance, there are domain experts, who intimately know the problem and are able to assess whether the knowledge extracted at the end of the process is useful for solving it. Then, DB administrators have to gather data from databases/data warehouses and possibly perform transformation operations.

Such a scenario depicts a KDD project as a collaborative and distributed work, where several users, possibly from different organizations, make tools available and share knowledge and expertise. Our goal is to support users with such different degrees of skill and expertise, facilitating the adoption of novel tools, the choice of the "right" tools and their composition, and collaboration. Hence, we envisage the following general, non-functional requirements for a system effectively supporting KDD experimentations:

a) flexibility: the system should adapt easily to any modification in the experiment, like the changing of parameters and tools. It should grow with the organization's needs, providing mechanisms for the seamless integration of new techniques and tools;
b) transparency: tools with different and heterogeneous interfaces should be managed, hiding localization and execution technicalities from users. Most operations on data should be automated;
c) ease-of-use: the system should provide KDD-specific support to users with different skills, ranging from KDD experts to novice users and domain experts. In particular, while the principles underlying Data Mining algorithms can be expected to be known to computer scientists, this cannot be expected of domain experts, who should be supported in choosing the best tools for their goals, in setting up the right parameters, in combining different tools, in managing the whole process, in sharing domain knowledge and so forth;
d) reusability: the framework should allow available tools to be used without imposing any modification of their code, and should exploit available information to the maximum extent.

KDDVM relies on an open and modular architecture. By the term Service Oriented Architecture we refer to "a style of building reliable distributed systems that deliver functionality as services, with the additional emphasis on loose coupling between interacting services" (Treadwell 2005). Thus, as in any SOA, each KDD tool is regarded as a modular service that can be discovered and used by a client, with the following benefits: services can be used independently or integrated to provide more complex functionalities; each KDD tool is provided with a public description of its interface and other information (e.g., supported protocols, capabilities), while its implementation and other internal details are of no concern to clients and remain hidden; clients communicate with services by exchanging messages. Moreover, thanks to loose coupling, the system allows KDD services to be dynamically added and removed, their implementations to be updated, and alternative services providing the same functionality to be suggested in case of unavailability. Hence, the SOA paradigm partially addresses the previous requirements. With the aim of satisfying the requirements to a greater extent, in particular ease-of-use, we add two more levels of resource description, forming a three-layer architecture as follows:

– at the algorithm level, the resource is seen as a prototypical tool describing capabilities without any implementation detail;
– at the tool level, the specific implementation of an algorithm in a given programming language is represented;
– at the service level, the characteristics of a tool running on a server and offering its interface through standard SOA protocols are considered.

Fig. 1 Abstraction layers and related information.

This three-layer architecture possesses a hierarchical structure, in that many services can refer to the same tool and several tools can implement the same algorithm, while, conversely, each service is the deployment of a specific tool that in turn implements an algorithm. From an informative perspective, this means that all the characteristics defined at one abstraction level are inherited by the lower level(s). In turn, lower levels possess other layer-specific information, as well as specifications of the general characteristics of the upper level. To be more specific, an algorithm is described by its inputs and outputs (e.g. a dataset and the mined model), the task it is aimed at (e.g. classification), the method it uses to accomplish its task (e.g. decision tree), and some performance indexes (e.g. complexity). By contrast, a tool has its specific interface, is written in a programming language, and has an execution path and some specific performance values obtained from executions on specific datasets (e.g., the accuracy value on the Iris dataset (Frank and Asuncion 2010)). Note that the tool inherits the abstract characteristics of the algorithm it implements: for instance, if the algorithm takes a labeled dataset as input, the tool still has the same kind of input, but at this level the format of the input file is also specified (e.g. ARFF or CSV). In addition, the tool can have its own inputs or outputs, like a parameter setting the verbosity level of the output. Finally, the service level adds information about the URL and peculiar QoS indexes (e.g., availability) which depend on the properties of the server on which the service is executed and on the network status. The information related to each layer is summarized in Figure 1.
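To make the hierarchy concrete, the following minimal sketch models the three layers and their one-to-many links as plain data structures; class and field names (and the sample values) are our own illustration, not part of the platform:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Algorithm:              # algorithm level: capabilities, no implementation details
    name: str                 # e.g. "C4.5"
    task: str                 # e.g. "classification"
    method: str               # e.g. "decision tree"
    inputs: List[str] = field(default_factory=list)   # e.g. ["LabeledDataset"]
    outputs: List[str] = field(default_factory=list)  # e.g. ["DecisionTreeModel"]

@dataclass
class Tool:                   # tool level: one algorithm, possibly many tools
    algorithm: Algorithm
    language: str             # e.g. "Java"
    input_format: str         # e.g. "arff"

@dataclass
class Service:                # service level: one tool, possibly many deployments
    tool: Tool
    url: str
    availability: float       # QoS index; depends on server and network

c45 = Algorithm("C4.5", "classification", "decision tree",
                ["LabeledDataset"], ["DecisionTreeModel"])
j48 = Tool(c45, "Java", "arff")                      # a tool implementing C4.5
svc = Service(j48, "http://example.org/j48", 0.98)   # one of its deployments (hypothetical URL/QoS)

Every characteristic reachable from svc.tool.algorithm is, in effect, inherited by the service, mirroring the inheritance of characteristics across layers described above.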

The information managed at the upper layers is fundamental to satisfy the ease-of-use requirement. As a matter of fact, giving KDD-specific support calls for the formalization of the KDD principles and practices typically possessed by an expert. To make novice users capable of managing services, we formalized KDD knowledge into an ontology (KDDONTO), which allows algorithms and their properties and relations to be described in a formal fashion. The ontology allows us to define a formal common ground on the basis of which most of the support is given. This includes:

– heterogeneity management and service integration;
– enhanced service discovery and composition;
– best practices for service activation.

Details on semantic-based support functionalities are given in Section 4.

On the other hand, conceptual information is not sufficient by itself, since most of the heterogeneities appear at the tool level, when different implementations of an abstract algorithm coexist in the distributed environment. This makes the integration and reuse of tools very time consuming. For instance, before actually using a tool, a user should understand the syntax of its input and adapt his data accordingly. For this reason, in order to allow the usage of several tools in a single framework, we provide each of them with a description in a common, XML-based language. Functionalities enabled by the descriptor are discussed in Section 4.1.

In the rest of this section, a more detailed description of each layer is provided: the ontology for describing resources at the algorithm level is summarized in subsection 3.1; subsection 3.2 introduces the language used to annotate resources at the tool layer; finally, subsection 3.3 focuses on resources at the service layer.


3.1 Algorithm Layer

KDDONTO is an ontology aimed at representing the main concepts and relations of the KDD domain in a formal fashion. Among the many available methodologies for ontology building, we developed the ontology following a formal approach based on (Noy and Mcguinness 2002; Gruber 1995; Fernandez-Lopez et al 1997), taking into account several quality criteria in order to guarantee clarity, coherence, extensibility and minimality. Formality in the definition of concepts makes it possible to support inferential mechanisms that find non-explicit relations among the ontological concepts: for instance, it permits determining whether a certain kind of data is compatible with the input of an algorithm, or automatically assigning an algorithm to the right class on the basis of its relations with other concepts.

The main purpose of KDDONTO is to represent the general properties of algorithms, differing in this from other proposed ontologies, which adopt a more application-oriented perspective, as already discussed in Section 2. Tables 1 and 2 show the main classes and relations. Besides the central concept of algorithm, the ontology describes the task, the KDD phase in which the algorithm is commonly used, and the method it implements for achieving its task. Moreover, the I/O interface is described: the data required as input and yielded as output, together with the preconditions that such data must satisfy in order to be actually used. Finally, computational indexes like complexity and scalability are represented as well. Among the relations, we note that in_module, out_module and in_contrast are introduced to elicit expert knowledge about typical process composition practices. A peculiar characteristic of KDDONTO is the introduction of the part_of relation, motivated by the need to manage structured data. Finally, note that, differing from other proposals, the main concepts of algorithm, method and task are related to each other by many-to-many relations, which allows, for instance, a neural network method to be associated with classification as well as regression tasks.

To give an example, Figure 2 shows an excerpt of the ontology describing the C4.5 algorithm, which performs the classification task and generates a decision tree as predictive model. Thus, in KDDONTO it is defined as an instance of the TreeAlgorithm class. The latter is a subclass of ClassificationAlgorithm, which contains all algorithms that use a method suitable for classification. Indeed, the C4.5 algorithm produces as output a BinaryDecisionTree model, which is a DecisionTreeModel and in turn a subclass of ClassificationModel. As input, it accepts an instance of LabeledDataset. For further details about the structure of the ontology, we refer the interested reader to (Diamantini et al 2009a).

Fig. 2 An excerpt of the KDDONTO description of the C4.5 algorithm, visualized by the OntoGraf module of Protégé.
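As an illustration of how such a fragment can be handled programmatically, the following sketch expresses the C4.5 excerpt as RDF triples with rdflib; the IRIs mirror the class names in Figure 2 and are our assumptions about kdontology.owl, not its actual serialization:

# A sketch of the C4.5 fragment above expressed as triples with rdflib.
from rdflib import Graph, Namespace, RDF, RDFS

KDD = Namespace("http://kddvm.diiga.univpm.it/kdontology.owl#")  # assumed base IRI
g = Graph()
g.add((KDD.TreeAlgorithm, RDFS.subClassOf, KDD.ClassificationAlgorithm))
g.add((KDD.C45, RDF.type, KDD.TreeAlgorithm))                 # C4.5 as an instance
g.add((KDD.DecisionTreeModel, RDFS.subClassOf, KDD.ClassificationModel))
g.add((KDD.BinaryDecisionTree, RDFS.subClassOf, KDD.DecisionTreeModel))
g.add((KDD.C45, KDD.has_input, KDD.LabeledDataset))           # simplified: Table 2's has_input carries more arguments
g.add((KDD.C45, KDD.has_output, KDD.BinaryDecisionTree))
print(g.serialize(format="turtle"))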

3.2 Tool Layer

In order to describe services at the tool level, we introduced the Knowledge Discovery Tool Markup Language (KDTML). It is an XML language aimed at annotating a KDD tool through a set of metadata, describing its details in a structured fashion. Legacy KDD tools, as well as free KDD software produced by third parties and written in any programming language, can be described through a KDTML document in order to enable advanced support for sharing and integration.

The structure of a KDTML document is formed by the following sections:

1. development/execution: the tool's name, programming language, and local execution path;
2. I/O interface: the number and type of each I/O datum, and the syntax in which data must be provided;
3. algorithm: the concept and the ontology that describe the algorithm implemented by the tool;
4. tool performances: data-dependent and data-independent performance values; the former describe the tool's behavior w.r.t. previous executions (e.g., accuracy on certain datasets), while the latter refer to intrinsic properties of the tool;
5. publication: the author of the KDTML document and the publication date.

To exemplify the use of the main elements, Figure 3 shows a fragment of a KDTML document.


Table 1 KDDONTO: main classes.

– Algorithm: an algorithm for data analysis. Examples: SVM, RemoveMissingValues, PrincipalComponentAnalysis
– Method: a technique to extract knowledge. Examples: KernelMethod, RandomFill, FeatureExtraction
– Phase: a step in a KDD process. Examples: Modeling, PreProcessing, FeatureExtraction
– Task: a Data Mining task. Examples: Classification, Regression
– Data: I/O data. Examples: Dataset, LearningRate, Model
– Model: a type of I/O Data. Example: any induced classification model
– Dataset: a collection of data records. Example: the Iris dataset
– DataFeature: characteristics of data. Examples: Numeric, Literal, MissingValues, BalancedDataset, NormalizedDataset

Table 2 KDDONTO: main relations.

– uses(Algorithm, Method): the relation between an algorithm and the method it uses. Example: uses(BACKPROP, NEURALNETWORK)
– specifies_task(Method, Task): the relation between a method and a task. Examples: specifies_task(NEURALNETWORK, CLASSIFICATION), specifies_task(NEURALNETWORK, REGRESSION)
– specifies_phase(Task, Phase): relates a task to a phase. Example: specifies_phase(CLASSIFICATION, MODELING)
– has_input(Algorithm∪Method∪Task, Data, DataFeature, strength, is_parameter): an input to an algorithm, method or task. DataFeature are preconditions on the input Data; strength defines whether the precondition is mandatory (strength=1) or can be relaxed (strength<1); is_parameter defines whether the input is a parameter. Examples: has_input(SOM, UnlabeledDataset, NO_MISS_VALUE, 1, 0), has_input(SOM, VectorQuantizer, FLOAT, 0.4, 0), has_input(SOM, LearningRate, null, null, 1), has_input(CLASSIFICATION, Data)
– has_output(Algorithm∪Method∪Task, Data, DataFeature): an output of an algorithm, method or task. DataFeature are postconditions resulting from the data manipulation performed by the algorithm. Examples: has_output(SOM, VectorQuantizer, NO_LITERAL), has_output(CLASSIFICATION, Data)
– is_a(Thing, Thing): the generic subsumption relation between a class and its superclass. Examples: is_a(ClassificationAlgorithm, Algorithm), is_a(Model, Data), is_a(Dataset, Data), is_a(Parameter, Data)
– part_of(Data, Data): the relation between two data, such that the second contains the first as a subcomponent. Examples: part_of(Label, LabeledDataset), part_of(Neuron, MLP)
– in_module/out_module(Algorithm, Algorithm): an explicit suggestion (best practice) for linking two algorithms together. Examples: in_module(LVQ, SOM), out_module(SOM, DRAWVORONOIREGIONS)
– in_contrast(DataFeature, DataFeature): disjoint properties of data. Example: in_contrast(NUMERIC, LITERAL)

The described tool, namely J48weka, implements the Weka version of the C4.5 algorithm. We hasten to note the use of RDF triples to semantically annotate KDTML information. This is done in the example for the algorithm tag, in order to express the meaning of the label j48 in terms of the reference concept C4.5 of the kdontology.owl ontology. Similarly, RDF is used to link the publisher to an external URI. Each element in the descriptor is identified by an XML tag, in which the prefix kdtml:, as usual, denotes the namespace of the KDTML DTD, which defines the syntax of each structural element. Further information about each KDTML element is available in (Diamantini and Potena 2008). Here we highlight only the use of the hidden data structure, which describes an input that the software needs, but whose structure is not given explicitly on the standard input. The typical case is that of a file containing the training/testing datasets. Even if the user typically supplies only the name of this file as an input parameter, the tool reads the content of the file according to a fixed predefined structure, which has to be known by the user in order to format the file correctly. To describe the hidden structure of a parameter, a C-like I/O format is used (an ARFF file in the example) as part of a structured datum, referred to by the value of the par_ref tag.

3.3 Service Layer

In general terms, services are loosely coupled entities that encapsulate reusable functionalities and are defined by implementation-agnostic interfaces. As previously mentioned, each KDD tool in our platform has to be available on the Net as a service, according to SOA principles. In practice, actual services can be offered using several implementation technologies, the most important of which are Web Services and Service-Oriented Grids. While the former is a well-known and mature technology aimed at supporting interoperability, the latter is a rather novel form of distributed computing, in which traditional grid concepts (e.g., virtualization, collective management, adaptation, high performance) go together with emerging standards from several segments of the Web Service community. Although our approach can be considered independent of the specific technology, in our platform we chose to rely on Web Services, since they represent a more mature technology and offer a comprehensive set of full-fledged tools and solutions.


<kdtml:KDD_tool>
  <kdtml:name>j48weka</kdtml:name>
  <kdtml:description>This is a tool for generating a pruned or unpruned C4.5
  decision tree. This software is part of the WEKA project...
  </kdtml:description>
  ... Location, Development and Execution information ...
  <kdtml:input>
    <kdtml:parameters>
      <kdtml:par_simple>
        <kdtml:par_name>-t </kdtml:par_name>
        <kdtml:par_type>file</kdtml:par_type>
        <kdtml:par_ref>data_file</kdtml:par_ref>
        <kdtml:par_description>Input data training file</kdtml:par_description>
        <kdtml:par_op/>
        <kdtml:par_default>file.arff</kdtml:par_default>
      </kdtml:par_simple>
      ... list of input parameters ...
    </kdtml:parameters>
    <kdtml:data>
      <kdtml:structured>
        <kdtml:hidden/>
        <kdtml:name>data_file</kdtml:name>
        <kdtml:data>
          <kdtml:simple>
            <kdtml:hidden/>
            <kdtml:type>String</kdtml:type>
            <kdtml:name>data_file_name</kdtml:name>
            <kdtml:description>file with training data samples</kdtml:description>
          </kdtml:simple>
        </kdtml:data>
        <kdtml:txt_file_format>
          <kdtml:string>"@RELATION"\t%s\n(\n)*
          ("@ATTRIBUTE"\t%s\t(%s)|("{"(%s,)*%s"}")\n)*
          "@DATA"\n((%s,*)(%f,*)(%s\n)|(%f\n))*</kdtml:string>
        </kdtml:txt_file_format>
      </kdtml:structured>
    </kdtml:data>
    ... list of other input files ...
  </kdtml:input>
  ... list of output files ...
  <rdf:Description about=".../kdontology.owl#C45">
    <kdtml:algorithm>J48</kdtml:algorithm>
  </rdf:Description>
  <kdtml:publication>
    <rdf:Description about="http://www.diiga.univpm.it/diamantini">
      <kdtml:publisher>C. Diamantini</kdtml:publisher>
    </rdf:Description>
    <kdtml:date>22012008</kdtml:date>
    <kdtml:KDTML_version>2.1</kdtml:KDTML_version>
  </kdtml:publication>
</kdtml:KDD_tool>

Fig. 3 An example of a KDTML document: the WEKA J48 tool.

The key specifications used by Web Services are XML descriptors written according to the Web Service Description Language (WSDL), which represent attributes, interfaces and other properties. However, the WSDL 2.0 W3C Recommendation does not include semantics in the description of Web services. Given that it is widely recognized that resolving such ambiguities in Web service descriptions is an important step toward automating the discovery and composition of Web services, Semantic Annotations for WSDL and XML Schema (SAWSDL) defines mechanisms by which semantic annotations can be added to WSDL components such as I/O message structures, interfaces and operations (Farrel and Lausen 2007). In our framework, each service is described by an extended version of the SAWSDL specifications. Extending SAWSDL was necessary to represent the whole body of information available in KDTML. For this reason, we provide each service with an extended-SAWSDL document (eSAWSDL), which is fully compatible with the SAWSDL standard and adds some details, namely the specific syntax of I/O data and performance values, necessary to support tasks such as choosing the best-performing service, or finding a service whose input interface is syntactically compatible with the output interface of the service at hand.
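For illustration, the sketch below extracts the standard sawsdl:modelReference annotations from a WSDL/eSAWSDL file, mapping each annotated element to its KDDONTO concept URI; the eSAWSDL-specific extensions are not modeled here, only the standard attribute:

# Sketch: collect sawsdl:modelReference annotations from a (e)SAWSDL file.
import xml.etree.ElementTree as ET

SAWSDL_REF = "{http://www.w3.org/ns/sawsdl}modelReference"  # standard SAWSDL attribute

def concept_references(wsdl_path):
    refs = {}
    for elem in ET.parse(wsdl_path).getroot().iter():
        uri = elem.get(SAWSDL_REF)
        if uri:
            # e.g. an input element -> ".../kdontology.owl#LabeledDataset"
            refs.setdefault(elem.tag, []).append(uri)
    return refs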

4 KDDVM Platform

Given the three description layers introduced in the previous section, each computational unit in our framework is a service. Among the services offered by the platform we distinguish basic services, which provide single Data Mining and KDD functionalities, allowing data to be analyzed and transformed and knowledge to be extracted. Moreover, a set of support services provides both low-level and KDD-specific support functionalities to different categories of users. We classify users on the basis of their role: publishers, providers and consumers. The publisher is responsible for the tool description by means of the KDTML language, referring to concepts of the KDDONTO. Notice that the publisher is not necessarily the tool developer, though he should have some knowledge and skills about the published tool. The provider (i.e. a business or organization) takes charge of making the tool available as a web service on his own platform, building the eSAWSDL document, publishing the service in a registry, managing technical issues and ensuring appropriate quality of service. Finally, the consumer is the final user (i.e. domain experts, DBAs, Data Mining specialists) who uses the platform's functionalities to solve specific KDD problems.

Concerning publishers and providers, the platform exposes a set of services supporting the deployment phase, which implement all the functionalities needed for making a given KDD tool available as a basic service. In particular, such services are used to describe the tool at all layers, to make the service available on the host machine, and to publish it in a public registry. Services for the deployment phase are described in detail in subsection 4.1.

From the consumers' viewpoint, we divide support services into three main categories, on the basis of the provided functionality: discovery, composition and activation. The first kind of service supports browsing the basic services in the repository and retrieving details about each service. Composition services enable the design of a process at different abstraction levels, ranging from a process of algorithms to an executable workflow of web services. They also support the choice of the most suitable services on the basis of user requirements (e.g., useful for a certain task, executable after the service at hand, providing a certain level of performance on some kinds of problems). Finally, the third category manages the execution of the whole process as well as of a single service, and the data transfer from one service to another. These categories are described in subsections 4.2, 4.3 and 4.4, respectively.

Fig. 4 KDDVM: services supporting Discovery, Composition and Activation. The KDDDesigner as access point.

In Figure 4 we show an overview of the platform and its main support services divided by category. Following the loose-coupling principle underlying a Service Oriented Architecture, support services can be composed in order to define more complex functionalities.

Note that, without denying the principle that a Web Service can be accessed by various end-user interfaces (by using the SOAP protocol), the KDDVM platform also provides some consumer-side applications with user-friendly graphical interfaces. Technical details are available on the KDDVM project web site1, where one can access the platform and use the available applications.

1 http://kddvm.diiga.univpm.it/

4.1 Deployment

Before using the platform, services have to be published in a repository, from which final users can retrieve them. For this reason, a set of services and consumer-side applications are available for (1) transforming a KDD tool, written by a developer in any programming language, into a web service with a standard interface; (2) deploying such a service in an application server; and (3) publishing it in the common registry. As shown in Figure 5, these steps are carried out by the following applications:

– Annotation Generator (AnGen): a consumer-side application supporting a user in writing the KDTML descriptor for a given tool. By analyzing KDTML's DTD (which defines the syntactic structure), AnGen guides the user in filling in every part, thus simplifying the writing and avoiding syntax typos.
– Automatic Wrapping Service (AWS): a web service capable of transforming a tool written in any programming language into a standard web service (i.e. a basic service), by reading its KDTML descriptor. In detail, AWS encapsulates any tool, even legacy software, by building a wrapper that allows such a tool to be activated via standard web service communication protocols. This wrapper is given as Java code that will be executed on the host machine. Moreover, AWS produces the eSAWSDL descriptor and all the scripts needed for deploying the service on the publisher's server.
– Authorization Service (AuthS): a web service that enables a user both to execute a service and to access files in specific directories, on the basis of the permissions the user has.
– File Transfer Service (FTS): this web service, available on each publisher's server, allows the user to upload the code of the wrapper and the eSAWSDL document to the specific directory. Since this functionality involves writing and reading information on disk, the service makes use of the AuthS service.
– Deployment Service (DS): this web service is responsible for compiling the code of the wrapper as provided by the AWS, and for deploying the service on the server, asking permission of AuthS.
– Broker: a support service to discover and publish services. For publication, the Broker is responsible for storing information about the service inside the UDDI registry, namely generic details about the service and the publisher, and the eSAWSDL's URL. To go beyond UDDI limits we use some UDDI standard constructs to add specific eSAWSDL details, such as which algorithm is implemented by the service and information about performance values (a preliminary version using an extended WSDL is presented in (Diamantini et al 2007)).

Fig. 5 KDDVM: services for the deployment phase.

Since AuthS, FTS and DS interact with the operating system of the server, they are the only services needed for hosting a node of our platform. On the contrary, the AWS does not have to run on the server where the basic service will be deployed.

We like to note that these support services facilitate collaborative networking: by assisting the community in sharing tools and resources, they reduce the barriers to entry for new users.

In order to automate the service deployment activities, we provide WSClient, a consumer-side application that, by orchestrating the above services, automatically generates a wrapper (AWS), uploads the needed files (FTS+AuthS), deploys (DS) and publishes the service (Broker). In Figure 6 these steps are shown for the deployment of the service implementing the J48weka tool. Furthermore, the BrokerClient is made available to interact with the Broker, supporting service publishing and, in particular, service discovery as described in the following.
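The following minimal sketch illustrates the orchestration flow just described; the client objects and their operations (wrap, upload, compile_and_deploy, publish) are hypothetical stand-ins for the actual AWS, FTS, DS and Broker interfaces:

# Sketch of the WSClient orchestration flow (hypothetical client interfaces).
import pathlib
from dataclasses import dataclass

@dataclass
class DeploymentResult:
    service_url: str
    esawsdl_url: str

def deploy_tool(kdtml_path: str, aws, fts, ds, broker) -> DeploymentResult:
    descriptor = pathlib.Path(kdtml_path).read_text()
    wrapper, esawsdl, scripts = aws.wrap(descriptor)     # 1. AWS builds the wrapper
    fts.upload(wrapper, esawsdl, scripts)                # 2. FTS uploads (AuthS checks permissions)
    service_url = ds.compile_and_deploy(wrapper)         # 3. DS compiles and deploys
    esawsdl_url = broker.publish(service_url, esawsdl)   # 4. Broker publishes in the UDDI registry
    return DeploymentResult(service_url, esawsdl_url)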

4.2 Discovery

In the typical scenario, besides the publishing functionalities introduced in the previous subsection, the main goal of a service broker is to help users obtain information about services, in particular the location of the service provider hosting their descriptors.

Fig. 6 A screenshot of the WSClient, where the Weka implementation of the C4.5 algorithm is deployed.

On the one hand, the UDDI standard allows providers to describe how their services can be accessed, which information they want to make public, and which they choose to keep private. On the other hand, UDDI supplies customers with a set of standard APIs for service discovery, even if only syntactic search is enabled: as a matter of fact, users may search for services only by their names (i.e., white pages) or by a plain taxonomy (i.e., yellow pages). Furthermore, although UDDI entities are defined in XML, they lack explicit semantics, thus limiting the support that can be offered to a user in the KDD domain. In order to enhance UDDI's expressiveness, both UDDI structures and standard APIs have been extended to allow the mapping between a service and an abstract algorithm, thus improving searching capabilities. From a technical point of view, such modifications regard the introduction, for each annotated service, of a customized categoryBag structure, which introduces an "algorithm-ontology" pair representing the name of the algorithm implemented by the service and the KDDONTO URL where the algorithm is formally defined. A further extension concerns performance indexes and values, which are stored in another categoryBag and allow the search API to be used to find services whose performances are below/above certain thresholds. Since the Broker is available as a web service, its APIs can be called by any external tool, such as the BrokerClient or the KDDDesigner introduced in the next subsection. Details about the adopted UDDI structure and Broker APIs are discussed in (Diamantini et al 2007).

Fig. 7 A screenshot of the BrokerClient returning information about services implementing the C4.5 algorithm.

By means of such new features, searching capabilities move from purely syntactic to semantic-based. In order to explain what a semantic query gives to the user, let us consider the following case: a user has to build a predictive anti-spam filter, and she has at her disposal a set of previously classified emails (spam/no-spam). The user knows that spam emails are much less frequent than no-spam ones, and that classification algorithms perform badly on unbalanced data. She therefore decides to preprocess the data by undersampling the most frequent class, but she does not know what services are available in the registry. The use of standard UDDI white pages would be of no help, since she does not know the services' names. Even the use of yellow pages would lead to incomplete results: for instance, it is likely that she would not be able to find solutions like k-NN. The reason is that k-NN is a classification algorithm that is also used for undersampling (Zhou and Liu 2006). Since in a plain taxonomy only one category can be assigned to each algorithm, the best choice is to define k-NN as a classifier. On the other hand, if the taxonomy is flat and categories contain a larger number of algorithms, it is likely that many algorithms that cannot be used for undersampling are returned. The logical organization of KDDONTO, which uses relationships to link classes, allows a more faithful representation of reality by assigning k-NN to both the undersampling and classification categories. Thus more complete and accurate results are returned to the user.
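The kind of semantic query the Broker answers here can be sketched in SPARQL over KDDONTO; the property names follow Table 2, while the exact IRIs and the UNDERSAMPLING task name are our assumptions:

# Sketch of a semantic search with rdflib and SPARQL (assumed IRIs/task name).
from rdflib import Graph

g = Graph().parse("kdontology.owl")
q = """
PREFIX kdd: <http://kddvm.diiga.univpm.it/kdontology.owl#>
SELECT DISTINCT ?alg WHERE {
    ?alg    kdd:uses           ?method .
    ?method kdd:specifies_task kdd:UNDERSAMPLING .
}
"""
for row in g.query(q):
    print(row.alg)   # k-NN would be returned even though it is classed as a classifier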

After balancing the dataset, the user is interested in finding a classification service suited to her data, which have some nominal attributes and missing values. Again, standard UDDI registries would not give any support to this kind of search. Reasoning over the ontology, the Broker returns the right candidates by looking at algorithms performing the classification task whose inputs satisfy the required properties. Figure 7 shows a screenshot of the BrokerClient application: on the left, a subset of classification algorithms satisfying the user's request (i.e. able to process data with nominal and missing values) is shown, and on the bottom left some details of the service implementing the highlighted algorithm, namely the service wrapping the j48weka tool described in the previous section. In particular, note that the availability is reported as an example of a performance index.

4.3 Composition

The most important problem during process design is to generate a meaningful sequence of tools such that (a) it is able to process the user dataset, (b) it solves the user problem, and (c) it is valid and semantically correct, i.e. tools are syntactically interfaceable with each other and the use of a tool's output as input for another tool makes sense in the KDD domain.

We base our composition approach on the definition of matching criteria, which formally evaluate whether (and to what extent) an algorithm interface is compatible with a certain datum, thus allowing to verify whether (a) an algorithm can accept a certain dataset as input, (b) an algorithm yields a certain model and so satisfies a certain task, and (c) an algorithm is executable before/after another. These criteria are based on the semantics of algorithms (described in KDDONTO) and a proper conceptualization of the KDD domain, obtained by annotating each element of a process (user dataset, goal, interfaces) as follows:

(a) the user dataset is annotated with the syntactic description of the format of its content and a set of ontological terms which semantically describe its intensional properties. For instance, a labeled dataset without missing values, balanced with respect to its classes and including only float numbers will be annotated as a LabeledDataset with the property set no missing values, balanced, float;
(b) the user goal can be specified by the user as one of the KDD tasks available in the ontology (e.g., CLASSIFICATION, CLUSTERING, RULE ASSOCIATION);
(c) service interfaces are fully described, both syntactically and semantically, in the eSAWSDL descriptor, given that each input/output datum's description includes a reference to the corresponding concept in KDDONTO, as explained in 3.3.

Each algorithm takes data with certain features as input, performs some operations and returns data as output, which are then used as input for the next algorithm in the process. Thus, two algorithms can be matched if the output of the first is compatible with the input of the second. The check is performed by comparing the data being connected, not only syntactically but also semantically, with the aim of understanding whether they are conceptually equivalent, or whether the output is a subtype/subcomponent of the input. Such an operation is realized by looking for a path inside KDDONTO between the input and the output, made of the ontological relations sameAs, is_a and part_of, and by properly weighting each relation. For instance, we can feed an algorithm with a LabeledDataset even if it requires a NotLabeledDataset, because the latter is part of the former. As regards the weight of a relation, if a concept is a subclass of another it inherits all the properties of its parent, thus using a specialization instead of its generalization does not hamper service composition; on the contrary, if a part_of relation exists, a data manipulation operation has to be performed before linking the two services, hence increasing the matching weight.

According to the length of the path and the weights of each link, a matching cost is evaluated. The evaluation of the cost also takes into account the computational complexity of algorithms and the existence of preconditions on the input datum, which can be completely or partially satisfied by the output. In order to evaluate the matching cost for a given pair of algorithms, the platform includes the MatchMaker Service, which implements the matching criteria.

The composition functionalities are implemented in KDDDesigner, a web-based, blackboard-like visual tool aimed at supporting users in the complex task of composing a collaborative KDD process out of a set of KDD tools. KDDDesigner has been conceived to be the access point to the whole platform (see Figure 4). Through a search form, users can query the Broker in order to find services, which can then be dragged and dropped onto the design blackboard. When a connection between two inputs/outputs is established by the user, the MatchMaker service is called to evaluate the semantic cost of their connection, and hence to verify whether the data can actually be connected. As mentioned in 4.2, the MatchMaker is also exploited during service discovery. In fact, by selecting a service (and so its corresponding algorithm), it is possible to search for algorithms which are executable before/after the algorithm at hand, and then to look in the UDDI registry for services implementing them. To this end the Broker interacts with the MatchMaker, returning a list of useful services ranked according to their matching cost. Such functionality gives users invaluable help during composition, supporting novice users in reducing the search space and hence helping them find only those services that are compatible with the one at hand.

To provide advanced support especially for novice users, who have difficulty in choosing effective solutions for their goals, the KDDComposer is introduced: a support service aimed at generating prototypical KDD processes in a semi-automatic fashion. Through the KDDComposer, a user is only asked to provide a dataset and to specify the task to achieve; the service, by iteratively calling the MatchMaker service, yields a list of possible KDD processes (Diamantini et al 2009b). In order to improve the quality of the results, two parameters are set: the maximum length of the process in terms of algorithms (max length), which affects the process execution time; and the maximum length of a path in the ontology linking two algorithms (max distance), which affects the quality of a match by avoiding the use of concepts that are too distant. These parameters reduce the search space, hence also improving the efficiency of the whole procedure (a sketch of this iterative search is given below). The generated processes are prototypical (i.e., abstract) and hence not directly executable, because they are formed of algorithms; their aim is to provide support, describing which possible sequences of algorithms may be used to solve the problem at hand. We note that, although they are not executable, prototype processes are in themselves new patterns, which are valid (they are based on the ontology) and potentially useful; hence, a prototype process is itself a KDD outcome as defined in (Fayyad et al 1996). At present, KDDDesigner and KDDComposer are available as independent prototypes, which will be integrated in future work.
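
The sketch below illustrates such an iterative composition loop, reusing the hypothetical matching_cost function sketched above; the actual strategy of (Diamantini et al 2009b) is richer, e.g. in how matching costs are accumulated to rank the generated processes.

def compose(dataset_concept, task, algorithms, max_length=5,
            max_distance=5):
    """Backward-chain from algorithms satisfying the task towards the
    annotated dataset; algorithms: dicts with 'name', 'task', 'input',
    'output'. Returns prototypical processes as lists of names."""
    processes = []
    # Seeds: algorithms whose outcome satisfies the user task.
    stack = [[alg] for alg in algorithms if alg["task"] == task]
    while stack:
        chain = stack.pop()
        first = chain[0]
        # A chain is a valid prototype if the dataset can feed its head.
        if matching_cost(dataset_concept, first["input"],
                         max_distance) is not None:
            processes.append([alg["name"] for alg in chain])
        if len(chain) == max_length:
            continue  # bound the process length
        # Otherwise prepend every algorithm whose output matches the head.
        for alg in algorithms:
            if alg not in chain and matching_cost(
                    alg["output"], first["input"],
                    max_distance) is not None:
                stack.append([alg] + chain)
    return processes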

Coming back to our case study, let us suppose that the user does not know how to proceed after combining K-NN and C4.5. In this case, by calling the MatchMaker from the KDDDesigner, candidate services are shown to the user, such as services for performance evaluation or tree visualization. In case the user does not have the necessary knowledge to start building a suitable process, the KDDComposer is the right solution. Figure 8 shows the interface of the KDDComposer, where the user has highlighted the classification task and selected the characteristics of the dataset: nominal attributes, with missing values, and not balanced. The max length and max distance parameters have both been set to 5, a value that has experimentally been shown to produce good performance. Although in the example the dataset annotation is set manually, this activity can be supported by the DBAnalyzer, a support service capable of analyzing a generic dataset and extracting its properties.

Fig. 8 An example of request to the KDDComposer: processes performing the classification task over a labeled dataset with missing values, nominal attributes and not balanced data.

Fig. 9 An example of a prototypical process returned by the KDDComposer service.

An example of a resulting process is shown in Figure 9, where the KDDComposer suggests using in cascade: LDA to select the most important features, SpaceReduction to reduce the dimensionality of the space according to the outcome of LDA, and MetaCost as the classification algorithm. As one can notice, no algorithm for rebalancing the dataset is used in this process; in fact, MetaCost is able to transform most classification algorithms into cost-sensitive ones, hence enabling them to manage unbalanced data. As shown in the bottom left panel, MetaCost returns a decision tree as output, since our implementation of MetaCost works with C4.5, as suggested in the literature. We note that the KDDComposer has thus discovered a different approach to the classification of unbalanced data, which is novel and probably more effective. In general, the KDDComposer allows users, ranging from novices to the most expert, to perform a more systematic and effective exploration of possible solutions, while improving their skills.

Like any experimental workflow, KDD activities are iterative in nature, due to the difficulty of defining a priori the best plan to discover knowledge. This fact is recognized in all the existing process models (see e.g. (Shearer 2000)) by accounting for the need of repeated backtracking to previous steps and repetition of certain actions: the lessons learned during a step can help in recognizing errors in previous steps, or can provide knowledge to enhance their outcomes. Backtracking has to be managed in order to facilitate the comparison of different trials. To take this feature into account, the system is equipped with VerMan, a service for the management of different versions of the same process. This service interacts with the KDDDesigner by keeping track of the creation and modification of a process by the design team. VerMan also allows the team to attach an annotation explaining the changes. Every version of the process is stored in a repository hosted at the server providing the VerMan service. Functionalities enabled by VerMan include browsing the versions tree, retrieving a specific version and loading the corresponding process in the KDDDesigner. The tree structure underlying VerMan could also be exploited for the synchronization of different copies generated during process co-design, hence enabling a multi-synchronous editing environment (Dourish 1995); however, its definition and development are beyond the goal of the present work.
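
As a minimal illustration of the versions tree underlying VerMan (the data model below is hypothetical), each saved version records its parent, its author and the change annotation, so that branches created by backtracking coexist and can later be compared or reloaded:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ProcessVersion:
    process_xml: str                 # internal XML representation
    author: str
    annotation: str                  # note explaining the changes
    parent: "ProcessVersion | None" = None
    created: datetime = field(default_factory=datetime.now)
    children: list = field(default_factory=list)

    def derive(self, process_xml, author, annotation):
        """Create a child version, i.e. a modification of this one."""
        child = ProcessVersion(process_xml, author, annotation, parent=self)
        self.children.append(child)
        return child

# Backtracking: deriving two children from the same version yields two
# branches of the tree that the design team can compare.
root = ProcessVersion("<process/>", "alice", "initial design")
trial_a = root.derive("<process>...</process>", "bob", "added K-NN")
trial_b = root.derive("<process>...</process>", "carol", "added C4.5")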

4.4 Activation

Once a process made of KDD services has been produced, the user may want to execute it. A process is a workflow of services, which can be executed by means of a workflow engine able to interpret standard workflow languages (e.g. XPDL, BPMN and BPEL). Processes are currently represented in an internal XML format, and can be exported in a standard workflow language so that their execution can be managed by the WorkFlow Manager (WFM). This service extends existing engines to interpret semantic annotations and KDDVM-specific information. Before the process is executed, each service is assigned to a user who will be responsible for its execution, usually a user with some degree of knowledge of the functionalities provided by the service or of the data it handles. At present we have developed a prototype WFM that interprets processes written in XPDL. Since the KDDDesigner lets users create a process without completely wiring all the services' interfaces, when the execution of a service needs human intervention (e.g. for parameter setting) the WorkFlow Manager calls the ClientFactory. This consumer-side application exploits the semantic descriptions of the eSAWSDL and generates on the fly a graphical interface showing the user entrusted with the execution of the service every service parameter, plus other information taken from the descriptor, such as the valid range of values (min-max) for each numeric parameter, or the default value used if the user does not specify any. Figure 10 provides an example of an interface generated by the ClientFactory for parameter setting; in particular, it shows the interface for invoking the service implementing the J48weka tool.
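
The following sketch conveys the idea behind the ClientFactory, under a deliberately simplified view of the parameter metadata carried by the eSAWSDL descriptor; the field names, ranges and the textual form are our own assumptions (the real descriptor is WSDL/XML and the real interface is graphical).

# Hypothetical, simplified view of the parameter metadata that a
# descriptor could carry for a service such as the J48weka wrapper.
PARAMS = [
    {"name": "confidenceFactor", "type": "float",
     "min": 0.0, "max": 1.0, "default": 0.25},
    {"name": "minNumObj", "type": "int",
     "min": 1, "max": 100, "default": 2},
]

def prompt_parameters(params):
    """Generate a (textual) form on the fly: for each parameter, show
    the valid range and fall back to the default on empty input."""
    values = {}
    for p in params:
        raw = input(f"{p['name']} [{p['min']}-{p['max']}] "
                    f"(default {p['default']}): ").strip()
        if not raw:
            values[p["name"]] = p["default"]
            continue
        value = float(raw) if p["type"] == "float" else int(raw)
        if not p["min"] <= value <= p["max"]:
            raise ValueError(f"{p['name']} out of range")
        values[p["name"]] = value
    return values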

The use of the ClientFactory during process execution enables collaborative work, allowing users to design processes at different levels of detail. As a matter of fact, a team can either specify all the details needed to execute the process, or fix just its structure, outsourcing the work of tuning parameters to other users with more specific skills and expertise.

In the described framework, even a process can be annotated in order to support its storage and retrieval. At present we describe a process by means of its name, its authors, the team working on it, the date and time of the last edit, and textual comments, which can be useful in case of later modifications.
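
As an illustration, such metadata amounts to a handful of fields (a hypothetical rendering; the platform actually stores it within its internal XML process format):

# Hypothetical rendering of the process-level annotation used for
# storing and retrieving processes; fields mirror the ones listed above.
process_metadata = {
    "name": "churn-classification",
    "authors": ["alice", "bob"],
    "team": "dm-lab",
    "last_edit": "2011-03-14T16:20",
    "comments": "replaced K-NN with C4.5 after accuracy comparison",
}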

5 Conclusions and Future Work

This work has presented KDDVM, a novel framework and platform for KDD support in distributed collaborative environments. The framework organizes the core set of information deemed fundamental to manage the complexity of such an environment. The organization of information in abstraction layers makes it possible to exploit the most suitable technologies for its representation and exploitation; moreover, each layer uses the upper layer as a semantic reference space. The exploitation of semantic information is widely recognized as the state-of-the-art way towards a new generation of systems with advanced intelligent functionalities for collaboration in large heterogeneous, distributed environments.

The KDDVM framework and platform have been conceived to satisfy flexibility, transparency, reusability and ease-of-use requirements. Flexibility is guaranteed by the choice of developing the platform as a Service Oriented Architecture: the use of Web Service standards facilitates the integration of new tools in the platform, as well as their discovery, composition and activation. Furthermore, the semantic representation of data and services enables us to provide KDDVM with a set of support services conceived to give advanced support while hiding technical details, hence satisfying the ease-of-use and transparency requirements. Among others, the Broker provides standard and semantics-driven discovery functionalities, while the MatchMaker and the KDDComposer support the composition of services in valid and effective processes.

Fig. 10 The interface generated by the ClientFactory for setting the parameters of the service wrapping the J48weka tool.

Reusability is mainly guaranteed by the introduction of the Wrapping Service, which allows users to easily add legacy and third-party tools not specifically designed for the platform, or even for service-oriented environments. Besides support services, a set of client interfaces has been developed to facilitate the use of the platform. Among them, the KDDDesigner represents an access point to all the functionalities provided by KDDVM through a blackboard-like graphical interface, and the ClientFactory exploits the semantic descriptions contained in the eSAWSDL of a service to generate on the fly a graphical interface explaining its use.

We note that the principles adopted in the design of KDDVM greatly improve the perceived quality of an application, for several reasons: greater sharing and accessibility for end users, reduced overhead of software installations, reduced cost for the occasional user of the software, and increased user mobility. As concerns the limits of the adopted framework, it has to be pointed out that semantic annotation is critical: the quality of the provided support greatly depends on the quality of the semantic representation of resources. To contain this criticality, it is important that the publisher be a domain expert with skills on the tool at hand. Furthermore, the exploitation of SOA may introduce some critical factors, such as a possible decrease of privacy and security, because data are processed remotely, and the need for sufficient bandwidth to operate with large datasets. However, these aspects can be considered technical issues, which can be properly faced by adopting up-to-date standards for secure transmission, and by defining proper access control and hardware sizing.

Finally, the evaluation of the ease-of-use requirement deserves special consideration. If, on the one hand, the analysis of user feedback can help to evaluate and improve the whole platform, on the other hand the large number of functionalities provided by the platform, and of ways of combining them, requires that a systematic usability study be accurately designed. At present, the use of the platform has been proposed to a group of graduate students in computer science during their first course on KDD. Positive feedback has been returned about the usability of the semantically enhanced Broker with respect to traditional web service discovery, as it improves the precision and recall of search results. Also, users have greatly appreciated the acceleration given by the ClientFactory and the KDDDesigner to the setting up of an experiment, as they avoid errors and make suggestions in parameter setting and process composition, while providing a unique interface for any service. We plan to perform a systematic usability study as a further refinement step of the research.

In the future we also plan to extend the functionalities for the sharing and reuse of workflows, by enriching their semantic description and by applying a clustering technique we have developed to organize the workflow repository into groups of similar workflows, so as to enhance search and to discover typical usage patterns. We also plan to launch a widespread experimental campaign to increase the number of services and of different users, so as to stress the heterogeneity of the repository.

References

Ali AS, Rana OF, Taylor IJ (2005) Web services composition for distributed data mining. In: Proc. of the International Conference on Parallel Processing Workshops, pp 11–18

Alsairafi S, Ghanem M, Giannadakis N, Guo Y, Kalaitzopoulos D, Osmond M, Rowe A, Syed J, Wendel P (2003) The design of Discovery Net: Towards open grid services for knowledge discovery. International Journal of High Performance Computing Applications 17(3):297–315

Bernstein A, Provost F, Hill S (2005) Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Transactions on Knowledge and Data Engineering 17(4)

Cannataro M, Talia D (2003) The knowledge grid. Commun ACM 46(1):89–93

Cheung WK, Zhang XF, fai Wong H, Liu J, Luo ZW, Tong FCH (2006) Service-oriented distributed data mining. IEEE Internet Computing 10:44–54

Comito C, Mastroianni C, Talia D (2006) Metadata, ontologies and information models for resource management in grid-based PSE toolkits. International Journal of Web Services Research 3(4):52–72

Congiusta A, Talia D, Trunfio P (2008) Service-oriented middleware for distributed data mining on the grid. Journal of Parallel and Distributed Computing 68(1):3–15

De Roure D, Goble CA, Stevens R (2009) The design and realisation of the myExperiment virtual research environment for social sharing of workflows. Future Generation Computer Systems 25(5):561–567

Demsar J, Zupan B, Leban G, Curk T (2004) Orange: From experimental machine learning to interactive data mining. In: Boulicaut JF, Esposito F, Giannotti F, Pedreschi D (eds) Knowledge Discovery in Databases: PKDD 2004, LNCS, vol 3202, Springer, pp 537–539

Diamantini C, Potena D (2008) Semantic annotation and services for KDD tools sharing and reuse. In: Proc. of the 8th IEEE International Conference on Data Mining Workshops, 1st Int. Workshop on Semantic Aspects in Data Mining, Pisa, Italy, pp 761–770

Diamantini C, Potena D, Cellini J (2007) UDDI registry for knowledge discovery in databases services. In: Proc. of the International Symposium on Collaborative Technologies and Systems, IEEE, Orlando, FL, USA, pp 321–328

Diamantini C, Potena D, Storti E (2009a) KDDONTO: an ontology for discovery and composition of KDD algorithms. In: Proc. of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery, Bled, Slovenia, pp 13–24

Diamantini C, Potena D, Storti E (2009b) Ontology-driven KDD process composition. In: Adams N et al (eds) Advances in Intelligent Data Analysis VIII, Proc. of the 8th International Symposium on Intelligent Data Analysis, LNCS, vol 5772, Springer, Lyon, France, pp 285–296

Dourish P (1995) The parting of the ways: Divergence, data management and collaborative work. In: Proc. of the fourth European Conference on Computer-Supported Cooperative Work, Stockholm, Sweden, pp 215–230

Farrel J, Lausen H (2007) Semantic annotations for WSDL and XML Schema, W3C recommendation. http://www.w3.org/TR/sawsdl/

Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From Data Mining to Knowledge Discovery: an Overview. American Association for Artificial Intelligence, Menlo Park, CA, USA, pp 1–34

Fernandez-Lopez M, Gomez-Perez A, Juristo N (1997) Methontology: from ontological art towards ontological engineering. In: Proc. of the AAAI97 Spring Symposium Series on Ontological Engineering, Stanford, USA, pp 33–40

Frank A, Asuncion A (2010) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

Gruber TR (1995) Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies 43(5-6):907–928

Guedes D, Meira W, Ferreira R (2006) Anteater: A service-oriented architecture for high-performance data mining. IEEE Internet Computing 10:36–43

Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explorations Newsletter 11(1):10–18

Kickinger G, Hofer J, Brezany P, Tjoa AM (2004) Grid knowledge discovery processes and an architecture for their composition. In: Proc. of the IASTED International Conference on Parallel and Distributed Computing and Networks, Innsbruck, Austria

Kumar A, Kantardzic MM, Ramaswamy P, Sadeghian P (2004) An extensible service oriented distributed data mining framework. In: Proc. of the International Conference on Machine Learning and Applications, pp 256–263

Majithia S, Shields MS, Taylor IJ, Wang I (2004) Triana: A graphical web service composition and execution toolkit. In: Proc. of the IEEE International Conference on Web Services, pp 514–521

Noy NF, Mcguinness DL (2002) Ontology Development 101: A Guide to Creating Your First Ontology. Stanford University

Olejnik R, Fortis TF, Toursel B (2009) Webservices oriented data mining in knowledge architecture. Future Generation Computer Systems 25(4):436–443

Panov P, Dzeroski S, Soldatova LN (2008) OntoDM: An ontology of data mining. In: Proc. of the 8th IEEE Int. Conf. on Data Mining Workshops, 1st Int. Workshop on Semantic Aspects in Data Mining, pp 752–760

Park BH, Kargupta H (2003) Distributed data mining: Algorithms, systems, and applications. In: Ye N (ed) The Handbook of Data Mining, Routledge, pp 341–358

Pedrinaci C, Domingue J (2010) Web services are dead. Long live internet services. Tech. rep.

Perez MS, Sanchez A, Robles V, Herrero P, Peña JM (2007) Design and implementation of a data mining grid-aware architecture. Future Generation Computer Systems 23(1):42–47

Podpecan V, Zakova M, Lavrac N (2010) Workflow construction for service-oriented knowledge discovery. In: Margaria T, Steffen B (eds) Leveraging Applications of Formal Methods, Verification, and Validation, LNCS, vol 6415, Springer, pp 313–327

Sarawagi S, Nagaralu SH (2000) Data mining models as services on the internet. SIGKDD Explor Newsl 2(1):24–28

Serban F, Kietz JU, Bernstein A (2010) An overview of intelligent data assistants for data analysis. In: Proc. of the 3rd Planning to Learn Workshop at ECAI 2010, pp 7–14

Shearer C (2000) The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing 5(4):13–22

Treadwell J (2005) Open Grid Services Architecture Glossary of Terms. GGF Document GFD-I.044

Tsai CY, Tsai MH (2005) A dynamic web service based data mining process system. In: Proc. of the 5th International Conference on Computer and Information Technology, pp 1033–1039

Yu-hua L, Zheng-ding L, Xiao-lin S, Kun-mei W, Rui-xuan L (2006) Data mining ontology development for high user usability. Wuhan University Journal of Natural Sciences 11(1):51–56

Zhou ZH, Liu XY (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans on Knowledge and Data Engineering 18(1):63–77