Ontology-based Recommender for Distributed Machine Learning Environment
Daniel Pop, Caius Bogdanescu
Faculty of Mathematics and Computer Science
West University of Timisoara
Timisoara, Romania
Email: {danielpop, caius.bogdanescu}@info.uvt.ro
Abstract—Domain experts in different areas have a large number of options for approaching their specific data analysis problem. In the exploration of large data sets on HPC systems, choosing which method to use, or how to tune the parameters of an algorithm to achieve good results, are challenging tasks even for data analysts themselves. In this paper, we propose a recommendation module for a distributed machine learning environment aimed at helping end-users obtain optimized results for their data analysis / machine learning problem.
I. INTRODUCTION
Given the enormous growth of collected and available data in companies, industry and science, techniques for analyzing such data are becoming ever more important. Today, the data to be analyzed is no longer restricted to sensor data and classical databases; more and more it includes textual documents and webpages (text mining, Web mining), spatial data, multimedia data and linked data (molecules, social networks). Analytics tools allow end-users to harvest the meaningful patterns buried in large volumes of structured and unstructured data; analyzing big datasets gives users the power to identify new revenue sources, develop loyal and profitable customer relationships, and run organizations more efficiently and cost effectively.
Research in knowledge discovery and machine learning combines classical questions of computer science (efficient algorithms, software systems, databases) with elements from artificial intelligence and statistics, up to user-oriented issues (visualization, interactive mining). Traditional, relational-model oriented approaches, such as Teradata, Oracle or Netezza, provide means to realize parallel implementations of ML-DM (Machine Learning and Data Mining) algorithms, but there are a few issues with this line of work: (1) expressing ML-DM algorithms in SQL code is a complex task and the result is difficult to maintain; (2) large-scale installations of these products are expensive; (3) the nature of data is shifting away from structured to un- (or semi-) structured data: while structured data follows a near-linear growth, unstructured data (e.g. audio and video) and semi-structured data (e.g. Web traffic data, social media content, sensor-generated data) exhibit an exponential growth (source: IDC Digital Universe 2009). The exploration of large, unstructured data sets on HPC systems is enabled by emergent technologies (NoSQL data stores, MapReduce and distributed file systems) that generated novel approaches and solutions to machine learning and data mining problems. In a recent report [1], the author reviews the most recent developments in this field, categorizes them into five distinct classes and points out some of their common limitations:
• lack of responsiveness – most systems do not offer end-users feedback concerning the progress of the launched tasks,
• lack of adequate customization – surveyed systems either target expert users, for whom they offer low-level details and parametrization capabilities, or novice users, who are left with no tweaking possibilities, all parameters being magically tuned by the system,
• lack of recommendations – derived from the same polarization described above, end-users are not supported by intelligent, self-learning systems able to offer recommendations for the problems to be solved.
Our current work focuses on the architecture, design and implementation of a scalable, easy to use and deploy solution for ML-DM in the context of the distributed computing paradigm, targeting end-users with little programming or statistical experience who are nevertheless willing to run and tweak advanced scientific ML tasks, such as researchers and practitioners from fields like medicine, finance, telecommunications etc. It is a distributed system relying on existing distributed ML-DM frameworks, but enhancing them with user-centric features. The focus of this paper is the recommendation module, designed to support end-users in creating a data mining scenario.
The remainder of the paper is structured as follows: the next section briefly presents similar works and how our prototype relates to them, followed by Section III, which describes a typical usage scenario of the application. The overall architecture of the system is addressed in Section IV. Section V is devoted to an in-depth presentation of the recommendation module, while the last section presents the most important findings learned so far during the work on this prototype and our future plans.
II. RELATED WORKS
Existing programming paradigms for expressing large-scale parallelism, such as MapReduce (MR) and the Message Passing Interface (MPI), are the de facto choices for implementing ML-DM algorithms. More and more interest has been devoted to MR due to its ability to handle large datasets and its built-in resilience against failures. There are several recent papers
and books overviewing distributed and cloud-based solutions,
such as [2], [3], [4], [5], [1]. In [1], the author classifies them
in five categories: ML environments from the cloud, plugins
for ML tools, distributed ML libraries, complex ML systems
and SaaS providers for ML. Our work is related to the last three categories, as we use distributed ML libraries to build a complex ML system whose services are offered based on a SaaS model.
Distributed ML libraries, such as Apache Mahout™ (http://mahout.apache.org) [6], GraphLab (http://graphlab.org) [7], DryadLINQ (http://research.microsoft.com/en-us/projects/DryadLINQ/) [8], [9], Jubatus (http://jubat.us/) [10], NIMBLE [11] or SystemML [12], are collections of complex ML methods and algorithms running on different distributed setups (Hadoop, Dryad, SciDB, MPI). They allow users to run out-of-the-box algorithms, or to implement their own, in parallel mode over a cluster of computers. These solutions do not integrate, nor use, statistics/mathematics software; rather, they offer self-contained packages of optimised, state-of-the-art ML-DM methods and algorithms.
In our system, we are going to integrate Apache Mahout as a "provider" of ML algorithms because of its advantages compared to other libraries: it is open-source, technologically compatible with our approach (Apache Hadoop [13] and Java), backed by an active and growing community contributing to a steady evolution of the library, and easy to use and integrate.
The complex ML systems category encompasses solutions for business intelligence and data analytics that are deployable either on on-premise or in-the-cloud clusters, provide a rich set of graphical tools to analyse, explore and visualize large amounts of data, and utilize Hadoop as processing engine and storage environment. They differ in how data is integrated and processed, in the supported data sources, and in the overall complexity of the system. Among the most popular we can mention Kitenga Analytics [14], Pentaho Business Analytics (http://www.pentaho.org), Platfora (http://platfora.com), Skytree Server (http://skytree.net), InsightsOne (http://insightsone.com/), Kognitio (http://kognitio.com) or Wibidata (http://wibidata.com) [15]. In addition to these commercial solutions, there are also software platforms developed by academic research groups, such as the Data Mining Cloud Framework [16], a software framework that enables the definition and execution of KDD workflows of services (e.g. model creation service, data storage service etc.) on top of Cloud computing and storage environments.
Our vision is to build a system that falls into this category, but one that is open-source and more focused on knowledge discovery and ML techniques rather than on advanced business analytics, which is already well covered by existing systems.
Software as a Service (SaaS) providers for ML offer their services using a rich graphical user interface and/or RESTful APIs for programmatic access. The hosting platform is usually powered by cloud providers, but in some (rare) cases the solution may also be installed on-premise (Myrrix, http://myrrix.com). They mainly provide predictive modelling methods; examples of such providers are BigML (http://bigml.com), Google Prediction API (https://developers.google.com/prediction/) and Eigendog (https://eigendog.com/#home). We did not consider here providers of SQL-over-Hadoop solutions (e.g. Cloudera Impala, Hadapt, Hive) because their main target is not ML-DM, but rather fast, elastic and scalable SQL processing of relational data using the distributed architecture of Hadoop.
This last class of products inspired us in the way we offer our services, i.e. using a SaaS model.
In [17], the authors present a distributed ML system that also aims at making ML-DM more accessible to non-experts. An ML task is transformed into a learning plan that is further processed by an optimizer, which tries to first deliver a quality answer to the user and then, in parallel, to iteratively improve the result in the background.
Modelling the DM domain (data types, algorithms, models etc.) using ontologies has been addressed by the OntoDM project [18] for the past five years. OntoDM is a deep ontology for data mining that includes definitions of basic data mining entities, such as data type and dataset, data mining task, data mining algorithm and components thereof (e.g., distance function). Our ontology (ML-Ontology) is more focused on expressing data mining constraints and functional rules, which form the core of our recommendation module. It also defines the common vocabulary shared by the different components of the system.
III. TYPICAL USAGE SCENARIO
In this section we describe a simple usage scenario of our prototype. We start by noting that the end-user's interaction with the system is completely managed by the UI service, which acts as a graphical interface for all other services of the system, although these are also exposed to programmers and other consumers via RESTful APIs.
In the first step, the end-user has to select a data source
with the help of services offered by the Data Source Manager
(all services are briefly described in section IV). Next, the
user selects the type of ML model he/she wants to build
out of the dataset. For example, the user wants to build a
predictive model for a classification task. At this point, the
recommendation module will retrieve the meta-data associated
with the dataset and it will recommend the next steps of the
process. For example, perform a normalization of numerical
attributes using Standard score, followed by the training of a
SVM classifier using SMO (Sequential Minimal Optimization)
algorithm. The user may accept the suggestions and adjust the parameters accordingly, or may choose a different method than the suggested one. Figure 1 shows the user interface of the system at this point. In the next phase, the execution plan for our problem is submitted to the Execution subsystem, more precisely to the Hadoop Job Submitter service, which will further schedule and monitor its execution on the cluster.
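For illustration only, the recommended plan above corresponds, in single-node Weka terms, to a standardization filter (Standard score) followed by SMO training. The following minimal sketch assumes a local ARFF file and that the class is the last attribute; in the deployed system the data comes from HDFS and the training is executed by the distributed ML library, not by a local Weka call.

import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Standardize;

public class RecommendedPlanSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical dataset path; the real system reads the data from HDFS
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Step 1 recommended by the module: Standard score normalization of numeric attributes
        Standardize standardize = new Standardize();
        standardize.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, standardize);

        // Step 2: train an SVM classifier using the SMO algorithm
        SMO smo = new SMO();
        smo.buildClassifier(normalized);
        System.out.println(smo);
    }
}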
The next section briefly presents the services of the system and their roles.
Figure 2. Overall architecture of the system
Figure 1. Snapshot of user interface
IV. ARCHITECTURE OF THE SYSTEM
Our system is conceived as a Service Oriented Architecture,
inter-service communication being handled via RESTful APIs.
The system is split into two subsystems with complementary
responsibilities: Problem Definition subsystem and Execution
subsystem (see Figure 2), plus the UI (User Interface) service
that facilitates the end-users’ interaction with the system.
Problem Definition subsystem offers its functionality with
the help of the following services:
(1) Data Source Manager – is responsible for defining sources of data in the system. Actors (end-users, or other subsystems) are able to create a new data source, or to remove an existing one. A new data source can be created either based on a user dataset uploaded from the local computer to the cloud, or by linking to a dataset already available at a URL (e.g. http://, s3://, azure:// etc.). The data is stored on the Hadoop Distributed File System (HDFS, http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html), managed by the Data Storage service.
(2) Data Storage – is responsible for accessing data on
storage infrastructure. In the first prototype of the system, we are targeting an HDFS installation. The Data Storage
service offers an optimized, easy to use interface to
store or retrieve datasets or results produced during the
execution of machine learning processes.
(3) Meta-data Storage – is responsible for storing and accessing meta-data about data sources; it is implemented using a document database, such as MongoDB (a minimal sketch of such a meta-data record is given after this list).
(4) Recommendation Module – proposes various recommendations to end-users in terms of the appropriate learning methods to use, or how to tweak the algorithms' parameters for quicker convergence on each new data set. More details about this service are given in Section V.
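The concrete schema of the meta-data records is not prescribed here; as a minimal sketch, assuming a MongoDB collection named datasets and illustrative field names that mirror the meta-attributes discussed in Section V, the Meta-data Storage service could persist a record as follows (connection string, database, collection and field values are assumptions):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MetaDataStorageSketch {
    public static void main(String[] args) {
        // Hypothetical connection string, database and collection names
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> datasets =
                    client.getDatabase("metadata").getCollection("datasets");
            // Illustrative meta-attributes of one data source (see Section V)
            Document meta = new Document("datasetId", "customer-churn")
                    .append("hdfsPath", "hdfs:///data/customer-churn.csv")
                    .append("numberOfAttributes", 20)
                    .append("numberOfInstances", 100000)
                    .append("missingValuesRatio", 0.02)
                    .append("discreteAttributesRatio", 0.6);
            datasets.insertOne(meta);
        }
    }
}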
Execution subsystem is composed of the following web
services:
(1) Hadoop Job Submitter – is responsible for monitoring and scheduling jobs over multiple Apache Hadoop installations. In a generalized scenario, our system is able to manage two or more installations of the Apache Hadoop framework, for example some of them available from public cloud providers, while others are available in on-premises data centers. Each installation is managed by a Hadoop Job Controller that interacts with the Hadoop Job Submitter.
(2) Hadoop Job Controller – is a wrapper servicing each Hadoop cluster; it monitors and controls (stop, resume, start) jobs on the cluster. Monitoring is implemented using a predefined set of Hadoop counters, and the service is able to deliver these data to other services, for example the Hadoop Job Submitter.
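The concrete set of counters is not listed here; purely as an illustration, a Hadoop Job Controller could report progress and one predefined counter for an already submitted job through the standard MapReduce client API roughly as follows (the helper name and the chosen counter are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobMonitorSketch {
    // Hypothetical helper: report the status of an already submitted job
    public static void report(String jobIdString) throws Exception {
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName(jobIdString));
        if (job == null) {
            System.out.println("job not found: " + jobIdString);
            return;
        }
        System.out.printf("map %.0f%%, reduce %.0f%%%n",
                job.mapProgress() * 100, job.reduceProgress() * 100);
        // Example of a counter that could be forwarded to the Hadoop Job Submitter
        long mapOutputRecords = job.getCounters()
                .findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
        System.out.println("map output records: " + mapOutputRecords);
    }
}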
The User Interface service (Figure 1) supports end-users’
interaction with the system. It delivers all the services offered
by the system in an easy, accessible, clear way to end-users.
Essentially, it is a Web application developed using a mix
of client- and server-side technologies following the MVC
(Model-View-Controller) architectural pattern.
V. RECOMMENDATION MODULE
The purpose of the recommendation module is to assist end-users in the process of designing their ML-DM task. It offers recommendations for selecting the most appropriate methods for a task, or for choosing the algorithm that will produce results closest to the user's needs. It is a rule-based expert system modelled as an ontology augmented with SWRL-based rules.
The ML-Ontology, described in the next sub-section, uses the concept of problem to represent a DM type of problem (such as classification, clustering or association analysis), a task (such as normalization, or building the model), or the tweaking of the parameters of an algorithm. For each of these problems, the system generates recommendations based on the current evaluation context, which is composed of the meta-attributes of the dataset under analysis plus the end-user's preferences.
For each dataset, a set of meta-attributes is defined, such as:
• number of attributes (variables)
• number of instances (examples)
• missing values ratio (= number of instances with missing
values / total number of instances)
• discrete attributes ratio (= number of discrete attributes /
total number of attributes)
For structured data, these indicators can be computed, more or less, automatically. For very large structured datasets, they are estimated on a randomly selected subset of the original dataset, since these estimates should suffice for recommendation purposes. In the case of semi-structured or unstructured data, the meta-attributes have to be provided by end-users.
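As an illustration, for a structured dataset loaded into a Weka Instances object, the meta-attributes listed above could be computed as sketched below; the file name is an assumption, and in the running system the data would be read through the Data Storage service instead:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetaAttributeSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // hypothetical path

        int numberOfAttributes = data.numAttributes();
        int numberOfInstances = data.numInstances();

        // missing values ratio = instances with at least one missing value / total instances
        int withMissing = 0;
        for (int i = 0; i < numberOfInstances; i++) {
            if (data.instance(i).hasMissingValue()) withMissing++;
        }
        double missingValuesRatio = (double) withMissing / numberOfInstances;

        // discrete attributes ratio = discrete (nominal) attributes / total attributes
        int discrete = 0;
        for (int j = 0; j < numberOfAttributes; j++) {
            if (data.attribute(j).isNominal()) discrete++;
        }
        double discreteAttributesRatio = (double) discrete / numberOfAttributes;

        System.out.printf("attributes=%d, instances=%d, missing=%.2f, discrete=%.2f%n",
                numberOfAttributes, numberOfInstances,
                missingValuesRatio, discreteAttributesRatio);
    }
}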
End-user preferences, on the other hand, are inputs given by the end-user. For example, in the case of a classification task we can consider the following parameters (all of them ranging from 1 - lowest to 5 - highest):
• readability of the model – how easy it is for humans to interpret the inferred model; some predictive models, such as a classification tree, are clearer compared to others (e.g. a neural network)
• speed of learning – ability to adjust the model learning effort
• accuracy – ability to adjust the model accuracy
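Before reasoning, these preferences and the dataset meta-attributes have to be asserted into the evaluation context. The exact mechanism is internal to the recommendation module; a minimal sketch using the OWL API and the property names that appear later in Figure 5 (the individual names, namespace IRI and file name are assumptions) could look like this:

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import java.io.File;

public class EvaluationContextSketch {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology =
                manager.loadOntologyFromOntologyDocument(new File("ml-ontology.owl"));
        OWLDataFactory df = manager.getOWLDataFactory();
        String ns = "http://example.org/ml-ontology#"; // hypothetical namespace

        // Hypothetical individuals holding the current evaluation context
        OWLNamedIndividual prefs = df.getOWLNamedIndividual(IRI.create(ns + "currentPreferences"));
        OWLNamedIndividual dataset = df.getOWLNamedIndividual(IRI.create(ns + "currentDataset"));
        manager.addAxiom(ontology, df.getOWLClassAssertionAxiom(
                df.getOWLClass(IRI.create(ns + "ClassificationPreferences")), prefs));
        manager.addAxiom(ontology, df.getOWLClassAssertionAxiom(
                df.getOWLClass(IRI.create(ns + "Dataset")), dataset));

        // End-user preferences (1 - lowest to 5 - highest) and dataset meta-attributes,
        // asserted with the data property names shown in Figure 5
        manager.addAxiom(ontology, df.getOWLDataPropertyAssertionAxiom(
                df.getOWLDataProperty(IRI.create(ns + "readability")), prefs, 4));
        manager.addAxiom(ontology, df.getOWLDataPropertyAssertionAxiom(
                df.getOWLDataProperty(IRI.create(ns + "speedOfClassification")), prefs, 5));
        manager.addAxiom(ontology, df.getOWLDataPropertyAssertionAxiom(
                df.getOWLDataProperty(IRI.create(ns + "numberOfDiscreteAttributes")), dataset, 12));
        manager.addAxiom(ontology, df.getOWLDataPropertyAssertionAxiom(
                df.getOWLDataProperty(IRI.create(ns + "numberOfContinuousAttributes")), dataset, 8));
    }
}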
How does the recommendation module work? Firstly, the meta-attributes of the dataset are retrieved using the Meta-data Storage service; next, this information, together with the end-user's inputs, is fed into the ontological reasoner, which, using the ML-Ontology, infers appropriate suggestions for the next step; these inferred suggestions are consumed by the UI service, which guides the user in the process of constructing a DM-ML task. The recommendations (see the next sub-section for an example) define the next steps of the process (e.g. a normalization followed by constructing the model), or the list of possible values for a user input (e.g. what algorithms are available for constructing a classification tree). This process is iterated until no more steps are left. The recommendation module exposes its functionality via a RESTful API that is used by the other services of the system (e.g. the UI component).
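The shape of this RESTful API is not detailed here; a purely hypothetical sketch of how the UI service (or any other consumer) might request recommendations, assuming a /recommendations endpoint, a JSON payload and a local deployment, is given below:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RecommendationClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and payload; the actual API of the module may differ
        String body = "{ \"problem\": \"classification\", \"datasetId\": \"customer-churn\","
                + " \"preferences\": { \"readability\": 4, \"speedOfClassification\": 4 } }";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/recommendations"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. a list of recommended next steps
    }
}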
A. ML-Ontology
The ML-Ontology is the core component of the recommendation module. It is built following a problem decomposition strategy, where a problem (represented by the Problem class) either represents a type of DM problem (such as classification, clustering or association analysis), a method (such as normalization, or building a model), or an algorithm. Complex problems are broken into sub-problems using the hasChildren object property. Specializations of the Problem class – Method, Algorithm and *Preferences (where * stands for Classification, Clustering or another type of DM-ML problem) – are used to distinguish between the different types of problems. Figure 3 shows the main concepts of the ontology and their individuals. For example, classification, clustering, buildModel or normalization are all individuals of type Problem or of its specializations.
The problem decomposition is implemented using the hasChildren generic object property. For example, the classification individual, which represents a standard classification task, is in a hasChildren relation with the normalization, buildModel and validateModel individuals.
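The paper does not prescribe how the ontology is queried programmatically; as a sketch, assuming the ontology is stored in an OWL file and queried through the OWL API with the HermiT reasoner (file name and namespace IRI are assumptions), the children of the classification problem could be retrieved like this:

import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import java.io.File;

public class OntologyQuerySketch {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology =
                manager.loadOntologyFromOntologyDocument(new File("ml-ontology.owl"));
        OWLDataFactory df = manager.getOWLDataFactory();

        String ns = "http://example.org/ml-ontology#"; // hypothetical namespace
        OWLNamedIndividual classification = df.getOWLNamedIndividual(IRI.create(ns + "classification"));
        OWLObjectProperty hasChildren = df.getOWLObjectProperty(IRI.create(ns + "hasChildren"));

        // Reasoner over the ontology; inferred recommend facts are retrieved the same way
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(ontology);
        reasoner.getObjectPropertyValues(classification, hasChildren)
                .getFlattened()
                .forEach(child -> System.out.println(child.getIRI()));
    }
}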
In order to represent different 'solutions' of a problem, for example that method_classificationTree is implemented by the alg_C4.5, alg_J4.8, alg_RandomForest or alg_ADT algorithms, the object property hasOptions has been introduced with its subtypes: implementedBy, solvedBy and recommend. Figure 4 presents the details of the alg_SMO algorithm. One can note that this algorithm is an SVM learner (method_SVM is implementedBy this individual), it has three parameters represented as data property assertions (param_cacheSize, param_epsilon, param_buildLogisticModels) and it is exposedBy the Weka library. A Library is a collection of ML algorithms, and each algorithm may be implemented by two or more ML libraries.
Figure 3. ML-Ontology: concepts and individuals

The recommender system relies on the ML-Ontology's ability to infer new facts (reasoning), but the characteristics of the dataset and the end-user preferences need to be considered as well. In our approach, we use a set of rules expressed in the Semantic Web
Rule Language (SWRL) [19], built on top of the static individuals and relations described so far, to ensure context-aware reasoning. The expert knowledge encoded in these rules was derived from empirical studies, such as [20]. For example, Figure 5 shows the SWRL encoding of the following rule: "for a classification problem, if the speed of classification is important, the readability of the model is important as well and the dataset contains both numerical and discrete attributes, then the recommended classifier is a Classification Tree". recommend is an object property, a subtype of hasOptions, used to represent facts inferred by SWRL rules.
The problem decomposition approach allows us to uniformly handle recommendations for tasks, methods, or algorithms.

Figure 4. ML-Ontology: algorithm details

ClassificationPreferences(?m), Dataset(?d),
numberOfContinuousAttributes(?d, ?noca),
numberOfDiscreteAttributes(?d, ?noda),
speedOfClassification(?m, ?soc),
readability(?m, ?r),
integer[>= 4](?soc), integer[>= 4](?r),
integer[> 0](?noca), integer[> 0](?noda)
-> recommend(buildModel, method_classificationTree)

Figure 5. Method recommendation
VI. CONCLUSIONS AND FUTURE PLANS
This paper presents the overall architecture of a distributed system for machine learning / data mining problems that combines the semantics encoded in the ML-Ontology with distributed processing frameworks, such as Apache Hadoop. The focus of the paper is on the recommendation module, whose purpose is to support and guide end-users in their ML-DM scenarios.
The recommendation module is modeled as a rule-based system. We are planning to keep track of end-user feedback on the proposed suggestions. As the system accumulates more and more end-user feedback, we will consider changing the recommendation model from an expert approach to a collaborative filtering approach.
In the future, we will continue the development of the remaining services of the system, and we will run experiments and real-life use cases to evaluate its performance and overall user acceptance level.
ACKNOWLEDGMENTS
This work was supported by EC-FP7 project FP7-REGPOT-
2011-1 284595 (HOST).
REFERENCES
[1] D. Pop, "Machine learning and cloud computing: Survey of distributed and SaaS solutions," Institute e-Austria Timisoara, Tech. Rep. 2012-1, December 2012.
[2] R. Bekkerman, M. Bilenko, and J. Langford, Eds., Scaling up Machine Learning. Cambridge University Press, 2012. [Online]. Available: http://people.cs.umass.edu/~ronb/scaling_up_machine_learning.htm
[3] S. Charrington, "Three new tools bring machine learning insights to the masses," Read Write Web, February 2012. [Online]. Available: http://www.readwriteweb.com/hack/2012/02/three-new-tools-bring-machine.php
[4] W. Eckerson, "New technologies for big data," 2012. [Online]. Available: http://www.b-eye-network.com/blogs/eckerson/archives/2012/11/new_technologie.php
[5] D. Harris, "5 low-profile startups that could change the face of big data," 2012. [Online]. Available: http://gigaom.com/cloud/5-low-profile-startups-that-could-change-the-face-of-big-data/
[6] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action. Manning Publications, 2011.
[7] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," in Proc. of the 38th Intl. Conf. on Very Large Databases (VLDB), vol. 5, no. 8, August 2012.
[8] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language," in Proc. of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI). USENIX Association, Berkeley, 2008, pp. 1–14.
[9] M. Budiu, D. Fetterly, M. Isard, F. McSherry, and Y. Yu, "Large-scale machine learning using DryadLINQ," in Scaling up Machine Learning, R. Bekkerman, M. Bilenko, and J. Langford, Eds. Cambridge University Press, 2012. [Online]. Available: http://people.cs.umass.edu/~ronb/scaling_up_machine_learning.htm
[10] S. Hido, "Jubatus: Distributed online machine learning framework for big data," in Proc. of the 1st Extremely Large Databases (XLDB) Asia, 2012. [Online]. Available: http://www.slideshare.net/JubatusOfficial/distributed-online-machine-learning-framework-for-big-data
[11] A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan, "NIMBLE: A toolkit for the implementation of parallel data mining and machine learning algorithms on MapReduce," in Proc. of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2011.
[12] A. Ghoting et al., "SystemML: Declarative machine learning on MapReduce," in Proc. of the IEEE 27th International Conference on Data Engineering (ICDE), 2011, pp. 231–242.
[13] "Apache Hadoop website," 2012, http://hadoop.apache.org.
[14] "Kitenga Analytics," 2013, http://www.quest.com/news-release/quest-software-expands-its-big-data-solution-with-new-hadoop-ce-102012-818658.aspx.
[15] "WibiData: how it works," 2012, http://www.wibidata.com/product/how-it-works/.
[16] F. Marozzo, D. Talia, and P. Trunfio, "Using clouds for scalable knowledge discovery applications," in Euro-Par 2012: Parallel Processing Workshops, ser. Lecture Notes in Computer Science, I. Caragiannis, M. Alexander, R. Badia, M. Cannataro, A. Costan, M. Danelutto, F. Desprez, B. Krammer, J. Sahuquillo, S. Scott, and J. Weidendorfer, Eds. Springer Berlin Heidelberg, 2013, vol. 7640.
[17] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, and M. Jordan, "MLbase: A distributed machine-learning system," in Proceedings of the 6th Biennial Conference on Innovative Data Systems Research, ser. CIDR '13, Asilomar, CA, USA, January 2013.
[18] P. Panov, S. Dzeroski, and L. Soldatova, "OntoDM: An ontology of data mining," in Proceedings of the 2008 IEEE International Conference on Data Mining Workshops, ser. ICDMW '08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 752–760. [Online]. Available: http://dx.doi.org/10.1109/ICDMW.2008.62
[19] I. Horrocks, P. F. Patel-Schneider, H. Boley, S. Tabet, B. Grosof, and M. Dean, "SWRL: A Semantic Web Rule Language combining OWL and RuleML," W3C Member Submission, World Wide Web Consortium, Tech. Rep., May 2004.
[20] J. Taylor, "Machine learning in the open," Open Source Bridge, 2012. [Online]. Available: http://prezi.com/ish8cqhhiuuc/machine-learning-in-the-open/