Ontology-based Recommender for Distributed Machine Learning Environment
Daniel Pop, Caius Bogdanescu
Faculty of Mathematics and Computer Science
West University of Timisoara
Timisoara, Romania
Email: {danielpop, caius.bogdanescu}@info.uvt.ro
Abstract—Domain experts in different areas have a large number of options for approaching their specific data analysis problem. In the exploration of large data sets on HPC systems, choosing which method to use, or how to tune the parameters of an algorithm to achieve good results, are challenging tasks even for data analysts themselves. In this paper, we propose a recommendation module for a distributed machine learning environment aimed at helping end-users obtain optimized results for their data analysis / machine learning problem.
I. INTRODUCTION
Given the enormous growth of collected and available data in companies, industry and science, techniques for analyzing such data are becoming ever more important. Today, the data to be analyzed is no longer restricted to sensor data and classical databases; more and more it includes textual documents and webpages (text mining, Web mining), spatial data, multimedia data and linked data (molecules, social networks). Analytics tools allow end-users to harvest the meaningful patterns buried in large volumes of structured and unstructured data; analyzing big datasets gives users the power to identify new revenue sources, develop loyal and profitable customer relationships, and run organizations more efficiently and cost effectively.
Research in knowledge discovery and machine learning combines classical questions of computer science (efficient algorithms, software systems, databases) with elements from artificial intelligence and statistics, up to user-oriented issues (visualization, interactive mining). Traditional, relational-model oriented approaches, such as Teradata, Oracle or Netezza, provide means to realize parallel implementations of ML-DM (Machine Learning and Data Mining) algorithms, but there are a few issues with this line of work: (1) expressing ML-DM algorithms in SQL code is a complex task and the result is difficult to maintain; (2) large-scale installations of these products are expensive; (3) the nature of data is shifting away from structured to un- (or semi-) structured data: while structured data follows a near-linear growth, unstructured data (e.g. audio and video) and semi-structured data (e.g. Web traffic data, social media content, sensor-generated data) exhibit an exponential growth (source: IDC Digital Universe 2009). The exploration of large, unstructured data sets on HPC systems is enabled by emergent technologies (NoSQL data stores, MapReduce and distributed file systems) that generated novel approaches and solutions to machine learning and data mining problems. In a recent report [1], the author reviews the most recent developments in this field, categorizes them into five distinct classes and points out some of their common limitations:
• lack of responsiveness – most systems do not offer end-users feedback concerning the progress of the launched tasks,
• lack of adequate customization – surveyed systems either target expert users, for whom they offer low-level details and parametrization capabilities, or novice users, who are left with no tweaking possibilities, all parameters being magically tuned by the system,
• lack of recommendations – derived from the same polarization described above, end-users are not supported by intelligent, self-learning systems able to offer recommendations for the problems to be solved.
Our current work focuses on the architecture, design and implementation of a scalable, easy to use and deploy solution for ML-DM in the context of the distributed computing paradigm, targeting end-users with little programming or statistical experience who are nevertheless willing to run and tweak advanced scientific ML tasks, such as researchers and practitioners from fields like medicine, finance, telecommunications etc. It is a distributed system relying on existing distributed ML-DM frameworks, but enhancing them with user-centric features. The focus of this paper is the recommendation module, designed to support end-users in creating a data mining scenario.
The remainder of the paper is structured as follows: the next section briefly presents similar works and how our prototype relates to them, followed by Section III, which describes a typical usage scenario of the application. The overall architecture of the system is addressed in Section IV. Section V is devoted to an in-depth presentation of the recommendation module, while the last section presents the most important findings learned so far during the work on this prototype and our future plans.
II. RELATED WORKS
Existing programming paradigms for expressing large-scale parallelism, such as MapReduce (MR) and the Message Passing Interface (MPI), are the de facto choices for implementing ML-DM algorithms. More and more interest has been devoted to MR due to its ability to handle large datasets and its built-in resilience against failures. There are several recent papers
and books overviewing distributed and cloud-based solutions,
such as [2], [3], [4], [5], [1]. In [1], the author classifies them
in five categories: ML environments from the cloud, plugins
for ML tools, distributed ML libraries, complex ML systems
and SaaS providers for ML. Our work is related to the last three categories, as we use distributed ML libraries to build a complex ML system whose services are offered based on a SaaS model.
Distributed ML libraries, such as Apache Mahout™ (http://mahout.apache.org) [6], GraphLab (http://graphlab.org) [7], DryadLINQ (http://research.microsoft.com/en-us/projects/DryadLINQ/) [8], [9], Jubatus (http://jubat.us/) [10], NIMBLE [11] or SystemML [12], are collections of complex ML methods and algorithms running on different distributed setups (Hadoop, Dryad, SciDB, MPI). They allow users to run out-of-the-box algorithms, or to implement their own, in parallel mode over a cluster of computers. These solutions do not integrate, nor use, statistics/mathematics software; rather, they offer self-contained packages of optimised, state-of-the-art ML-DM methods and algorithms.
In our system, we are going to integrate Apache Mahout as a "provider" of ML algorithms because of its advantages compared to other libraries: it is open-source, technologically compatible with our approach (Apache Hadoop [13] and Java), backed by an active and growing community contributing to a steady evolution of the library, and easy to use and integrate.
The complex ML systems category encompasses solutions for business intelligence and data analytics that are deployable either on on-premise or in-the-cloud clusters, provide a rich set of graphical tools to analyse, explore and visualize large amounts of data, and utilize Hadoop as processing engine and storage environment. They differ in how data is integrated and processed, in the supported data sources, and in the overall complexity of the system. Among the most popular we can mention Kitenga Analytics [14], Pentaho Business Analytics (http://www.pentaho.org), Platfora (http://platfora.com), Skytree Server (http://skytree.net), InsightsOne (http://insightsone.com/), Kognitio (http://kognitio.com) or Wibidata (http://wibidata.com) [15]. In addition to these commercial solutions, there are also software platforms developed by academic research groups, such as the Data Mining Cloud Framework [16], a software framework that enables the definition and execution of KDD workflows of services (e.g. model creation service, data storage service etc.) on top of Cloud computing and storage environments.
Our vision is to build a system that falls into this category, but one that is open-source and more focused on knowledge discovery and ML techniques rather than on advanced business analytics, which is already well covered by existing systems.
Software as a Service (SaaS) providers for ML offer their services using a rich graphical user interface and/or RESTful APIs for programmatic access. The hosting platform is usually powered by cloud providers, but in some (rare) cases the solution may also be installed on-premise (Myrrix, http://myrrix.com). They mainly provide predictive modelling methods; examples of such providers are BigML (http://bigml.com), Google Prediction API (https://developers.google.com/prediction/) and Eigendog (https://eigendog.com/#home). We did not consider here providers of SQL-over-Hadoop solutions (e.g. Cloudera Impala, Hadapt, Hive) because their main target is not ML-DM, but rather fast, elastic and scalable SQL processing of relational data using the distributed architecture of Hadoop.
This last class of products inspired us in the way we offer our services, i.e. using a SaaS model.
In [17], the authors present a distributed ML system that also aims at making ML-DM more accessible to non-experts. An ML task is transformed into a learning plan that is further processed by an optimizer, which tries to first deliver a quality answer to the user and then, in parallel, to iteratively improve the result in the background.
Modelling the DM domain (data types, algorithms, models etc.) using ontologies has been addressed by the OntoDM project [18] for the past five years. OntoDM is a deep ontology for data mining that includes definitions of basic data mining entities, such as data type and dataset, data mining task, data mining algorithm and components thereof (e.g., distance function). Our ontology (ML-Ontology) is more focused on expressing data mining constraints and functional rules, which form the core of our recommendation module. It also defines the common vocabulary shared by the different components of the system.
III. TYPICAL USAGE SCENARIO
In this section we describe a simple usage scenario of our prototype. We start by noting that the end-user's interaction with the system is completely managed by the UI service, which acts as a graphical interface for all other services of the system, although these are also exposed to programmers and other consumers via RESTful APIs.
In the first step, the end-user has to select a data source
with the help of services offered by the Data Source Manager
(all services are briefly described in section IV). Next, the
user selects the type of ML model he/she wants to build
out of the dataset. For example, the user wants to build a
predictive model for a classification task. At this point, the
recommendation module will retrieve the meta-data associated
with the dataset and it will recommend the next steps of the
process. For example, perform a normalization of numerical
attributes using Standard score, followed by the training of a
SVM classifier using SMO (Sequential Minimal Optimization)
algorithm. The user may accept the suggestions and adjust the parameters accordingly, or may choose a different method than the suggested one. Figure 1 shows the user interface of the system at this point. In the next phase, the execution plan for our problem is submitted to the Execution subsystem, more precisely to the Hadoop Job Submitter service, which will further schedule and monitor its execution on the cluster.
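For illustration only, the recommended plan above corresponds, in single-node Weka terms, to a standardization filter (Standard score) followed by SMO training. The following minimal sketch assumes a local ARFF file and that the class is the last attribute; in the deployed system the data comes from HDFS and the training is executed by the distributed ML library, not by a local Weka call.

import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Standardize;

public class RecommendedPlanSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical dataset path; the real system reads the data from HDFS
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Step 1 recommended by the module: Standard score normalization of numeric attributes
        Standardize standardize = new Standardize();
        standardize.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, standardize);

        // Step 2: train an SVM classifier using the SMO algorithm
        SMO smo = new SMO();
        smo.buildClassifier(normalized);
        System.out.println(smo);
    }
}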
The next section briefly presents the services of the system and their roles.
Figure 2. Overall architecture of the system
Figure 1. Snapshot of user interface
IV. ARCHITECTURE OF THE SYSTEM
Our system is conceived as a Service Oriented Architecture,
inter-service communication being handled via RESTful APIs.
The system is split into two subsystems with complementary
responsibilities: Problem Definition subsystem and Execution
subsystem (see Figure 2), plus the UI (User Interface) service
that facilitates the end-users’ interaction with the system.
Problem Definition subsystem offers its functionality with
the help of the following services:
(1) Data Source Manager – is responsible for defining sources of data in the system. Actors (end-users, or other subsystems) are able to create a new data source, or to remove an existing one. A new data source can be created either based on a user dataset uploaded from the local computer to the cloud, or by linking to a dataset already available at a URL (e.g. http://, s3://, azure:// etc.). The data is stored on the Hadoop Distributed File System (HDFS, http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html), managed by the Data Storage service.
(2) Data Storage – is responsible for accessing data on
storage infrastructure. In the first prototype of the system, we are targeting an HDFS installation. The Data Storage
service offers an optimized, easy to use interface to
store or retrieve datasets or results produced during the
execution of machine learning processes.
(3) Meta-data Storage – is responsible for storing and accessing meta-data about data sources; it is implemented using a document database, such as MongoDB (a minimal sketch of such a meta-data record is given after this list).
(4) Recommendation Module – proposes various recommendations to end-users in terms of the appropriate learning methods to use, or how to tweak the algorithms' parameters for quicker convergence on each new data set. More details about this service are given in Section V.
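The concrete schema of the meta-data records is not prescribed here; as a minimal sketch, assuming a MongoDB collection named datasets and illustrative field names that mirror the meta-attributes discussed in Section V, the Meta-data Storage service could persist a record as follows (connection string, database, collection and field values are assumptions):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MetaDataStorageSketch {
    public static void main(String[] args) {
        // Hypothetical connection string, database and collection names
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> datasets =
                    client.getDatabase("metadata").getCollection("datasets");
            // Illustrative meta-attributes of one data source (see Section V)
            Document meta = new Document("datasetId", "customer-churn")
                    .append("hdfsPath", "hdfs:///data/customer-churn.csv")
                    .append("numberOfAttributes", 20)
                    .append("numberOfInstances", 100000)
                    .append("missingValuesRatio", 0.02)
                    .append("discreteAttributesRatio", 0.6);
            datasets.insertOne(meta);
        }
    }
}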
Execution subsystem is composed of the following web
services:
(1) Hadoop Job Submitter – is responsible for monitoring and scheduling jobs over multiple Apache Hadoop installations. In a generalized scenario, our system is able to manage two or more installations of the Apache Hadoop framework, for example some of them available from public cloud providers, while others are available in on-premises data centers. Each installation is managed by a Hadoop Job Controller that interacts with the Hadoop Job Submitter.
(2) Hadoop Job Controller – is a wrapper servicing each Hadoop cluster; it monitors and controls (stop, resume, start) jobs on the cluster. Monitoring is implemented using a predefined set of Hadoop counters, and the service is able to deliver these data to other services, for example the Hadoop Job Submitter.
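The concrete set of counters is not listed here; purely as an illustration, a Hadoop Job Controller could report progress and one predefined counter for an already submitted job through the standard MapReduce client API roughly as follows (the helper name and the chosen counter are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobMonitorSketch {
    // Hypothetical helper: report the status of an already submitted job
    public static void report(String jobIdString) throws Exception {
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName(jobIdString));
        if (job == null) {
            System.out.println("job not found: " + jobIdString);
            return;
        }
        System.out.printf("map %.0f%%, reduce %.0f%%%n",
                job.mapProgress() * 100, job.reduceProgress() * 100);
        // Example of a counter that could be forwarded to the Hadoop Job Submitter
        long mapOutputRecords = job.getCounters()
                .findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
        System.out.println("map output records: " + mapOutputRecords);
    }
}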
The User Interface service (Figure 1) supports end-users’
interaction with the system. It delivers all the services offered
by the system in an easy, accessible, clear way to end-users.
Essentially, it is a Web application developed using a mix
of client- and server-side technologies following the MVC
(Model-View-Controller) architectural pattern.
V. RECOMMENDATION MODULE
The purpose of the recommendation module is to assist end-users in the process of designing their ML-DM task. It offers recommendations for selecting the most appropriate methods for a task, or for choosing the algorithm that will produce results closest to the user's needs. It is a rule-based expert system modelled as an ontology augmented with SWRL-based rules.
The ML-Ontology, described in the next sub-section, uses the concept of problem to represent a DM type of problem (such as classification, clustering or association analysis), a task (such as normalization, or building the model), or the tweaking of the parameters of an algorithm. For each of these problems, the system generates recommendations based on the current evaluation context, which is composed of the meta-attributes of the dataset under analysis plus the end-user's preferences.
For each dataset, a set of meta-attributes is defined, such as:
• number of attributes (variables)
• number of instances (examples)
• missing values ratio (= number of instances with missing
values / total number of instances)
• discrete attributes ratio (= number of discrete attributes /
total number of attributes)
For structured data, these indicators can be computed, more or less, automatically. For very large structured datasets, they are estimated on a randomly selected subset of the original dataset, since these estimates should suffice for recommendation purposes. In the case of semi-structured or unstructured data, the meta-attributes have to be provided by end-users.
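As an illustration, for a structured dataset loaded into a Weka Instances object, the meta-attributes listed above could be computed as sketched below; the file name is an assumption, and in the running system the data would be read through the Data Storage service instead:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetaAttributeSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // hypothetical path

        int numberOfAttributes = data.numAttributes();
        int numberOfInstances = data.numInstances();

        // missing values ratio = instances with at least one missing value / total instances
        int withMissing = 0;
        for (int i = 0; i < numberOfInstances; i++) {
            if (data.instance(i).hasMissingValue()) withMissing++;
        }
        double missingValuesRatio = (double) withMissing / numberOfInstances;

        // discrete attributes ratio = discrete (nominal) attributes / total attributes
        int discrete = 0;
        for (int j = 0; j < numberOfAttributes; j++) {
            if (data.attribute(j).isNominal()) discrete++;
        }
        double discreteAttributesRatio = (double) discrete / numberOfAttributes;

        System.out.printf("attributes=%d, instances=%d, missing=%.2f, discrete=%.2f%n",
                numberOfAttributes, numberOfInstances,
                missingValuesRatio, discreteAttributesRatio);
    }
}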
End-user preferences, on the other hand, are inputs given by the end-user. For example, in the case of a classification task we can consider the following parameters (all of them ranging from 1 - lowest to 5 - highest):
• readability of the model – how easy it is for humans to interpret the inferred model; some predictive models, such as a classification tree, are clearer compared to others (e.g. a neural network)
• speed of learning – ability to adjust the model learning effort
• accuracy – ability to adjust the model accuracy
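Before reasoning, these preferences and the dataset meta-attributes have to be asserted into the evaluation context. The exact mechanism is internal to the recommendation module; a minimal sketch using the OWL API and the property names that appear later in Figure 5 (the individual names, namespace IRI and file name are assumptions) could look like this:

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import java.io.File;

public class EvaluationContextSketch {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology =
                manager.loadOntologyFromOntologyDocument(new File("ml-ontology.owl"));
        OWLDataFactory df = manager.getOWLDataFactory();
        String ns = "http://example.org/ml-ontology#"; // hypothetical namespace

        // Hypothetical individuals holding the current evaluation context
        OWLNamedIndividual prefs = df.getOWLNamedIndividual(IRI.create(ns + "currentPreferences"));
        OWLNamedIndividual dataset = df.getOWLNamedIndividual(IRI.create(ns + "currentDataset"));
        manager.addAxiom(ontology, df.getOWLClassAssertionAxiom(
                df.getOWLClass(IRI.create(ns + "ClassificationPreferences")), prefs));
        manager.addAxiom(ontology, df.getOWLClassAssertionAxiom(
                df.getOWLClass(IRI.create(ns + "Dataset")), dataset));

        // End-user preferences (1 - lowest to 5 - highest) and dataset meta-attributes,
        // asserted with the data property names shown in Figure 5
        manager.addAxiom(ontology, df.getOWLDataPropertyAssertionAxiom(
                df.getOWLDataProperty(IRI.create(ns + "readability")), prefs, 4));
        manager.addAxiom(ontology, df.getOWLDataPropertyAssertionAxiom(
                df.getOWLDataProperty(IRI.create(ns + "speedOfClassification")), prefs, 5));
        manager.addAxiom(ontology, df.getOWLDataPropertyAssertionAxiom(
                df.getOWLDataProperty(IRI.create(ns + "numberOfDiscreteAttributes")), dataset, 12));
        manager.addAxiom(ontology, df.getOWLDataPropertyAssertionAxiom(
                df.getOWLDataProperty(IRI.create(ns + "numberOfContinuousAttributes")), dataset, 8));
    }
}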
How does the recommendation module work? Firstly, the meta-attributes of the dataset are retrieved using the Meta-data Storage service; next, this information, together with the end-user's inputs, is fed into the ontological reasoner, which, using the ML-Ontology, infers appropriate suggestions for the next step; these inferred suggestions are consumed by the UI service, which guides the user in the process of constructing a DM-ML task. The recommendations (see the next sub-section for an example) define the next steps of the process (e.g. a normalization followed by constructing the model), or the list of possible values for a user input (e.g. what algorithms are available for constructing a classification tree). This process is iterated until no more steps are left. The recommendation module exposes its functionality via a RESTful API that is used by the other services of the system (e.g. the UI component).
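The shape of this RESTful API is not detailed here; a purely hypothetical sketch of how the UI service (or any other consumer) might request recommendations, assuming a /recommendations endpoint, a JSON payload and a local deployment, is given below:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RecommendationClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and payload; the actual API of the module may differ
        String body = "{ \"problem\": \"classification\", \"datasetId\": \"customer-churn\","
                + " \"preferences\": { \"readability\": 4, \"speedOfClassification\": 4 } }";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/recommendations"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. a list of recommended next steps
    }
}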
A. ML-Ontology
The ML-Ontology is the core component of the recommendation module. It is built following a problem decomposition strategy, where a problem (represented by the Problem class) either represents a type of DM problem (such as classification, clustering or association analysis), a method (such as normalization, or building a model), or an algorithm. Complex problems are broken into sub-problems using the hasChildren object property. Specializations of the Problem class – Method, Algorithm and *Preferences (where * stands for Classification, Clustering or another type of DM-ML problem) – are used to distinguish between the different types of problems. Figure 3 shows the main concepts of the ontology and their individuals. For example, classification, clustering, buildModel or normalization are all individuals of type Problem or of its specializations.
The problem decomposition is implemented using the hasChildren generic object property. For example, the classification individual, which represents a standard classification task, is in a hasChildren relation with the normalization, buildModel and validateModel individuals.
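The paper does not prescribe how the ontology is queried programmatically; as a sketch, assuming the ontology is stored in an OWL file and queried through the OWL API with the HermiT reasoner (file name and namespace IRI are assumptions), the children of the classification problem could be retrieved like this:

import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import java.io.File;

public class OntologyQuerySketch {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology =
                manager.loadOntologyFromOntologyDocument(new File("ml-ontology.owl"));
        OWLDataFactory df = manager.getOWLDataFactory();

        String ns = "http://example.org/ml-ontology#"; // hypothetical namespace
        OWLNamedIndividual classification = df.getOWLNamedIndividual(IRI.create(ns + "classification"));
        OWLObjectProperty hasChildren = df.getOWLObjectProperty(IRI.create(ns + "hasChildren"));

        // Reasoner over the ontology; inferred recommend facts are retrieved the same way
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(ontology);
        reasoner.getObjectPropertyValues(classification, hasChildren)
                .getFlattened()
                .forEach(child -> System.out.println(child.getIRI()));
    }
}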
In order to represent different 'solutions' of a problem, for example that method_classificationTree is implemented by the alg_C4.5, alg_J4.8, alg_RandomForest or alg_ADT algorithms, the object property hasOptions has been introduced with its subtypes: implementedBy, solvedBy and recommend. Figure 4 presents the details of the alg_SMO algorithm. One can note that this algorithm is an SVM learner (method_SVM is implementedBy this individual), it has three parameters represented as data property assertions (param_cacheSize, param_epsilon, param_buildLogisticModels) and it is exposedBy the Weka library. A Library is a collection of ML algorithms, and each algorithm may be implemented by two or more ML libraries.
Figure 3. ML-Ontology: concepts and individuals

The recommender system relies on the ML-Ontology's ability to infer new facts (reasoning), but the characteristics of the dataset and the end-user preferences need to be considered as well. In our approach, we use a set of rules expressed in the Semantic Web
Rule Language (SWRL) [19], built on top of the static individuals and relations described so far, to ensure context-aware reasoning. The expert knowledge encoded in these rules was derived from empirical studies, such as [20]. For example, Figure 5 shows the SWRL encoding of the following rule: "for a classification problem, if the speed of classification is important, the readability of the model is important as well and the dataset contains both numerical and discrete attributes, then the recommended classifier is a Classification Tree". recommend is an object property, a subtype of hasOptions, used to represent facts inferred by SWRL rules.
The problem decomposition approach allows us to uniformly handle recommendations for tasks, methods, or algorithms.

Figure 4. ML-Ontology: algorithm details

ClassificationPreferences(?m), Dataset(?d),
numberOfContinuousAttributes(?d, ?noca),
numberOfDiscreteAttributes(?d, ?noda),
speedOfClassification(?m, ?soc),
readability(?m, ?r),
integer[>= 4](?soc), integer[>= 4](?r),
integer[> 0](?noca), integer[> 0](?noda)
-> recommend(buildModel, method_classificationTree)

Figure 5. Method recommendation
VI. CONCLUSIONS AND FUTURE PLANS
This paper presents the overall architecture of a distributed system for machine learning / data mining problems that combines the semantics encoded in the ML-Ontology with distributed processing frameworks, such as Apache Hadoop. The focus of the paper is on the recommendation module, whose purpose is to support and guide end-users in their ML-DM scenarios.
The recommendation module is modeled as a rule-based system. We are planning to keep track of end-user feedback on the proposed suggestions. As the system accumulates more and more end-user feedback, we will consider changing the recommendation model from an expert approach to a collaborative filtering approach.
In the future, we will continue the development of the remaining services of the system, and we will run experiments and real-life use cases to evaluate its performance and overall user acceptance level.
ACKNOWLEDGMENTS
This work was supported by EC-FP7 project FP7-REGPOT-
2011-1 284595 (HOST).
REFERENCES
[1] D. Pop, "Machine learning and cloud computing: Survey of distributed and SaaS solutions," Institute e-Austria Timisoara, Tech. Rep. 2012-1, December 2012.
[2] R. Bekkerman, M. Bilenko, and J. Langford, Eds., Scaling up Machine Learning. Cambridge University Press, 2012. [Online]. Available: http://people.cs.umass.edu/~ronb/scaling_up_machine_learning.htm
[3] S. Charrington, "Three new tools bring machine learning insights to the masses," Read Write Web, February 2012. [Online]. Available: http://www.readwriteweb.com/hack/2012/02/three-new-tools-bring-machine.php
[4] W. Eckerson, "New technologies for big data," 2012. [Online]. Available: http://www.b-eye-network.com/blogs/eckerson/archives/2012/11/new_technologie.php
[5] D. Harris, "5 low-profile startups that could change the face of big data," 2012. [Online]. Available: http://gigaom.com/cloud/5-low-profile-startups-that-could-change-the-face-of-big-data/
[6] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action. Manning Publications, 2011.
[7] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," in Proc. of the 38th Intl. Conf. on Very Large Databases (VLDB), vol. 5, no. 8, August 2012.
[8] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language," in Proc. of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI). USENIX Association, Berkeley, 2008, pp. 1–14.
[9] M. Budiu, D. Fetterly, M. Isard, F. McSherry, and Y. Yu, "Large-scale machine learning using DryadLINQ," in Scaling up Machine Learning, R. Bekkerman, M. Bilenko, and J. Langford, Eds. Cambridge University Press, 2012. [Online]. Available: http://people.cs.umass.edu/~ronb/scaling_up_machine_learning.htm
[10] S. Hido, "Jubatus: Distributed online machine learning framework for big data," in Proc. of the 1st Extremely Large Databases (XLDB) Asia, 2012. [Online]. Available: http://www.slideshare.net/JubatusOfficial/distributed-online-machine-learning-framework-for-big-data
[11] A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan, "NIMBLE: A toolkit for the implementation of parallel data mining and machine learning algorithms on MapReduce," in Proc. of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2011.
[12] A. Ghoting et al., "SystemML: Declarative machine learning on MapReduce," in Proc. of the IEEE 27th International Conference on Data Engineering (ICDE), 2011, pp. 231–242.
[13] "Apache Hadoop website," 2012, http://hadoop.apache.org.
[14] "Kitenga Analytics," 2013, http://www.quest.com/news-release/quest-software-expands-its-big-data-solution-with-new-hadoop-ce-102012-818658.aspx.
[15] "WibiData: how it works," 2012, http://www.wibidata.com/product/how-it-works/.
[16] F. Marozzo, D. Talia, and P. Trunfio, "Using clouds for scalable knowledge discovery applications," in Euro-Par 2012: Parallel Processing Workshops, ser. Lecture Notes in Computer Science, I. Caragiannis, M. Alexander, R. Badia, M. Cannataro, A. Costan, M. Danelutto, F. Desprez, B. Krammer, J. Sahuquillo, S. Scott, and J. Weidendorfer, Eds. Springer Berlin Heidelberg, 2013, vol. 7640.
[17] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, and M. Jordan, "MLbase: A distributed machine-learning system," in Proceedings of the 6th Biennial Conference on Innovative Data Systems Research, ser. CIDR '13, Asilomar, CA, USA, January 2013.
[18] P. Panov, S. Dzeroski, and L. Soldatova, "OntoDM: An ontology of data mining," in Proceedings of the 2008 IEEE International Conference on Data Mining Workshops, ser. ICDMW '08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 752–760. [Online]. Available: http://dx.doi.org/10.1109/ICDMW.2008.62
[19] I. Horrocks, P. F. Patel-Schneider, H. Boley, S. Tabet, B. Grosof, and M. Dean, "SWRL: A Semantic Web Rule Language combining OWL and RuleML," W3C Member Submission, World Wide Web Consortium, Tech. Rep., May 2004.
[20] J. Taylor, "Machine learning in the open," Open Source Bridge, 2012. [Online]. Available: http://prezi.com/ish8cqhhiuuc/machine-learning-in-the-open/