6
SOFIA: An Analytics Recommendation System Fatemeh Nargesian, Alain Biem, Prateek Jain, Srinivasan Parthasarathy, and Deepak S. Turaga Department of Computer Science, University of Toronto, IBM Watson Research Center [email protected],[email protected],jainprateek@gmail. com,{spartha,turaga}@us.ibm.com Abstract. The SOFIA analytic recommender system helps non-expert users to select analytics (algorithms and their implementations) to fulfill a data mining task in an effective manner. The recommender is a novel framework that applies an ontology-driven approach to recommend a ranked list of analytics to fulfill a task based on the characteristics of the given dataset. SOFIA relies on an onto- logical representation of data science principles evolved by the data science com- munity; it does not require training examples nor actual deployment of candidate analytics on given datasets. Keywords: Analytics Recommendation; Data mining; Ontology; Meta-learning 1 Introduction In several domains, such as data mining, there is a variety of algorithms and their imple- mentations, in major software packages, that can be considered as candidates to solve particular data mining problems. One of the most difficult tasks in such domains is to decide when one analytic (an algorithm or one of its implementations) is better than the other to do a data mining task on a dataset with particular characteristics. “Meta- learning” is the automatic process of knowledge acquisition that relates the performance of the learning algorithms with the characteristics of the machine learning problems and datasets [4]. SOFIA is an application which delivers the following services: registering data mining analytics (algorithms or executables) by developers and experts, query- ing and searching analytics (keyword search), visualizing analytics, deploying single executables or composed analytics, and recommending analytics. In this paper, we in- troduce the analytics recommendation aspect of SOFIA, demonstrated in Figure 2. The users of SOFIA are data analysts in different domains who are interested in finding the most efficient and effective analytics to do data mining tasks. These users have limited knowledge about data mining algorithms, implementations and the characteristics of their datasets. Several approaches to solving the problem of analytics selection can be identified. The most straightforward approach is to evaluate each available analytic, on the given dataset, and select the one yielding the best results for the task. This data-driven ap- proach is simple and intuitive but is time-consuming when the data set is large and there is a large choice of analytics for the task [2]. A more efficient way is the use of expert or consensus knowledge relating the characteristics of analytics, data mining tasks and datasets [1]. The use of this knowledge is expressed via reasoning rules; however, devel- oping rules for analytics selection may be expensive and unfeasible, since good human

SOFIA: An Analytics Recommendation Systemceur-ws.org/Vol-1486/paper_112.pdf · SOFIA: An Analytics Recommendation System 3 (a) Datasets and analytics properties. (b) A recommendation

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SOFIA: An Analytics Recommendation Systemceur-ws.org/Vol-1486/paper_112.pdf · SOFIA: An Analytics Recommendation System 3 (a) Datasets and analytics properties. (b) A recommendation

SOFIA: An Analytics Recommendation System

Fatemeh Nargesian, Alain Biem, Prateek Jain, Srinivasan Parthasarathy, andDeepak S. Turaga

Department of Computer Science, University of Toronto, IBM Watson Research [email protected],[email protected],jainprateek@gmail.

com,{spartha,turaga}@us.ibm.com

Abstract. The SOFIA analytic recommender system helps non-expert users toselect analytics (algorithms and their implementations) to fulfill a data miningtask in an effective manner. The recommender is a novel framework that appliesan ontology-driven approach to recommend a ranked list of analytics to fulfill atask based on the characteristics of the given dataset. SOFIA relies on an onto-logical representation of data science principles evolved by the data science com-munity; it does not require training examples nor actual deployment of candidateanalytics on given datasets.

Keywords: Analytics Recommendation; Data mining; Ontology; Meta-learning

1 Introduction

In several domains, such as data mining, there is a variety of algorithms and their imple-mentations, in major software packages, that can be considered as candidates to solveparticular data mining problems. One of the most difficult tasks in such domains is todecide when one analytic (an algorithm or one of its implementations) is better thanthe other to do a data mining task on a dataset with particular characteristics. “Meta-learning” is the automatic process of knowledge acquisition that relates the performanceof the learning algorithms with the characteristics of the machine learning problems anddatasets [4]. SOFIA is an application which delivers the following services: registeringdata mining analytics (algorithms or executables) by developers and experts, query-ing and searching analytics (keyword search), visualizing analytics, deploying singleexecutables or composed analytics, and recommending analytics. In this paper, we in-troduce the analytics recommendation aspect of SOFIA, demonstrated in Figure 2. Theusers of SOFIA are data analysts in different domains who are interested in finding themost efficient and effective analytics to do data mining tasks. These users have limitedknowledge about data mining algorithms, implementations and the characteristics oftheir datasets.Several approaches to solving the problem of analytics selection can be identified.The most straightforward approach is to evaluate each available analytic, on the givendataset, and select the one yielding the best results for the task. This data-driven ap-proach is simple and intuitive but is time-consuming when the data set is large and thereis a large choice of analytics for the task [2]. A more efficient way is the use of expertor consensus knowledge relating the characteristics of analytics, data mining tasks anddatasets [1]. The use of this knowledge is expressed via reasoning rules; however, devel-oping rules for analytics selection may be expensive and unfeasible, since good human

Page 2: SOFIA: An Analytics Recommendation Systemceur-ws.org/Vol-1486/paper_112.pdf · SOFIA: An Analytics Recommendation System 3 (a) Datasets and analytics properties. (b) A recommendation

2 Fatemeh Nargesian et al.

experts are not always available [2]. SOFIA has unified the consensus knowledge aboutdifferent data mining algorithms and analytics in order to support automatic knowledgediscovery and provide a base for research. SOFIA ontology is an OWL-based ontologywhich represents machine learning analytics at various levels of abstractions in a taxon-omy. At the high level, the SOFIA ontology represents general data mining tasks suchas classification, clustering, or prediction, etc. Each task can then be achieved by a setof algorithms and each algorithm has various implementations registered in the system.The SOFIA ontology, currently, consists of 342 implementations in languages such asPython, Java and C++ for 461 algorithms and 227 number of data mining tasks.SOFIA automatically derives the characteristics of a dataset by analyzing its schemaand data values and infers more characteristics following some ontological rules em-bedded in the ontology. Then, a set of recommendation rules are generated on the fly tomatch the characteristics of SOFIA analytics and those of the given dataset to come upwith recommendations. In SOFIA, the logic of analytics and dataset property matchingis expressed, by a data mining expert, as simple analytic-to-dataset characteristics map-pings. A mapping is an association between a dataset’s property and its value and ananalytic’s property and its value. It describes a partial condition for matching analyticsto datasets. An instance of an analytic-to-dataset mapping is: if an algorithm supportsmulti-category classification, it could be deemed as a recommendation for a dataset withmore than two class labels. A composition of mappings between dataset properties andanalytic properties discovers whether the analytic would perform a specific task effi-ciently and effectively on the dataset. An example of such composition is shown in Fig-ure 1(b) as a rule in Semantic Web Rule Language (SWRL)(http://protege.stanford.edu).According to this rule, it is inferred that, in Figure 1(a), a classification algorithm A withshown characteristics is a good match for dataset D, since the algorithm A is relativelyrobust to noise and missing data, thus, it can efficiently handle the large amount ofnoise and missing data in dataset D. Moreover, the input and class type of algorithm Ais similar to those of dataset D. In SOFIA, other properties of analytics, such as effi-ciency, numerical stability, etc. are also taken into account while matching a dataset toanalytics. SOFIA adopts an algorithm for automatic compilation of mappings and gen-erating recommendation rules. These rules are created dynamically on the fly accordingto the dataset and data mining task provided by the user. After incorporating the rulesin SOFIA ontology, the reasoning algorithm of SOFIA evaluates the rules against theontology and the recommendations are inferred.

2 SOFIA Analytics Recommender

Sofia analytics recommender has two modes for different users: (1) domain experts, (2)data analysts. The domain expert mode allows data mining experts of SOFIA to modifydataset-analytics mappings and semantic general rules in text format. This applicationmode of SOFIA is not demonstrated in this paper due to lack of space. In data analystmode, the user first specifies a dataset by uploading a dataset in CSV format or selectingfrom a list of existing datasets (UCI repository). Then, she selects a data mining taskand the number of required recommendations as shown in left part of Figure 2. Basedon the selected task, the system might ask the user for extra information. For instance,in a classification task, SOFIA needs to know which column in the CSV file is the class

Page 3: SOFIA: An Analytics Recommendation Systemceur-ws.org/Vol-1486/paper_112.pdf · SOFIA: An Analytics Recommendation System 3 (a) Datasets and analytics properties. (b) A recommendation

SOFIA: An Analytics Recommendation System 3

(a) Datasets and analytics properties. (b) A recommendation rule written in SWRL.

Fig. 1: Matching datasets and analytics.

label column. SOFIA extracts native, derived and task-specific properties of the datasetsuch as size, missing data level, noisy data level, etc. These properties are shown tothe user in case a user would like to do further corrections on dataset properties, asin the middle part of Figure 2. In the next step, the recommender composes a set ofthe recommendation rules, on the fly, based on the characteristics of the dataset andselected task. Then, the recommendation rules are incorporated into SOFIA ontology.The recommender uses Pellet reasoner (http://clarkparsia.com/pellet) to evaluate therules. These analytics are then ranked by the sum of the mappings’ significance scoresinvolving in the rule that infers them [3]. In the following sections, we describe thedataset properties extraction and rule composition components of SOFIA recommenderin more details.

2.1 Dataset Properties Extraction

SOFIA represents a dataset with a set of native (e.g. data type, size), derived (e.g. miss-ing data level, noisy data level) and task specific properties (e.g. number of classes,class label type for classification task). Dataset properties extraction module scans thedataset provided by the user and generates the dataset profile as data property values.For example, missing data ratio is the average number of missing data values in eachrow of a dataset. This ratio can then be translated into a string value for missing datalevel property (e.g., small, medium and large) assuming threshold intervals.

2.2 Automatic Generation of Recommendation Rules

Manually creating recommendation rules that result in high quality recommendationscosts domain experts’ effort and time. SOFIA takes as input a set of simple mappings,written by an expert, and generates a set of recommendation ontological rules, automat-ically and on the fly, by combining appropriate mappings. The body of a recommenda-tion rule is a conjunction of conditions on dataset and analytics’ properties that qual-ify analytics as recommendations for a dataset. The rule generation module of SOFIAadopts a filtering approach to prune candidate analytics for recommendation. The al-gorithm starts by assuming that all analytics available in the ontology are potentialrecommendations. The mapping selection procedure follows a greedy policy to selectmappings based on significance scores. These scores are provided in the mapping fileby the domain expert. At each step, a mapping is selected and if it premise is satisfied

Page 4: SOFIA: An Analytics Recommendation Systemceur-ws.org/Vol-1486/paper_112.pdf · SOFIA: An Analytics Recommendation System 3 (a) Datasets and analytics properties. (b) A recommendation

4 Fatemeh Nargesian et al.

Fig. 2: The data analyst inputs a dataset and a data mining task (left) the derived dataset propertiesare confirmed by the user (middle) the recommended analytics are returned to the user.

based on the characteristics of the dataset, the clauses in its head are added to the par-tial recommendation rule. The head clauses are mostly related to the characteristics ofrecommended analytics. Then, the partial rule is added to the ontology and the reasonerinfers recommended analytics that might be filtered in the next filtering steps. The filter-ing procedure terminates when the partial rule is restrictive enough to return the numberof required recommendations.

3 Implementation and Demonstation

In this demonstration, we introduce SOFIA, a tool to explore data mining analytics andrequest for recommendations for fulfilling a task on a user given dataset. SOFIA hasbeen implemented using Java. SOFIA ontology represents machine learning analyticsusing Web Ontology Language (OWL). SOFIA offers an interactive web interface im-plemented using play framework. In the demonstration, users can play with SOFIAontology visualization tool order to become familiar with existing analytics’ proper-ties, available at different levels of taxonomy. They can also browse description andontological information about different analytics and search for analytics using key-words. Users can also advocate expert users and register new analytics or matchingmappings to SOFIA. On the other hand, a user can use SOFIA as a data analyst re-questing analytics recommendations for fulfilling a data mining task on a dataset. Inthis demonstration, we will highlight features of SOFIA using both UCI repositorydatasets (https://archive.ics.uci.edu/ml/datasets.html) and real datasets.

References1. Christophe Giraud-Carrier et al. Introduction to the special issue on meta-learning. In Machine

Learning, volume 54, pages 187–193, 2004.2. Ricardo B. C. Prudncio et al. Meta-learning approaches to selecting time series models. In

Machin Learning, pages 121–137, 2004.3. et al. Quan Sun. Pairwise meta-rules for better meta-learning-based algorithm ranking. In

Machine Learning, volume 93, pages 141–161, 2013.4. et al. Smith-Miles. Cross-disciplinary perspectives on meta-learning for algorithm selection.

In ACM Computer Survey, volume 41, 2009.

Page 5: SOFIA: An Analytics Recommendation Systemceur-ws.org/Vol-1486/paper_112.pdf · SOFIA: An Analytics Recommendation System 3 (a) Datasets and analytics properties. (b) A recommendation
Page 6: SOFIA: An Analytics Recommendation Systemceur-ws.org/Vol-1486/paper_112.pdf · SOFIA: An Analytics Recommendation System 3 (a) Datasets and analytics properties. (b) A recommendation