RM World 2014: A user interface for big data with RapidMiner


A User Interface For Big Data With RapidMiner

Marcelo Beckmann

Nelson F. F. Ebecken

Beatriz S. L. Pires de Lima

Myrian Christina de Aragão Costa

Agenda

Introduction

Previous Work

Motivations

Architecture

Operators

Mouse Runner

Experiments

Conclusion

Introduction

Since 2012, 2.5 exabytes of data have been created every day, and this volume is still growing;

How can useful information be extracted from this daily mountain of data?

The MapReduce paradigm and its related frameworks answered the question;

Google, Yahoo, Netflix, Amazon, YouTube, Facebook, and Apple are good examples of successful big data projects;

The Hadoop environment is the result of the great effort made by open source initiatives since 2004;

Introduction

Despite the great progress made in the backend engines, there is a lack of user interfaces;

Nowadays, most of the work to configure and run Hadoop components is done through scripting;

This work aims to help improve this scenario.

Previous Work

Since the advent of MapReduce, research and development have focused mostly on backend engines;

In recent years, several initiatives have started to make the Hadoop environment more user friendly;

Companies such as Cloudera, Pentaho, Talend, and Hortonworks made major contributions to the usability of tools in the Hadoop environment, especially in execution control, ETL, and databases;

Radoop (*) made significant contributions by integrating Mahout with RapidMiner through a proprietary solution.

* Radoop was acquired by RapidMiner in July 2014

Motivations

The Hadoop environment, and especially the Mahout engine, still lacks an open source UI integration;

In terms of Java coding, starting jobs, making remote API calls, and retrieving results from the Hadoop environment are too complex. An encapsulation layer is needed to simplify these activities;

There are integration and connectivity problems in heterogeneous environments and complex network infrastructure.

Architecture

Our research

Big Data Extension

RapidMiner is easy to extend;

A RapidMiner extension with 14 operators was created;

Big data operators can be mixed with existing RapidMiner operators in order to run jobs and analyze results;

Integrated with Hadoop, HDFS, Hive, Mahout;

Open Source.

Mouse Runner

Provides an extra layer for remote calls and activation;

Reduces the coupling between the presentation tier and the business services;

Starts jobs and retrieves results from the Hadoop-related components.
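The decoupling idea can be illustrated with a small sketch. All names here (JobRunner, LocalWordCountRunner, OperatorPanel) are hypothetical and are not part of Mouse Runner; the example only assumes that the presentation tier talks to a narrow interface, while the class behind it decides how and where the job runs.

```java
// Hypothetical names throughout; this is not the Mouse Runner API.

/** The only contract the presentation tier depends on. */
interface JobRunner<R> {
    R run() throws Exception;
}

/**
 * Stand-in for a business service. A real runner would submit a job to a
 * remote cluster; this one counts words locally so the sketch is self-contained.
 */
final class LocalWordCountRunner implements JobRunner<Integer> {
    private final String text;

    LocalWordCountRunner(String text) {
        this.text = text;
    }

    @Override
    public Integer run() {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

/** Presentation-tier code: it knows JobRunner and nothing about Hadoop. */
final class OperatorPanel {
    static <R> String execute(JobRunner<R> runner) {
        try {
            return "Result: " + runner.run();
        } catch (Exception e) {
            return "Job failed: " + e.getMessage();
        }
    }
}
```

Under this assumption, swapping the local runner for one that calls a remote service would not require any change in OperatorPanel.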

Operators


Masters node – Contains all the configuration necessary to connect the operators to a Hadoop environment;

IO Operators – Execute operations in HDFS and the Hive database;

Read Hive Database – Executes queries in the Hive database and returns an ExampleSet with samples that also points to a file in HDFS. Other Big Data operators refer to this file, not to the samples;

Clustering – Clustering algorithms from Mahout;

Transformation – Performs transformations in the Hive database;

Utility – Runs scripts through an SSH connection and kills jobs.

Mouse Runner

Mouse Runner simplifies the call to Hadoop components:

KMeansRunner runner = new KMeansRunner();
// Connection settings
runner.setHost("192.168.13.131");
runner.setHdfsPort("9000");
runner.setMapredPort("9001");
// Input and output paths
runner.setInputPath("/user/hadoop-users/testdata");
runner.setOutputPath("/user/hadoop-user/output");
// K-Means parameters (k is the number of clusters)
runner.setK(5);
runner.setMaxRuns(10);
// Start the job and collect the clustering result
ClusterResult result = runner.run();

Mouse Runner

Ports to open:

9000, 9001, 50070, 50075, 50090, 50105, 50030, 50060, 8020, 50010, 50020, 50100, 10000, ...

Integration among heterogeneous OS and networks

Mouse Runner

Ports to open:

9999, 10000
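When working across firewalls, it can help to confirm from the client machine that the required ports are actually reachable. The following is a minimal sketch using only the standard java.net API; the class name PortCheck, the default host "localhost", and the 2-second timeout are assumptions made for illustration and are not part of Mouse Runner.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper for probing TCP ports from a client machine. */
final class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host could not be resolved.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost"; // assumed default
        int[] ports = {9999, 10000}; // the two ports listed on the slide above
        for (int port : ports) {
            boolean open = isReachable(host, port, 2000);
            System.out.println(port + (open ? " open" : " closed or filtered"));
        }
    }
}
```

The same check can be pointed at the longer port list to see which entries a given network blocks.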

Experiments

• K-means clustering comparison between RapidMiner and Mahout using the Davies–Bouldin index (DBI);

• Davies–Bouldin index: an internal evaluation method that measures the quality of clusters; the lower the DBI, the better the cluster quality;

• The aim is to validate the integration with Mahout, using the RapidMiner K-Means as the baseline;

• Datasets: Synthetic Control, Covertype, and Household from the UCI Machine Learning Repository;

• The results in terms of the Davies–Bouldin index were very similar;

• RapidMiner responded instantly on the smaller dataset;

• Mahout scaled better on the bigger datasets.
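For readers unfamiliar with the metric, here is a minimal sketch of the standard Davies–Bouldin definition: for each cluster, take the worst ratio of (its scatter plus another cluster's scatter) to the distance between the two centroids, then average these worst ratios over all clusters. The sketch assumes Euclidean distance and centroids computed as the mean of the points; the class name DaviesBouldin is made up for illustration, and this is not necessarily the exact variant computed by RapidMiner or Mahout.

```java
import java.util.List;

/** Illustrative Davies-Bouldin index; not taken from RapidMiner or Mahout. */
final class DaviesBouldin {

    private DaviesBouldin() {}

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Mean of the points in one non-empty cluster. */
    static double[] centroid(List<double[]> points) {
        double[] c = new double[points.get(0).length];
        for (double[] p : points) {
            for (int d = 0; d < c.length; d++) {
                c[d] += p[d];
            }
        }
        for (int d = 0; d < c.length; d++) {
            c[d] /= points.size();
        }
        return c;
    }

    /** Requires at least two non-empty clusters with distinct centroids. */
    static double index(List<List<double[]>> clusters) {
        int k = clusters.size();
        if (k < 2) {
            throw new IllegalArgumentException("need at least two clusters");
        }
        double[][] centroids = new double[k][];
        double[] scatter = new double[k]; // mean distance of points to their centroid
        for (int i = 0; i < k; i++) {
            List<double[]> points = clusters.get(i);
            centroids[i] = centroid(points);
            double total = 0.0;
            for (double[] p : points) {
                total += distance(p, centroids[i]);
            }
            scatter[i] = total / points.size();
        }
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            double worst = 0.0; // largest similarity ratio against any other cluster
            for (int j = 0; j < k; j++) {
                if (i == j) continue;
                double ratio = (scatter[i] + scatter[j]) / distance(centroids[i], centroids[j]);
                worst = Math.max(worst, ratio);
            }
            sum += worst;
        }
        return sum / k;
    }
}
```

Compact, well-separated clusters give a small scatter and a large centroid distance, which is why lower values indicate better clustering.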


Conclusion

• An open source extension for RapidMiner called “Big Data” was created;

• This extension integrates RapidMiner with Hadoop, HDFS, Hive, and Mahout;

• It initially includes 14 operators;

• A component called “Mouse Runner” was created, which provides remote activation facilities and a simplified API for activation and result retrieval for Hadoop-related components;

• A comparison between the K-Means operators from RapidMiner and Mahout showed similar results in terms of the Davies–Bouldin index;

• Mahout scaled better on the bigger datasets. RapidMiner responded instantly on the smaller dataset.

• Thank you for your attention!

nelson@ntt.ufrj.br

beckmann.marcelo@gmail.com
