
MEX Interfaces: Automating Machine Learning Metadata Generation

D. Esteves, Pablo N. Mendes, D. Moussallem, J.C. Duarte, Amrapali Zaveri, Jens Lehmann, Ciro Baron Neto, Igor Costa and Maria Claudia Cavalcanti

University of Leipzig

September 13, 2016

What’s metadata and why is it so important?

“Metadata is data that provides information about other data”

- Data Management

- Meta Analysis (ML)

- Social Engines

- ...


=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: iris
Instances: 150
Attributes: 5
    sepallength
    sepalwidth
    petallength
    petalwidth
    class
Test mode: evaluate on training data
...
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
1        0        1          1       1          1         Iris-setosa
0.98     0.02     0.961      0.98    0.97       0.99      Iris-versicolor
0.96     0.01     0.98       0.96    0.97       0.99      Iris-virginica
Weighted Avg.  0.98  0.01  0.98  0.98  0.98  0.993
...

How costly is the metadata generation process?




Reproducible Research - in theory

1. “Data/Metadata publicly available”

2. “The computer code and all the computational procedures should be available”

3. “Ideally the computer code will encompass all of the steps of computational analysis”

Dr. Peng / Dr. Jeff Leek


Reproducible Research - in practice

- Experiments are hard, if not impossible, to reproduce.


ML/DM Environments




IDEs, Frameworks, Workflow Systems, Collaborative Env.

Machine Learning Frameworks


Platform Advantage Drawbacks

MLF Front-end No (High) Interoperability

No/low updates delay

Not much code flexibility

(Low) Workflow Management



Workflow Systems / Collaborative Environments


Platform Advantage Drawbacks

WFS High Provenance

No (High) Interoperability

Interoperability (*)

Updates are tool dependent

Workflow Management

Not much code flexibility


Workflow Systems Collaborative


Integrated Development Environments / Libraries


Platform Advantage Drawbacks

IDE/MLL High code-flexibility

Low Provenance

No learning curve

Low Interoperability

Data Management costs more


IDEs motivation

Research Question

1. How can machine learning variables (inputs and outputs) be exported from IDEs?

- Architecture

- Schema

2. Which existing approach minimizes the coding effort (IDEs)?


Existing solutions


Solution | Advantage | Drawbacks
stdout | No Extra Coding Effort Required | Lack of Provenance; Lack of Interoperability; Lack of Data Query Feature
DBMS | Data Query Feature | Extra Coding Effort (Integration); Lack of Provenance; Lack of Interoperability
Self-schema Definition | Straightforward Solution | Extra Coding Effort; Extra Analysis Effort (modeling); Lack of Provenance; Lack of Interoperability


MEX Interfaces: from annotations to semantic metadata


Method | Advantage | Drawbacks
MEX Interfaces | Provenance; Interoperability; Data Query Feature; Automatic Metadata Generation | Extra Processing Time; Security Issues (due to reflection)


MEX Interfaces: from annotations to semantic metadata

1. Allows metadata generation regardless of the IDE, the machine learning library, and the context of the experiment


- Generates metadata based on one of the state-of-the-art vocabularies for machine learning

:exp_cf_1_2025644708_exe_2_algo a mexalgo:Algorithm ;
    rdfs:label "Support Vector Machines" ;
    mexalgo:hasAlgorithmClass mexalgo:SupportVectorMachines ;
    mexalgo:hasHyperParameter :exp_cf_1_2025644708_exe_2_hyperpar_4, :exp_cf_1_2025644708_exe_2_hyperpar_3, :exp_cf_1_2025644708_exe_2_hyperpar_2, :exp_cf_1_2025644708_exe_2_hyperpar_1 ;
    dct:identifier "svm" .
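As a rough illustration of what producing such a description involves, the following sketch renders one algorithm resource as Turtle with plain string building. It is not the MEX implementation: the class `AlgorithmTurtleSketch`, its methods, and the shortened resource names are hypothetical, and only the property names are taken from the snippet above.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Minimal sketch (hypothetical helper, not part of MEX): renders one algorithm as Turtle. */
final class AlgorithmTurtleSketch {

    /** Wraps a value as a Turtle string literal, escaping backslashes and double quotes. */
    static String literal(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds the Turtle block; prefixes (mexalgo, rdfs, dct) are assumed to be declared elsewhere. */
    static String render(String subject, String label, String algorithmClass,
                         List<String> hyperParameters, String identifier) {
        StringBuilder ttl = new StringBuilder();
        ttl.append(':').append(subject).append(" a mexalgo:Algorithm ;\n");
        ttl.append("    rdfs:label ").append(literal(label)).append(" ;\n");
        ttl.append("    mexalgo:hasAlgorithmClass mexalgo:").append(algorithmClass).append(" ;\n");
        if (!hyperParameters.isEmpty()) {
            // Several objects for the same predicate are separated by commas.
            String objects = hyperParameters.stream()
                    .map(name -> ":" + name)
                    .collect(Collectors.joining(", "));
            ttl.append("    mexalgo:hasHyperParameter ").append(objects).append(" ;\n");
        }
        ttl.append("    dct:identifier ").append(literal(identifier)).append(" .\n");
        return ttl.toString();
    }

    public static void main(String[] args) {
        // Resource names below are shortened placeholders.
        System.out.print(render("exp_algo_1", "Support Vector Machines", "SupportVectorMachines",
                List.of("exp_hyperpar_1", "exp_hyperpar_2"), "svm"));
    }
}
```

Writing and maintaining this kind of serialization code by hand for every experiment is the effort that automatic metadata generation is meant to remove.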


MEX Vocabulary


Vocabularies and Ontologies for ML/DM






W3C ML-Schema Community Group KDD


Vocabularies and Ontologies for ML/DM: MEX


Agnieszka Lawrynowicz et al. “The Algorithm-Implementation-Execution Ontology Design Pattern”, WOP2016.

Algorithm ⊑ InformationEntity
Implementation ⊑ InformationEntity
Implementation ⊑ ∃implements.Algorithm
Implementation ⊑ ∃hasParameter.Parameter
Execution ⊑ Process
Execution ⊑ ∃hasInput.ParameterSetting
Execution ⊑ ∃realizes.Algorithm
...
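One way to read the existential axioms above is "every instance has at least one such relation". The sketch below mirrors that reading with plain Java records (Java 16 or later). All type and method names are hypothetical and chosen only to echo the axioms; the example values reuse the Weka scheme shown earlier (J48 with -C 0.25 -M 2), J48 being Weka's implementation of C4.5.

```java
import java.util.List;

/** Sketch only: the Algorithm-Implementation-Execution pattern as plain Java types. */
final class AiePatternSketch {

    record Algorithm(String name) { }

    record Parameter(String name) { }

    record ParameterSetting(Parameter parameter, String value) { }

    /** Implementation ⊑ ∃implements.Algorithm and Implementation ⊑ ∃hasParameter.Parameter */
    record Implementation(String name, Algorithm implemented, List<Parameter> parameters) {
        Implementation {
            require(implemented != null, "an implementation implements some algorithm");
            require(parameters != null && !parameters.isEmpty(), "an implementation has some parameter");
            parameters = List.copyOf(parameters);
        }
    }

    /** Execution ⊑ ∃realizes.Algorithm and Execution ⊑ ∃hasInput.ParameterSetting */
    record Execution(String id, Algorithm realized, List<ParameterSetting> inputs) {
        Execution {
            require(realized != null, "an execution realizes some algorithm");
            require(inputs != null && !inputs.isEmpty(), "an execution has some parameter setting as input");
            inputs = List.copyOf(inputs);
        }
    }

    private static void require(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    public static void main(String[] args) {
        Algorithm c45 = new Algorithm("C4.5");
        Parameter confidence = new Parameter("C");
        Parameter minObjects = new Parameter("M");
        Implementation j48 = new Implementation("weka.classifiers.trees.J48", c45,
                List.of(confidence, minObjects));
        Execution run = new Execution("run-1", c45, List.of(
                new ParameterSetting(confidence, "0.25"),
                new ParameterSetting(minObjects, "2")));
        System.out.println(j48);
        System.out.println(run);
    }
}
```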

MEX is a lightweight and flexible schema for machine learning, based on PROV-O.


MEX Interfaces: Features

- Reduces the overall coding effort

- No wrappers/adaptors/connectors to DBMS

- No schema development required

- Transparent process


Features: MEX Annotations

- Transparent process (annotations and reflection)


@ExperimentInfo(identifier = "e1", createdBy = "Esteves", email = "[email protected]", title = "Weka Lib Example", tags = {"WEKA","J48", "DecisionTable", "MEX",


@Hardware(cpu = MEXEnum.EnumProcessors.INTEL_COREI7, memory = MEXEnum.EnumRAM.SIZE_8GB, hdType = "SSD")
@SamplingMethod(klass = MEXEnum.EnumSamplingMethods.CROSS_VALIDATION, trainSize = 0.5, testSize = 0.5, folds = 10)
@InterfaceVersion(version = MEXEnum.EnumAnnotationInterfaceStyles.M1)
public class WekaExample001 { … }

java -cp /home/mexframework org.aksw.mex.framework.MetaGeneration -uc -out
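To make the "annotations and reflection" idea concrete, here is a minimal, self-contained sketch of how class-level annotations can be read at runtime. The annotation types below are simplified stand-ins defined only for this example (string and numeric members instead of the `MEXEnum` values), and `ClassAnnotationSketch` with its `describe` method is hypothetical. It should not be read as the MEX implementation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: simplified stand-in annotations, not the MEX ones. */
final class ClassAnnotationSketch {

    @Retention(RetentionPolicy.RUNTIME) // RUNTIME retention makes the annotation visible to reflection
    @Target(ElementType.TYPE)
    @interface ExperimentInfo {
        String identifier();
        String createdBy();
        String title();
    }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface SamplingMethod {
        String klass();
        double trainSize();
        double testSize();
        int folds();
    }

    @ExperimentInfo(identifier = "e1", createdBy = "Esteves", title = "Weka Lib Example")
    @SamplingMethod(klass = "CROSS_VALIDATION", trainSize = 0.5, testSize = 0.5, folds = 10)
    static class AnnotatedExperiment { }

    /** Collects the annotation values of a class into an ordered key/value map. */
    static Map<String, String> describe(Class<?> experiment) {
        Map<String, String> metadata = new LinkedHashMap<>();
        ExperimentInfo info = experiment.getAnnotation(ExperimentInfo.class);
        if (info != null) {
            metadata.put("identifier", info.identifier());
            metadata.put("createdBy", info.createdBy());
            metadata.put("title", info.title());
        }
        SamplingMethod sampling = experiment.getAnnotation(SamplingMethod.class);
        if (sampling != null) {
            metadata.put("samplingMethod", sampling.klass());
            metadata.put("trainSize", String.valueOf(sampling.trainSize()));
            metadata.put("testSize", String.valueOf(sampling.testSize()));
            metadata.put("folds", String.valueOf(sampling.folds()));
        }
        return metadata;
    }

    public static void main(String[] args) {
        describe(AnnotatedExperiment.class).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The experiment class itself contains no metadata-handling code; everything is declared once and picked up by the reader, which is what keeps the process transparent for the experimenter.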



@Algorithm(algorithmClass = MEXEnum.EnumAlgorithmsClasses.J48, algorithmID = "1", algorithmName = "J48", algorithmURI = "
public J48 wekaJ48;

@Algorithm(algorithmClass = MEXEnum.EnumAlgorithmsClasses.PART, algorithmID = "2", algorithmName = "PART", algorithmURI = "
public PART wekaPART;

@DatasetName public String ds = "iris.arff";
Instances data;
...


@Measure(idMeasure = MEXEnum.EnumMeasures.ERROR) public List<Double> errors;
@Measure(idMeasure = MEXEnum.EnumMeasures.ACCURACY) public List<Double> accuracies;

public void myMainMethod() throws Exception {...}
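The field-level annotations follow the same idea: the experiment runs first, and the annotated fields are read afterwards. The sketch below assumes a simplified `@Measure` annotation and a `MeasureKind` enum defined only for this example. The method name `myMainMethod` and the J48 values (accuracy 94.0, error 6.0) come from the slides; everything else is hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Sketch only: reads annotated result fields after invoking the experiment reflectively. */
final class MeasureReflectionSketch {

    enum MeasureKind { ERROR, ACCURACY }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Measure {
        MeasureKind idMeasure();
    }

    static class Experiment {
        @Measure(idMeasure = MeasureKind.ERROR)
        public List<Double> errors = new ArrayList<>();

        @Measure(idMeasure = MeasureKind.ACCURACY)
        public List<Double> accuracies = new ArrayList<>();

        public String notes = "not annotated, so it is skipped";

        public void myMainMethod() {
            // Stand-in for the real training and evaluation code.
            accuracies.add(94.0);
            errors.add(6.0);
        }
    }

    /** Runs the experiment's main method, then collects every public field tagged with @Measure. */
    static Map<MeasureKind, List<Double>> collect(Object experiment) {
        Map<MeasureKind, List<Double>> measures = new EnumMap<>(MeasureKind.class);
        try {
            experiment.getClass().getMethod("myMainMethod").invoke(experiment);
            for (Field field : experiment.getClass().getFields()) {
                Measure measure = field.getAnnotation(Measure.class);
                if (measure == null || !List.class.isAssignableFrom(field.getType())) {
                    continue;
                }
                @SuppressWarnings("unchecked")
                List<Double> values = (List<Double>) field.get(experiment);
                measures.put(measure.idMeasure(), List.copyOf(values));
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not read the experiment via reflection", e);
        }
        return measures;
    }

    public static void main(String[] args) {
        System.out.println(collect(new Experiment()));
    }
}
```

Invoking user code reflectively is also where the extra processing time and the security concerns listed among the drawbacks come from.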


MEX Annotations Log


Starting the process: MetaGeneration -uc interfaces.WekaExample001 -out mymex01.ttl
[main] INFO org.aksw.mex.interfaces.MetaGeneration - ********************** MEX Interfaces **********************
[main] INFO org.aksw.mex.interfaces.MetaGeneration -
[main] INFO org.aksw.mex.interfaces.MetaGeneration - Starting the meta annotation for class named: WekaExample001
[main] INFO org.aksw.mex.interfaces.MetaGeneration - @ExperimentInfo - OK
[main] INFO org.aksw.mex.interfaces.MetaGeneration - @Hardware - OK
[main] INFO org.aksw.mex.interfaces.MetaGeneration - @SamplingMethod - OK
[main] INFO org.aksw.mex.interfaces.MetaGeneration - invoking the main method: start
[main] INFO interfaces.WekaExample001 - Accuracy of J48: 94.00% - Error: 6.00%
[main] INFO interfaces.WekaExample001 - Accuracy of PART: 90.67% - Error: 9.33%
[main] INFO interfaces.WekaExample001 - Accuracy of DecisionTable: 92.67% - Error: 7.33%
[main] INFO interfaces.WekaExample001 - Accuracy of DecisionStump: 36.67% - Error: 63.33%
[main] INFO org.aksw.mex.interfaces.MetaGeneration - invoking the features method: getFeatures
[main] INFO org.aksw.mex.interfaces.MetaGeneration - @DataSet - OK
[main] INFO org.aksw.mex.interfaces.MetaGeneration - @Algorithm - OK
[main] INFO org.aksw.mex.interfaces.MetaGeneration - :: starting to add executions...
[main] INFO org.aksw.mex.interfaces.MetaGeneration - :: nr. executions = 4
[main] INFO org.aksw.mex.interfaces.MetaGeneration - :: idExecution = C1_MEX_EXEC_D1
[main] INFO org.aksw.mex.interfaces.MetaGeneration - error of Execution C1_MEX_EXEC_D1 : 6.0
[main] INFO org.aksw.mex.interfaces.MetaGeneration - accuracy of Execution C1_MEX_EXEC_D1 : 94.0
[main] INFO org.aksw.mex.interfaces.MetaGeneration - :: idExecution = C1_MEX_EXEC_D2
[main] INFO org.aksw.mex.interfaces.MetaGeneration - error of Execution C1_MEX_EXEC_D2 : 9.333333333333329
[main] INFO org.aksw.mex.interfaces.MetaGeneration - accuracy of Execution C1_MEX_EXEC_D2 : 90.66666666666667
[main] INFO org.aksw.mex.interfaces.MetaGeneration - :: idExecution = C1_MEX_EXEC_D3
[main] INFO org.aksw.mex.interfaces.MetaGeneration - error of Execution C1_MEX_EXEC_D3 : 7.333333333333329
[main] INFO org.aksw.mex.interfaces.MetaGeneration - accuracy of Execution C1_MEX_EXEC_D3 : 92.66666666666667
[main] INFO org.aksw.mex.interfaces.MetaGeneration - :: idExecution = C1_MEX_EXEC_D4
[main] INFO org.aksw.mex.interfaces.MetaGeneration - error of Execution C1_MEX_EXEC_D4 : 63.333333333333336
[main] INFO org.aksw.mex.interfaces.MetaGeneration - accuracy of Execution C1_MEX_EXEC_D4 : 36.666666666666664
[main] WARN org.aksw.mex.log4mex.MEXSerializer - No model defined
[main] WARN org.aksw.mex.log4mex.MEXSerializer - No tool defined
[main] WARN org.aksw.mex.log4mex.MEXSerializer - No tool parameter defined
[main] INFO org.aksw.mex.interfaces.MetaGeneration - The MEX file has been successfully created: share it ;-)
[main] INFO org.aksw.mex.interfaces.MetaGeneration - process execution time (s): 1

Features: Output file sample

Metadata file example

:hardware a mexcore:HardwareConfiguration, prov:Entity ;
    mexcore:cpu "Intel Core i7" ;
    mexcore:hd "SSD" ;
    mexcore:memory "8 GB" .

:execution_3 a mexcore:OverallExecution ;
    mexcore:group "true" ;
    prov:id "3" ;
    prov:used this:phaseTEST ;
    prov:wasInformedBy this:configuration1 .
...

:sampling a prov:Entity, mexcore:CrossValidation ;
    mexcore:folds "10" ;
    mexcore:sequential "true" ;
    mexcore:testSize "0.5" ;
    mexcore:trainSize "0.5" .

:feature4 a mexcore:Feature, prov:Entity ;
    rdfs:label "petalwidth" ;
    dct:identifier "4" .

:measure3_1 a mexperf:StatisticalMeasure ;
    mexperf:error "7.333333333333329" ;
    prov:wasInformedBy this:execution_3 .
...


Features: Logging Library

- Transparent process (logging)


MyMEX mex = new MyMEX();
mex.setAuthorName("D Esteves");
mex.setAuthorEmail("[email protected]");
mex.setOrganization("Leipzig University");
mex.setExperimentTitle("my first experiment");
… }


try{ MEXSerializer.getInstance().saveToDisk("./metafiles/log4mex/ex003", "", mex,


}catch (Exception e){









ACY, x);


Features: Output file sample

MEX Interfaces output vs. Weka default output


=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      147    98%
Incorrectly Classified Instances    3      2%
Kappa statistic                     0.97
Mean absolute error                 0.0233
Root mean squared error             0.108
Relative absolute error             5.2482%
Root relative squared error         22.9089%
Total Number of Instances           150

this:m11 a prov:Entity, mexperf:PerformanceMeasure ;
    dct:identifier "WekaPerformances" ;
    mexperf:accuracy "0.9768"^^xsd:float ;
    mexperf:truePositive "147"^^xsd:integer ;
    mexperf:falsePositive "3"^^xsd:integer ;
    mexperf:kappaStatistics "0.97"^^xsd:float ;
    mexperf:meanAbsoluteError "0.0233"^^xsd:float ;
    mexperf:rootMeanSquaredError "0.108"^^xsd:float ;
    prov:wasGeneratedBy this:ep1 .
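For contrast, getting even one typed value out of the free-text Weka summary requires ad hoc parsing. The following sketch assumes the summary layout shown above (one label per line, followed by whitespace and a count); the class name and regular expressions are hypothetical and would break if the output format changed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: extracts accuracy from a Weka-style text summary with regular expressions. */
final class WekaSummaryParserSketch {

    // Anchored at line start so that "Incorrectly Classified Instances" is not matched.
    private static final Pattern CORRECT =
            Pattern.compile("(?m)^Correctly Classified Instances\\s+(\\d+)");
    private static final Pattern TOTAL =
            Pattern.compile("(?m)^Total Number of Instances\\s+(\\d+)");

    /** Returns the first integer captured by the pattern, or fails loudly if the line is missing. */
    static int firstInt(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        if (!matcher.find()) {
            throw new IllegalArgumentException("summary does not contain: " + pattern.pattern());
        }
        return Integer.parseInt(matcher.group(1));
    }

    /** Accuracy as the fraction of correctly classified instances. */
    static double accuracy(String summary) {
        return firstInt(CORRECT, summary) / (double) firstInt(TOTAL, summary);
    }

    public static void main(String[] args) {
        String summary = String.join("\n",
                "Correctly Classified Instances      147    98%",
                "Incorrectly Classified Instances    3      2%",
                "Total Number of Instances           150");
        System.out.println("accuracy = " + accuracy(summary));
    }
}
```

Each tool prints its results differently, so a parser like this has to be rewritten per tool, whereas the typed RDF properties can be queried directly.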


Features: Data Visualization (


Features: Data Management “for free”


- An ML model that takes 3 days to run

- Iterates 300 times

- Produces/has 5000 outputs/inputs

- ….

- What are the top 4 configurations?



PREFIX mexcore: <>
PREFIX mexperf: <>
PREFIX mexalgo: <>
PREFIX prov: <>
PREFIX rdfs: <>

SELECT DISTINCT ?ExecutionID ?Algorithm ?Performance ?fMeasure WHERE {
    ?execution prov:used ?alg ;
               prov:id ?ExecutionID .
    ?Performance prov:wasGeneratedBy ?execution .
    ?Performance mexperf:f1Measure ?fMeasure .
    ?alg a mexalgo:Algorithm .
    ?alg rdfs:label ?Algorithm .
}
ORDER BY DESC(?fMeasure)




?ExecutionID | ?Algorithm | ?Performance | ?fMeasure
"C0_MEX_EXEC_D44" | "BaggingJ48" | mea_clas_C0_MEX_EXEC_D44_cf_1_-568657719 | 0.9968
"C0_MEX_EXEC_D24" | "Logistic Model Trees" | mea_clas_C0_MEX_EXEC_D24_cf_1_-568657719 | 0.9952
"C0_MEX_EXEC_D16" | "Random Forest" | mea_clas_C0_MEX_EXEC_D16_cf_1_-568657719 | 0.9920
"C0_MEX_EXEC_D64" | "Multilayer Perceptron" | mea_clas_C0_MEX_EXEC_D64_cf_1_-568657719 | 0.99
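Without queryable metadata, the same "top configurations" question has to be answered in application code. The sketch below shows what that looks like in plain Java (Java 16 or later for records), using only the four result rows above as sample data. The `Result` record and the `top` method are hypothetical names introduced for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch only: ranks execution results by F-measure, highest first. */
final class TopConfigurationsSketch {

    record Result(String executionId, String algorithm, double fMeasure) { }

    /** Returns the n results with the highest F-measure, in descending order. */
    static List<Result> top(List<Result> results, int n) {
        Comparator<Result> byFMeasure = Comparator.comparingDouble(Result::fMeasure);
        return results.stream()
                .sorted(byFMeasure.reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Rows taken from the query result above, deliberately listed out of order.
        List<Result> results = List.of(
                new Result("C0_MEX_EXEC_D16", "Random Forest", 0.9920),
                new Result("C0_MEX_EXEC_D64", "Multilayer Perceptron", 0.99),
                new Result("C0_MEX_EXEC_D44", "BaggingJ48", 0.9968),
                new Result("C0_MEX_EXEC_D24", "Logistic Model Trees", 0.9952));
        for (Result result : top(results, 4)) {
            System.out.println(result.executionId() + " " + result.algorithm() + " " + result.fMeasure());
        }
    }
}
```

This only works once the results are already loaded into typed objects; with the metadata in RDF, the ranking is a single declarative query.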



Conclusions and Future Work

Machine Learning Metadata Generation

- Generating high quality metadata is not a straightforward process

- Dealing with different outputs is not time-efficient

RQ1/RQ2 -> New methodology

- Automates the metadata generation process for IDEs

- Data Management

- Based on one of the state-of-the-art vocabularies

Future Work

- Integrate other ML ontologies (ML-Schema)

- Analyse the coverage of the methodology with more machine learning scenarios

- Create a more robust framework (e.g., automatic pipelines based on configuration files)



Thank you! Questions?

