Data Mining Tools
Jean-Gabriel Ganascia
LIP6 – University Pierre et Marie Curie
4, place Jussieu, 75252 Paris, Cedex 05
[email protected]

Jean-Gabriel GANASCIA, ACASA team – LIP6 – UPMC – Sorbonne Universités

Data mining

[Figure: the data mining process, a chain of stages that each draw on the databases (DB)]
• Selection: SQL / OQL, ad hoc; Google, Yahoo, AltaVista, ...
• Pre-treatment: reformulation, K. domain, reducing dimensions
• Extraction (data mining): supervised / non-supervised, symbolic / sequences; ID3, C4.5, CHARADE, FOIL, REMO, ...; Wspot; Cobweb, COING; FLEXPAT
• Interpretation / Visualization: graphs, rules, 3D, RA, VR, ...
• Evaluation: …

Free tools

• R-project: statistical library
• TANAGRA – Sipina (Lyon), http://eric.univ-lyon2.fr/~ricco/tanagra/fr/tanagra.html
• Weka – New Zealand (Java language)
• Orange – Slovenia (Python language)
• RapidMiner (Yale)
• AlphaMiner
• Mallet – Machine Learning for Language Toolkit (Java language), University of Massachusetts, http://mallet.cs.umass.edu

What do those tools contain?

Input file

File format: “.tab”, “arff”, etc.

Input – type “.tab”

• Line 1: attribute names
• Line 2: attribute types
• Line 3: class
• Separation: tab

Example – file “lenses.tab”

age prescription astigmatic tear_rate lenses

discrete discrete discrete discrete discrete

class

young myope no reduced none

young myope no normal soft

presbyopic hypermetrope yes normal none
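
The three-line header of a “.tab” file can be read with a few lines of Python. This is a minimal sketch, not the loader of any of the tools listed: the helper name `read_tab` is hypothetical, and it assumes that the third line carries the word `class` in the column of the class attribute, with a tab between every field.

```python
import csv
import io

def read_tab(text):
    """Parse a .tab file with a three-line header (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    names, types, flags = rows[0], rows[1], rows[2]
    # Pad the flag line in case trailing empty fields are missing.
    flags = flags + [""] * (len(names) - len(flags))
    class_name = names[flags.index("class")]
    # Skip blank lines; map each example to {attribute name: value}.
    examples = [dict(zip(names, row)) for row in rows[3:] if row]
    return names, types, class_name, examples

# The lenses.tab excerpt from the slide, with the tabs made explicit.
LENSES = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
    "\t\t\t\tclass\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
    "presbyopic\thypermetrope\tyes\tnormal\tnone\n"
)

names, types, class_name, examples = read_tab(LENSES)
print(class_name, len(examples), examples[1]["lenses"])
```

Keeping each example as a dictionary makes the class value available by name, whichever column holds it.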

Input “ARFF”: Attribute-Relation File Format

Header
• Comments – preceded by %
• @RELATION <relation name> (1 line)
• @ATTRIBUTE <attribute name> <attribute type> (list of all attributes – 1 per line)

Data
• @DATA, followed by <valA1>,<valA2>,… (list of all examples – 1 per line)

Type
• Numeric
• <nominal-specification> – set of values
• String – between apostrophes if the string contains blanks
• Date [<date format>]

Example ARFF Header

% 1. Title: Iris Plants Database

%

% 2. Sources:

% (A) Creator: RA Fisher

% (B) Donor: Michael Marshall (MARSHALL%[email protected])

% (C) Date: July, 1988

%

@RELATION iris

@ATTRIBUTE sepallength NUMERIC

@ATTRIBUTE sepalwidth NUMERIC

@ATTRIBUTE petallength NUMERIC

@ATTRIBUTE petalwidth NUMERIC

@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Example ARFF Data

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
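
To make the header and data sections concrete, here is a minimal sketch of an ARFF reader in Python. It is an illustration under simplifying assumptions, not Weka's parser: the function name `parse_arff` is hypothetical, and it ignores quoted names, quoted values containing commas, and the sparse format.

```python
def parse_arff(text):
    """Minimal ARFF reader: comments, @RELATION, @ATTRIBUTE, @DATA (sketch only)."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            data.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()  # keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the set of allowed values as a list.
                spec = [value.strip() for value in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, data

IRIS = """% Excerpt of the iris file
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

relation, attributes, data = parse_arff(IRIS)
print(relation, len(attributes), data[0])
```

Values are kept as strings; converting the NUMERIC columns to floats is left to the caller.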

Sparse ARFF

If there are many null values

The same, except for data

Non-null attributes are identified by their rank (counted from 0)

Example ARFF

@data
0, X, 0, Y, 'class A'
0, 0, W, 0, 'class B'

Example Sparse ARFF

@data
{1 X, 3 Y, 4 'class A'}
{2 W, 4 'class B'}

Remark: absent values correspond to 0; missing values are identified with '?'
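
The correspondence between the dense and the sparse data lines can be sketched as a pair of conversions. The helper names `to_sparse` and `to_dense` are hypothetical, values are handled as strings, and the sketch assumes that no value contains a comma.

```python
def to_sparse(row):
    """Encode a dense row as a sparse ARFF data line; indices start at 0."""
    pairs = [f"{index} {value}" for index, value in enumerate(row) if value != "0"]
    return "{" + ", ".join(pairs) + "}"

def to_dense(line, n_attributes):
    """Decode a sparse ARFF data line; absent attributes are read as 0."""
    row = ["0"] * n_attributes
    for pair in line.strip().strip("{}").split(","):
        if not pair.strip():
            continue  # an empty "{}" line means every value is 0
        index, value = pair.strip().split(None, 1)
        row[int(index)] = value
    return row

print(to_sparse(["0", "X", "0", "Y", "'class A'"]))
print(to_dense("{2 W, 4 'class B'}", 5))
```

A missing value written as `?` is different from 0, so it stays in the sparse line as an explicit `index ?` pair.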

Other steps

• Data preparation
  – Feature selection
  – Data selection
  – Digitalization
  – Sampling
  – Outliers
  – File merging (join)
  – Concatenation
  – …
• Data visualization
• Classification
• Regression
• Evaluation
• Non-supervised learning
• Association rules
• Text mining

Data visualization

• Exploratory Data Analysis
• Distributions
• Linear projection
• Attribute statistics
• Correspondence analysis
• Mosaic diagrams
• …

Classification

• Bayesian classification
• Logistic regression
• K nearest neighbor
• Trees
• C4.5
• CN2
• SVM
• Visualization of the classification
  – Trees
  – CN2 rules
  – …
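
Of the classifiers listed, k nearest neighbor is the shortest to write down. The sketch below is a plain-Python illustration, not the implementation found in Weka or Orange; the function name `knn_predict` is hypothetical and the toy points are invented for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training examples closest to query.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy points, invented for illustration only.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

An odd k avoids ties in two-class problems; with nominal attributes such as those of lenses.tab, the Euclidean distance would have to be replaced by a measure suited to discrete values.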

Non-supervised learning

• Distance matrix between examples
• Distance matrix between attributes
• Dendrograms
• K-means
• …
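
Two of these items, the distance matrix between examples and K-means, can be sketched in plain Python. This is an illustration under simple assumptions (Euclidean distance, initial centroids supplied by the caller so that the run is deterministic); the function names are hypothetical and the points are invented.

```python
import math

def distance_matrix(points):
    """Symmetric matrix of Euclidean distances between examples."""
    return [[math.dist(p, q) for q in points] for p in points]

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Toy points, invented for illustration only.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, clusters = kmeans(points, [(0, 0), (10, 0)])
print(centroids)
print(distance_matrix(points)[0])
```

The distance matrix is also the input of hierarchical clustering, whose result is drawn as a dendrogram.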

Evaluation – supervised learning

• Separation
  – Random
  – Leave one out
  – Cross validation
• Indices
  – Precision-recall
  – ROC
  – …
• Test: training set / test set
• …
• Confusion matrix
• ROC analysis
• Prediction
• …
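
The separation schemes and indices above can be sketched as follows. This is a minimal illustration with hypothetical function names: the folds are built without shuffling, leave-one-out is obtained by setting k to the number of examples, and the label sequences are invented.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    With k == n, each test set holds a single example (leave one out).
    """
    for fold in range(k):
        test = list(range(fold, n, k))
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def confusion_matrix(actual, predicted):
    """Count each (actual label, predicted label) pair."""
    return Counter(zip(actual, predicted))

def precision_recall(matrix, positive):
    """Precision and recall of one class, read off the confusion matrix."""
    tp = matrix[(positive, positive)]
    fp = sum(n for (a, p), n in matrix.items() if p == positive and a != positive)
    fn = sum(n for (a, p), n in matrix.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented labels, for illustration only.
actual = ["soft", "none", "none", "soft", "none"]
predicted = ["soft", "none", "soft", "none", "none"]

matrix = confusion_matrix(actual, predicted)
print(precision_recall(matrix, "soft"))
print(list(k_fold_indices(5, 2)))
```

In practice the examples are shuffled before the folds are cut, which gives the random separation listed first.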

Association rules

• Extraction of association rules
• Visualization of association rules
• Frequent sets
• …
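
Frequent sets and the rules derived from them can be illustrated with a brute-force sketch. It enumerates candidates size by size rather than implementing Apriori's candidate pruning in full, so it only suits small examples; the function names and the transactions are invented for the illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    transactions is a list of sets; support is a fraction of the transactions.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            support = sum(set(candidate) <= t for t in transactions) / n
            if support >= min_support:
                frequent[candidate] = support
                found = True
        if not found:
            break  # no frequent set of this size, hence none of a larger size
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = tuple(sorted(set(antecedent) | set(consequent)))
    return frequent[both] / frequent[tuple(sorted(antecedent))]

# Invented transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
print(sorted(frequent))
print(confidence(frequent, ("butter",), ("bread",)))
```

A rule is usually kept only when both its support and its confidence exceed thresholds chosen by the analyst.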

Specialized applications

• Bioinformatics
  – Genome databases
  – Gene selection
  – Profiles
  – …
• Text mining
  – Text file
  – Preprocessing (TF.IDF, lemmatization, stemming, …)
  – Bags of words
  – N-grams of characters
  – N-grams of words
  – Feature extraction
  – Distance
  – …
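
Several of the text mining steps fit in a short Python sketch: n-grams of characters, n-grams of words, and a TF.IDF weighting over bags of words. TF.IDF has many variants; the one assumed here is term frequency relative to document length, multiplied by log(N / document frequency). The function names are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """All substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """All runs of n consecutive lowercased, whitespace-separated words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def tf_idf(documents):
    """documents is a list of token lists; returns one {term: weight} dict each."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)  # the bag of words of this document
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["data", "mining", "tools"], ["data", "base"]]
weights = tf_idf(docs)
print(char_ngrams("weka", 2))
print(word_ngrams("Data mining tools", 2))
print(weights[0])
```

A term present in every document, such as "data" here, gets weight 0, which is the intended effect of the IDF factor.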

SPMF: An Open-Source Data Mining Library
http://www.philippe-fournier-viger.com/spmf/

• Pattern mining
• Sequential rule mining
• Itemset mining
• …

Weka

Written in Java

Weka: http://www.cs.waikato.ac.nz/ml/weka/

Orange

University of Ljubljana – Slovenia
Programmed with Python
http://www.ailab.si/orange/

Machine ARI: orange-canvas