Data Mining Tools
Jean-Gabriel Ganascia
LIP6 – University Pierre et Marie Curie
4, place Jussieu, 75252 Paris, [email protected]
Jean-Gabriel GANASCIA, Equipe ACASA – LIP6 – UPMC – Sorbonne Universités
Data mining
[Figure: the data mining process, starting from DATA BASES (several DBs) and going through four steps]
Selection: SQL / OQL, ad hoc; Google, Yahoo, AltaVista, ...
Pre-treatment: Reformulation, K. domain, Reducing dimensions
Extraction (Data mining): supervised / non-supervised, symbolic / sequences; ID3, C4.5, CHARADE, FOIL, REMO, ...; Wspot; Cobweb, COING; FLEXPAT
Interpretation / Visualization: Graphs, Rules, 3D, RA, VR, ...
Evaluation…
Free Tools
R-project: statistical library
TANAGRA – Sipina (Lyon), http://eric.univ-lyon2.fr/~ricco/tanagra/fr/tanagra.html
Weka – New Zealand (Java language)
Orange – Slovenia (Python language)
RapidMiner (formerly YALE)
AlphaMiner
Mallet – Machine Learning for Language Toolkit (Java language), University of Massachusetts, http://mallet.cs.umass.edu
…
What do those tools contain?
Input file
File format: ".tab", "arff", etc.
Input – type ".tab"
Line 1: attribute names
Line 2: attribute types
Line 3: class (marks which attribute is the class)
Separation: tab
Example – file "lenses.tab"
age prescription astigmatic tear_rate lenses
discrete discrete discrete discrete discrete
class
young myope no reduced none
young myope no normal soft
…
presbyopic hypermetrope yes normal none
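The three header lines described above can be read with a few lines of code. The following is a minimal sketch, not the loader of any particular tool. It assumes that the "class" marker on line 3 sits in the column of the class attribute, and it contains only the three data rows visible in the excerpt. The function name read_tab is hypothetical.

```python
import csv
import io

# Sample in the three-header-line ".tab" layout (only the rows shown in the excerpt).
# Assumption: the "class" flag on line 3 is in the column of the class attribute.
LENSES_TAB = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
    "\t\t\t\tclass\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
    "presbyopic\thypermetrope\tyes\tnormal\tnone\n"
)


def read_tab(stream):
    """Return (names, types, class_name, rows) from a tab-separated stream."""
    reader = csv.reader(stream, delimiter="\t")
    names = next(reader)   # line 1: attribute names
    types = next(reader)   # line 2: attribute types
    flags = next(reader)   # line 3: flags such as "class"
    # Pad the flag line in case trailing empty cells were dropped.
    flags += [""] * (len(names) - len(flags))
    class_name = next((n for n, f in zip(names, flags) if f == "class"), None)
    rows = [dict(zip(names, row)) for row in reader if row]
    return names, types, class_name, rows


names, types, class_name, rows = read_tab(io.StringIO(LENSES_TAB))
print(class_name, len(rows))
```

To read a file on disk, the same function can be given an open file object instead of the in-memory sample.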
Input «ARFF»: Attribute-Relation File Format
Header
Comments: preceded by %
@RELATION <relation name> (1 line)
@ATTRIBUTE <attribute name> <attribute type> (list of all attributes, 1 per line)
@DATA
<valA1>,<valA2>,… (list of all examples, 1 per line)
Type:
Numeric
<nominal-specification>: set of values
String: in quotes if the string contains blanks
Date [<date format>]
Example ARFF Header
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (A) Creator: RA Fisher
% (B) Donor: Michael Marshall (MARSHALL%[email protected])
% (C) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Example ARFF Data
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
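To make the header/data split of ARFF concrete, here is a minimal reading sketch in plain Python. It is an illustration under simplifying assumptions, not Weka's parser: it ignores quoted names, values that contain commas, sparse rows and date attributes. The names parse_arff and convert are hypothetical.

```python
NUMERIC_TYPES = {"NUMERIC", "REAL", "INTEGER"}


def parse_arff(text):
    """Return (relation, [(attribute name, type spec)], raw data rows)."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if in_data:                            # after @DATA: one example per line
            data.append([v.strip() for v in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].upper()
        if keyword == "@RELATION":
            relation = line.split(None, 1)[1]
        elif keyword == "@ATTRIBUTE":
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, data


def convert(value, spec):
    """Turn numeric values into floats and '?' (missing) into None."""
    if value == "?":
        return None
    return float(value) if spec.upper() in NUMERIC_TYPES else value


SAMPLE = """% 1. Title: Iris Plants Database
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

relation, attributes, raw_rows = parse_arff(SAMPLE)
rows = [[convert(v, spec) for v, (_, spec) in zip(row, attributes)]
        for row in raw_rows]
print(relation, len(attributes), rows[0])
```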
Sparse ARFF
If there are many null values
The same, except for data
Non-null attributes are identified by their rank (counted from 0)
Example ARFF
@data
0, X, 0, Y, 'class A'
0, 0, W, 0, 'class B'
Example Sparse ARFF
@data
{1 X, 3 Y, 4 'class A'}
{2 W, 4 'class B'}
Remark: absent values correspond to 0; missing values are identified with '?'
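The correspondence between the sparse and the dense notation can be checked with a small conversion routine. This is a sketch under simplifying assumptions: values never contain commas, and quotes are simply stripped. The function name sparse_to_dense is hypothetical.

```python
def sparse_to_dense(line, n_attributes, default="0"):
    """Expand one sparse ARFF data line '{index value, ...}' into a full row."""
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("not a sparse ARFF row: %r" % line)
    dense = [default] * n_attributes          # absent values correspond to 0
    inner = body[1:-1].strip()
    if inner:
        for pair in inner.split(","):
            index, value = pair.strip().split(None, 1)
            dense[int(index)] = value.strip().strip("'\"")  # indices start at 0
    return dense


print(sparse_to_dense("{1 X, 3 Y, 4 'class A'}", 5))
print(sparse_to_dense("{2 W, 4 'class B'}", 5))
```

Applied to the two sparse rows of the example, this gives back the two dense rows.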
Other steps
• Data preparation
  – Feature selection
  – Data selection
  – Digitalization
  – Sampling
  – Outliers
  – File merging (join)
  – Concatenation
  – …
• Data visualization
• Classification
• Regression
• Evaluation
• Non-supervised learning
• Association rules
• Text mining
Data visualization
Exploratory Data Analysis
Distributions
Linear projection
Attribute statistics
Correspondence analysis
Mosaic diagrams
…
Classification
• Bayesian classification
• Logistic regression
• K nearest neighbors
• Trees
• C4.5
• CN2
• SVM
• Visualization of the classification
  – Trees
  – CN2 rules
  – …
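As an illustration of one classifier from the list above, here is a minimal sketch of K nearest neighbors in plain Python. It is not the implementation of any of the tools. The training points and labels are made up for the example, and the function name knn_predict is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the majority label among the k training examples closest to query."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up two-dimensional examples: (features, label).
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.8), "b"), ((4.9, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 4.9)))
```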
Non-supervised learning
• Distance matrix between examples
• Distance matrix between attributes
• Dendrograms
• K-means
• …
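The distance matrix and K-means items above can be sketched in a few lines of plain Python. This is an illustration with made-up points, not the code of any of the tools. The initialization (first point, then the point farthest from the centroids already chosen) is an assumption made to keep the example deterministic. The function names are hypothetical.

```python
import math


def distance_matrix(points):
    """Euclidean distance between every pair of examples."""
    return [[math.dist(p, q) for q in points] for p in points]


def kmeans(points, k, iterations=10):
    """Return (centroids, cluster index of each point)."""
    # Assumed initialization: first point, then farthest-point selection.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    assignments = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                   for p in points]
    return centroids, assignments


# Made-up points forming two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (0.9, 1.1), (4.9, 5.1)]
matrix = distance_matrix(points)
centroids, assignments = kmeans(points, k=2)
print(assignments)
```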
Evaluation – supervised learning
• Separation
  – Random
  – Leave one out
  – Cross validation
• Indices
  – Precision-recall
  – ROC
  – …
• Test: training set / test set
• …
• Confusion matrix
• ROC analysis
• Prediction
• …
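The separation schemes and indices listed above can be sketched as follows. The labels are made up, and the fold assignment (every k-th example) is only one possible choice. Leave one out is the special case where k equals the number of examples. The function names are hypothetical.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold i tests every k-th example from i."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def confusion_matrix(actual, predicted):
    """Count each (actual label, predicted label) pair."""
    return Counter(zip(actual, predicted))


def precision_recall(matrix, positive):
    """Precision and recall of one class, read from the confusion matrix."""
    tp = matrix[(positive, positive)]
    fp = sum(c for (a, p), c in matrix.items() if p == positive and a != positive)
    fn = sum(c for (a, p), c in matrix.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up true and predicted labels for six examples.
actual = ["soft", "none", "none", "soft", "none", "soft"]
predicted = ["soft", "none", "soft", "soft", "none", "none"]

matrix = confusion_matrix(actual, predicted)
precision, recall = precision_recall(matrix, "soft")
folds = list(k_fold_indices(len(actual), 3))
leave_one_out = list(k_fold_indices(len(actual), len(actual)))
print(precision, recall, folds[0])
```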
Association rules
Extraction of association rules
Visualization of association rules
Frequent sets
…
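To show what frequent sets and association rules are, here is a brute-force sketch on made-up transactions. It enumerates candidate item sets by increasing size and is only suitable for tiny examples. It is not the algorithm used by any specific tool, and the function names are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Map each frequent item set to its support (fraction of transactions)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t) / n
            if support >= min_support:
                result[frozenset(candidate)] = support
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    return result


def rules(itemsets, min_confidence):
    """List (left side, right side, support, confidence) for confident rules."""
    out = []
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a frequent set is frequent, so the lookup succeeds.
                confidence = support / itemsets[lhs]
                if confidence >= min_confidence:
                    out.append((set(lhs), set(itemset - lhs), support, confidence))
    return out


# Made-up transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
found = rules(frequent, min_confidence=0.9)
print(found)
```

With these thresholds the only rule kept is butter → bread: every transaction that contains butter also contains bread.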
Specialized applications
• Bioinformatics
  – Genome databases
  – Gene selection
  – Profiles
  – …
• Text mining
  – Text file
  – Preprocessing (TF.IDF, lemmatization, stemming, …)
  – Bags of words
  – N-grams of characters
  – N-grams of words
  – Feature extraction
  – Distance
  – …
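Several of the text mining steps above (bags of words, N-grams of characters and of words, TF.IDF) fit in a short sketch. The TF.IDF variant used here, relative term frequency times log(N / document frequency), is one of several common definitions and is an assumption. The documents are made up and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(sequence, n):
    """N-grams of a sequence: a list of words, or a string for characters."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def tf_idf(documents):
    """One {term: weight} dictionary per document."""
    bags = [Counter(tokenize(d)) for d in documents]   # bags of words
    n_docs = len(bags)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    return [
        {term: (count / sum(bag.values())) * math.log(n_docs / df[term])
         for term, count in bag.items()}
        for bag in bags
    ]


# Made-up documents.
documents = ["data mining tools", "text mining tools", "data bases"]

word_bigrams = ngrams(tokenize(documents[0]), 2)
char_trigrams = ngrams("data", 3)
weights = tf_idf(documents)
print(word_bigrams, char_trigrams)
```

A term that appears in only one document, such as "text", gets a higher weight than a term shared by several documents, such as "mining".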
SPMF: An Open-Source Data Mining Library
http://www.philippe-fournier-viger.com/spmf/
Pattern Mining
Sequential Rule Mining
Item Sets Mining
…
Weka
http://www.cs.waikato.ac.nz/ml/weka/