Weka
Just do it
Free and Open Source
ML Suite
Ian Witten & Eibe Frank
University of Waikato
New Zealand
Overview
• Classifiers, regressors, and clusterers
• Multiple evaluation schemes
• Bagging and boosting
• Feature selection:
– right features and data key to successful learning
• Experimenter
• Visualizer
• Text not up to date.
• They welcome additions.
Learning Tasks
• Classification: given examples labelled from a finite domain, generate a procedure for labelling unseen examples.
• Regression: given examples labelled with a real value, generate a procedure for labelling unseen examples.
• Clustering: given a set of examples, partition them into “interesting” groups. What scientists want.
Data Format: IRIS
@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Etc.
General form: @ATTRIBUTE attribute-name REAL, or attribute-name followed by a list of values in braces
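To make the format concrete, here is a minimal sketch of a reader for the subset of ARFF shown above. It is not Weka's parser: the function name parse_arff is hypothetical, and it assumes only @RELATION, @ATTRIBUTE (REAL or a brace-delimited value list), and comma-separated @DATA rows, with no quoting, sparse rows, or missing values.

```python
def parse_arff(text):
    """Parse a minimal ARFF document (illustrative subset, not the full format)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blank lines and % comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # Convert REAL columns to float; keep nominal values as strings.
            rows.append([float(v) if kind == "REAL" else v
                         for (_, kind), v in zip(attributes, values)])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.upper()
        if keyword == "@RELATION":
            relation = rest.strip()
        elif keyword == "@ATTRIBUTE":
            name, _, kind = rest.strip().partition(" ")
            kind = kind.strip()
            if kind.startswith("{"):               # nominal: list of allowed values
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.upper()
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, rows


SAMPLE = """@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), rows[0])
```

Each row comes back as a list whose last element is the class label, which is the shape the later evaluation sketches expect.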
J48 = Decision Tree
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Leaf counts read (# under node / # wrong): (48.0/1.0) means 48 training instances reach the leaf and 1 of them is misclassified.
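Read as code, the printed tree is just a set of nested threshold tests. The sketch below transcribes it by hand; the function name classify_iris is hypothetical and this is not code that Weka generates.

```python
def classify_iris(petallength, petalwidth):
    """Hand transcription of the J48 tree shown above (illustrative only)."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9: split once more on petal width
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    return "Iris-virginica"              # petalwidth > 1.7


print(classify_iris(petallength=1.4, petalwidth=0.2))
```

Only two of the four attributes appear in the tree, so the sepal measurements are not needed to classify a flower.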
Cross-validation
• Correctly Classified Instances 143 95.33 %
• Incorrectly Classified Instances 7 4.67 %
• Default is 10-fold cross-validation, i.e.
– Split data into 10 equal-sized pieces
– Train on 9 pieces and test on the remaining one
– Repeat so that each piece serves once as the test set, and average
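The procedure above can be sketched in a few lines. This is an assumption-laden illustration, not Weka's implementation: the names k_fold_indices, cross_validate, and majority_learner are hypothetical, the split is a plain shuffle without stratification, and correct predictions are pooled over folds, which equals the average of the per-fold accuracies when the folds are the same size.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k roughly equal-sized folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validate(examples, labels, train_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, pool the results."""
    correct = 0
    for train, test in k_fold_indices(len(examples), k, seed):
        model = train_fn([examples[i] for i in train],
                         [labels[i] for i in train])
        correct += sum(model(examples[i]) == labels[i] for i in test)
    return correct / len(examples)


def majority_learner(xs, ys):
    """Toy learner (stand-in for a real classifier): predict the commonest label."""
    most_common = Counter(ys).most_common(1)[0][0]
    return lambda x: most_common


demo_examples = list(range(150))
demo_labels = ["a"] * 120 + ["b"] * 30
print(cross_validate(demo_examples, demo_labels, majority_learner))
```

Any function that takes training examples and labels and returns a predictor can be passed as train_fn, so the same loop evaluates different learners on identical splits.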
J48 Confusion Matrix
Old data set from statistics: 50 of each class
a b c <-- classified as
49 1 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 3 47 | c = Iris-virginica
Precision, Recall, and Accuracy
• Precision: probability of being correct given your decision.
– Precision of Iris-setosa is 49/49 = 100%
– Positive predictive value in medical literature
• Recall: probability of correctly identifying a class.
– Recall for Iris-setosa is 49/50 = 98%
– Sensitivity in medical literature
• Accuracy: # right/total = 143/150 ≈ 95%
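The three figures can be recomputed directly from the J48 confusion matrix (rows are actual classes, columns are predicted classes). This is a minimal sketch with hypothetical function names; it assumes every class is predicted at least once, so no division by zero is handled.

```python
CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
CONFUSION = [
    [49, 1, 0],    # actual Iris-setosa
    [0, 47, 3],    # actual Iris-versicolor
    [0, 3, 47],    # actual Iris-virginica
]


def precision(matrix, c):
    """Correct predictions of class c divided by all predictions of class c (column sum)."""
    return matrix[c][c] / sum(row[c] for row in matrix)


def recall(matrix, c):
    """Correct predictions of class c divided by all actual members of class c (row sum)."""
    return matrix[c][c] / sum(matrix[c])


def accuracy(matrix):
    """Diagonal total divided by the grand total."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


for c, name in enumerate(CLASSES):
    print(name, precision(CONFUSION, c), recall(CONFUSION, c))
print("accuracy", accuracy(CONFUSION))
```

Precision reads down a column and recall reads across a row, which is why Iris-setosa has perfect precision (nothing else was labelled setosa) but 98% recall (one setosa was labelled versicolor).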
Other Evaluation Schemes
• Leave-one-out cross-validation
– n-fold cross-validation where n = number of training instances
• Specific train and test set
– Allows for exact replication
– OK if train/test sets are large, e.g. in the 10,000 range.
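Leave-one-out is simple enough to write without a fold generator: each instance is held out once and the model is trained on all the others. The sketch below uses hypothetical names and a toy one-dimensional nearest-neighbour learner chosen only to keep the example self-contained; it is not how Weka implements either piece.

```python
def leave_one_out(examples, labels, train_fn):
    """Hold out each instance in turn, train on the rest, and score the held-out one."""
    n = len(examples)
    correct = 0
    for i in range(n):
        train_x = examples[:i] + examples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train_fn(train_x, train_y)
        correct += model(examples[i]) == labels[i]
    return correct / n


def nearest_neighbor_learner(xs, ys):
    """Toy 1-nearest-neighbour learner on single numeric values (assumption)."""
    def predict(x):
        best = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[best]
    return predict


toy_x = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
toy_y = ["a", "a", "a", "b", "b", "b"]
print(leave_one_out(toy_x, toy_y, nearest_neighbor_learner))
```

The cost is one training run per instance, which is why the method suits small data sets and a fixed train/test split becomes attractive once the data is large.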
Bootstrap sampling
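• Randomly select n examples with replacement from the n available
• Expect about 2/3 of the distinct examples to be chosen for training
– Prob. that a given example is never chosen = (1 - 1/n)^n ≈ 1/e ≈ 0.37, so about 63% are chosen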
• Testing on remainder
• Repeat about 30 times and average.
• Avoids partition bias
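The 1/e figure and the split itself are easy to check numerically. The sketch below is illustrative and uses hypothetical function names: it draws n indices with replacement for training, tests on whatever was never drawn, and averages the fraction of distinct training examples over 30 repeats.

```python
import math
import random


def prob_never_chosen(n):
    """Probability that one particular example is missed by all n draws."""
    return (1 - 1 / n) ** n


def bootstrap_split(n, rng):
    """Draw n training indices with replacement; test on the indices never drawn."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


def mean_fraction_in_training(n, repeats=30, seed=0):
    """Average share of distinct examples that land in the training sample."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(repeats):
        train, _ = bootstrap_split(n, rng)
        fractions.append(len(set(train)) / n)
    return sum(fractions) / repeats


print(prob_never_chosen(150), 1 / math.e)
print(mean_fraction_in_training(1000))
```

In a full evaluation, each repeat would train on the drawn sample and score on the left-out examples, and the 30 scores would then be averaged.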