Weka Just do it Free and Open Source ML Suite Ian Witten & Eibe Frank University of Waikato New Zealand


Overview

• Classifiers, regressors, and clusterers
• Multiple evaluation schemes
• Bagging and boosting
• Feature selection:
  – the right features and data are key to successful learning
• Experimenter
• Visualizer
• Text not up to date.
• They welcome additions.

Learning Tasks

• Classification: given examples labelled from a finite domain, generate a procedure for labelling unseen examples.

• Regression: given examples labelled with a real value, generate a procedure for labelling unseen examples.

• Clustering: given a set of examples, partition them into “interesting” groups. What scientists want.

Data Format: IRIS

@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Etc.

General form: @ATTRIBUTE attribute-name, followed by REAL or a list of values in braces
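To make the format concrete, here is a minimal sketch of a reader for the subset of ARFF shown on this slide. It is not Weka's loader: `parse_arff` and `IRIS_SAMPLE` are hypothetical names, and the sketch assumes only REAL and nominal (list-of-values) attributes, comma-separated data rows, and `%` comment lines.

```python
def parse_arff(text):
    """Parse a small ARFF subset: @RELATION, @ATTRIBUTE (REAL or nominal), @DATA."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # Convert REAL columns to float, keep nominal values as strings.
            rows.append([float(v) if kind == "REAL" else v
                         for (_, kind), v in zip(attributes, values)])
            continue
        keyword = line.split(None, 1)[0].upper()
        if keyword == "@RELATION":
            relation = line.split(None, 1)[1]
        elif keyword == "@ATTRIBUTE":
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: store the list of allowed values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.upper()
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, rows


IRIS_SAMPLE = """\
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(IRIS_SAMPLE)
```

Each attribute comes back as a (name, kind) pair, where kind is either the string "REAL" or the list of nominal values.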

J48 = Decision Tree

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

(50.0) = number of instances under the node
(48.0/1.0) = number under the node / number wrong
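Read top to bottom, the tree is just a chain of threshold tests. A minimal sketch of the same tree as a plain function (the name `classify_iris` is hypothetical; the thresholds are copied from the J48 output above):

```python
def classify_iris(petallength, petalwidth):
    """Hand-transcribed version of the J48 tree on this slide."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    # petalwidth > 0.6
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petalwidth > 1.7
    return "Iris-virginica"
```

Note that only two of the four attributes are used; sepal length and sepal width never appear in the tree.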

Cross-validation

• Correctly Classified Instances 143 95.3%

• Incorrectly Classified Instances 7 4.67 %

• Default: 10-fold cross-validation, i.e.
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Repeat for all 10 choices of test piece and average
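The split-train-test-average loop can be sketched as follows. This is a plain shuffled split for illustration, not a reproduction of Weka's own implementation; `k_fold_indices`, `cross_validate`, and the `evaluate` callback are hypothetical names.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k pieces of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(n, evaluate, k=10, seed=0):
    """Average evaluate(train, test) over all k folds.

    evaluate is assumed to train on the train indices and return a
    score (e.g. accuracy) measured on the test indices.
    """
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(150))
```

With the 150 iris instances, each of the 10 rounds trains on 135 instances and tests on the remaining 15, and every instance is tested exactly once.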

J48 Confusion Matrix

Old data set from statistics: 50 of each class

  a  b  c   <-- classified as
 49  1  0 | a = Iris-setosa
  0 47  3 | b = Iris-versicolor
  0  3 47 | c = Iris-virginica

Precision, Recall, and Accuracy

• Precision: probability of being correct given your decision, i.e. given that you predicted the class.
  – Precision of Iris-setosa is 49/49 = 100%
  – Positive predictive value in medical literature

• Recall: probability of correctly identifying a class.
  – Recall for Iris-setosa is 49/50 = 98%
  – Sensitivity in medical literature

• Accuracy: # right / total = 143/150 ≈ 95%
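As a check on the arithmetic, the three measures can be computed directly from the J48 confusion matrix, where rows are true classes and columns are predicted classes. A minimal sketch; the function and constant names are hypothetical, and the counts are the ones on the confusion-matrix slide.

```python
CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# CONFUSION[i][j] = number of instances of true class i classified as class j
CONFUSION = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def precision(matrix, c):
    """Correct predictions of class c divided by all predictions of c (column sum)."""
    return matrix[c][c] / sum(row[c] for row in matrix)


def recall(matrix, c):
    """Correct predictions of class c divided by all true instances of c (row sum)."""
    return matrix[c][c] / sum(matrix[c])


def accuracy(matrix):
    """Diagonal total divided by the total number of instances."""
    right = sum(matrix[i][i] for i in range(len(matrix)))
    return right / sum(sum(row) for row in matrix)
```

For Iris-setosa (index 0) this gives precision 49/49 and recall 49/50; the overall accuracy is 143/150.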

Other Evaluation Schemes

• Leave-one-out cross-validation
  – n-fold cross-validation where n = number of training instances

• Specific train and test set
  – Allows for exact replication
  – OK if the train/test sets are large, e.g. in the 10,000 range.

Bootstrap sampling

• Randomly select n instances, with replacement, from the n available

• Expect about 2/3 (≈ 63%) of the instances to be chosen for training
  – Prob. of an instance not being chosen = (1 - 1/n)^n ≈ 1/e ≈ 0.37

• Testing on remainder

• Repeat about 30 times and average.

• Avoids partition bias
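The sampling step and the 1/e estimate can be checked with a short simulation. A minimal sketch, assuming instances are identified by their index; `bootstrap_split` is a hypothetical name and the seed and n are arbitrary.

```python
import math
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement for training; test on the ones never drawn."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


rng = random.Random(0)
n = 10_000
train, test = bootstrap_split(n, rng)

left_out = len(test) / n          # observed fraction never chosen
predicted = (1 - 1 / n) ** n      # probability a given instance is never chosen
limit = 1 / math.e                # large-n limit, about 0.368
```

For large n the observed left-out fraction should sit close to 1/e, which is why roughly 63% of the distinct instances end up in the training sample. Repeating the split (about 30 times, as above) and averaging the test scores gives the bootstrap estimate.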
