Simpler Machine Learning with SKLL

Dan Blanchard Educational Testing Service

PyData NYC 2013

[Figure: Titanic passengers labeled Survived or Perished. Example passengers: first class, female, 1 sibling, 35 years old; third class, female, 2 siblings, 18 years old; second class, male.]

Can we predict survival from data?

SciKit-Learn Laboratory

SKLL

It's where the learning happens.

Learning to Predict Survival

1. Split up given training set: train (80%) and dev (20%)

    $ ./make_titanic_example_data.py
    Creating titanic/train directory
    Creating titanic/dev directory
    Creating titanic/test directory
    Loading train.csv............done
    Loading test.csv........done
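The 80/20 split in step 1 can be sketched in plain Python. This is a minimal illustration and not the contents of make_titanic_example_data.py: the function name `split_train_dev`, the fixed seed, the shuffling, and the toy rows are all assumptions made for the example.

```python
import random


def split_train_dev(rows, dev_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, dev) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Toy stand-in for the rows of train.csv (hypothetical passenger IDs).
rows = list(range(1, 101))
train, dev = split_train_dev(rows)
print(len(train), len(dev))
```

Holding out the dev portion gives a way to compare learners without touching the test set.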

2. Pick classifiers to try:

    1. Random forest
    2. Support Vector Machine (SVM)
    3. Naive Bayes

3. Create configuration file for SKLL

    [General]
    experiment_name = Titanic_Evaluate
    task = evaluate

    [Input]
    train_location = train
    test_location = dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Output]
    results = output
    models = output

• train_location: directory with feature files for training learner
• test_location: directory with feature files for evaluating performance
• family.csv: # of siblings, spouses, parents, children
• misc.csv: departure port
• socioeconomic.csv: fare & passenger class
• vitals.csv: sex & age
• results: directory to store evaluation results
• models: directory to store trained models
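The configuration file is laid out in INI-style sections, and the featuresets and learners values are written as JSON-style lists. As a rough sketch of how such a file can be read, the snippet below uses Python's standard configparser and json modules. It is an illustration only, not SKLL's own parsing code, and the "one run per feature set and learner pair" loop is an assumption about how the experiment expands.

```python
import configparser
import json

CONFIG_TEXT = """
[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
results = output
models = output
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# The list-valued fields are decoded as JSON.
featuresets = json.loads(config.get("Input", "featuresets"))
learners = json.loads(config.get("Input", "learners"))

# Assumption: one run per (feature set, learner) pair.
jobs = [(fs, learner) for fs in featuresets for learner in learners]
for fs, learner in jobs:
    print(learner, "on", ", ".join(fs))
```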

4. Run the configuration file with run_experiment

    $ run_experiment evaluate.cfg
    Loading train/family.csv...........done
    Loading train/misc.csv...........done
    Loading train/socioeconomic.csv...........done
    Loading train/vitals.csv...........done
    Loading dev/family.csv.....done
    Loading dev/misc.csv.....done
    Loading dev/socioeconomic.csv.....done
    Loading dev/vitals.csv.....done
    Loading train/family.csv...........done
    Loading train/misc.csv...........done
    Loading train/socioeconomic.csv...........done
    Loading train/vitals.csv...........done
    Loading dev/family.csv.....done
    ...


    Experiment Name: Titanic_Evaluate
    Training Set: train
    Test Set: dev
    Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
    Learner: RandomForestClassifier
    Task: evaluate

    +-------+------+------+-----------+--------+-----------+
    |       |  0.0 |  1.0 | Precision | Recall | F-measure |
    +-------+------+------+-----------+--------+-----------+
    | 0.000 | [97] |   18 |     0.874 |  0.843 |     0.858 |
    +-------+------+------+-----------+--------+-----------+
    | 1.000 |   14 | [50] |     0.735 |  0.781 |     0.758 |
    +-------+------+------+-----------+--------+-----------+
    (row = reference; column = predicted)

    Accuracy = 0.8212290502793296
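The precision, recall, F-measure, and accuracy figures in the output above follow directly from the confusion matrix counts. A short sketch of that arithmetic, using only the four counts shown (the helper name `per_class_metrics` is made up for this example):

```python
# Confusion matrix from the output above: rows = reference, columns = predicted
conf_matrix = [[97, 18],
               [14, 50]]
labels = [0.0, 1.0]


def per_class_metrics(matrix, index):
    """Return (precision, recall, F-measure) for the class at `index`."""
    true_pos = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column total
    reference = sum(matrix[index])                 # row total
    precision = true_pos / predicted
    recall = true_pos / reference
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


total = sum(sum(row) for row in conf_matrix)
accuracy = sum(conf_matrix[i][i] for i in range(len(conf_matrix))) / total

for i, label in enumerate(labels):
    p, r, f = per_class_metrics(conf_matrix, i)
    print(f"{label}: precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
print(f"accuracy={accuracy:.3f}")
```

For class 0.0, for example, precision is 97 / (97 + 14) and recall is 97 / (97 + 18), which round to the 0.874 and 0.843 reported in the table.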

5. Examine results

Aggregate Evaluation Results

    Dev. Accuracy    Learner
    0.821            RandomForestClassifier
    0.771            SVC
    0.709            MultinomialNB

Tuning learner

• Can we do better than default hyperparameters?

    [General]
    experiment_name = Titanic_Evaluate
    task = evaluate

    [Input]
    train_location = train
    test_location = dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output

Tuned Evaluation Results

    Untuned Accuracy    Tuned Accuracy    Learner
    0.821               0.849             RandomForestClassifier
    0.771               0.737             SVC
    0.709               0.709             MultinomialNB
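Grid search tries every combination of candidate hyperparameter values and keeps the one with the best objective score. A minimal sketch of that idea follows. The parameter names, the candidate values, and the `toy_objective` scoring function are all hypothetical stand-ins: a real run would train the learner for each combination and compute the chosen objective (accuracy in the tuning config).

```python
from itertools import product


def grid_search(param_grid, objective):
    """Return (best_params, best_score) over every combination in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Hypothetical stand-in for "train the learner, measure accuracy";
    # constructed so that it peaks at C=10, gamma=0.01.
    return 1.0 - abs(params["C"] - 10) / 100 - abs(params["gamma"] - 0.01)


param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(param_grid, toy_objective)
print(best_params, best_score)
```

Because the best setting is chosen on one set of data, it does not have to help on another, which is consistent with the SVC row in the table: its dev accuracy dropped after tuning.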

Using All Available Data

• Use training and dev to generate predictions on test

    [General]
    experiment_name = Titanic_Predict
    task = predict

    [Input]
    train_location = train+dev
    test_location = test
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output

Test Set Performance

    Untuned Accuracy    Tuned Accuracy    Untuned Accuracy    Tuned Accuracy    Learner
    (Train only)        (Train only)      (Train + Dev)       (Train + Dev)
    0.732               0.746             0.746               0.756             RandomForestClassifier
    0.608               0.617             0.612               0.641             SVC
    0.627               0.623             0.622               0.622             MultinomialNB

Advanced SKLL Features

• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data
• Feature scaling
• Python API
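An ablation experiment retrains with feature files left out to see how much each one contributes. As a sketch of the idea only, the snippet below enumerates the leave-one-out feature sets for the four Titanic feature files. How SKLL itself schedules ablation runs is not shown on the slides, so the helper name `leave_one_out_featuresets` is purely illustrative.

```python
from itertools import combinations

feature_files = ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]


def leave_one_out_featuresets(files):
    """Return every feature set formed by dropping exactly one file."""
    return [list(subset) for subset in combinations(files, len(files) - 1)]


ablated = leave_one_out_featuresets(feature_files)
for featureset in ablated:
    dropped = set(feature_files) - set(featureset)
    print("without", dropped.pop(), "->", featureset)
```

Each resulting list has the same shape as one entry of the featuresets field in the configuration files.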

Currently Supported Learners

    Classifiers                      Regressors
    Linear Support Vector Machine    Elastic Net
    Logistic Regression              Lasso
    Multinomial Naive Bayes          Linear

Decision Tree

Gradient Boosting

Random Forest

Support Vector Machine

Coming Soon

Classifiers Regressors

AdaBoost

K-Nearest Neighbors

Stochastic Gradient Descent

Acknowledgements

• Mike Heilman

• Nitin Madnani

• Aoife Cahill

References

• Dataset: kaggle.com/c/titanic-gettingStarted

• SKLL GitHub: github.com/EducationalTestingService/skll

• SKLL Docs: skll.readthedocs.org

• Titanic configs and data splitting script in examples dir on GitHub

@Dan_S_Blanchard

dan-blanchard

Bonus Slides

Cross-validation

    [General]
    experiment_name = Titanic_CV
    task = cross_validate

    [Input]
    train_location = train+dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output

Cross-validation Results

    Avg. CV Accuracy    Learner
    0.815               RandomForestClassifier
    0.717               SVC
    0.681               MultinomialNB
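Cross-validation splits the data into k folds, holds each fold out once for testing, and averages the per-fold scores. The sketch below shows only the fold bookkeeping, with contiguous folds and a made-up helper name `k_fold_indices`. SKLL's own fold assignment (for example any shuffling or stratification) may well differ.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first few folds.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_examples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_examples=25, k=10))
for train, test in folds:
    print(len(train), len(test))
```

Every example lands in exactly one test fold, so each prediction that feeds the average is made by a model that never saw that example during training.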

SKLL API

    from skll import Learner, load_examples

    # Load training examples
    train_examples = load_examples('myexamples.megam')

    # Train a linear SVM
    learner = Learner('LinearSVC')
    learner.train(train_examples)

    # Load test examples and evaluate
    test_examples = load_examples('test.tsv')
    (conf_matrix, accuracy, prf_dict, model_params,
     obj_score) = learner.evaluate(test_examples)
    # conf_matrix:  confusion matrix
    # prf_dict:     precision, recall, f-score for each class
    # model_params: tuned model parameters
    # obj_score:    objective function score on test set

    # Generate predictions from trained model
    predictions = learner.predict(test_examples)

    # Perform 10-fold cross-validation with a radial SVM
    learner = Learner('SVC')
    (fold_result_list, grid_search_scores) = learner.cross_validate(train_examples)
    # fold_result_list:   per-fold evaluation results
    # grid_search_scores: per-fold training set obj. scores

    import numpy as np
    import os
    from skll import write_feature_file

    # Assumed values: neither name is defined on the slide
    num_train_examples = 100
    _my_dir = os.getcwd()

    # Create some training examples
    classes = []
    ids = []
    features = []
    for i in range(num_train_examples):
        y = "dog" if i % 2 == 0 else "cat"
        ex_id = "{}{}".format(y, i)
        x = {"f1": np.random.randint(1, 4),
             "f2": np.random.randint(1, 4),
             "f3": np.random.randint(1, 4)}
        classes.append(y)
        ids.append(ex_id)
        features.append(x)

    # Write them to a file
    train_path = os.path.join(_my_dir, 'train', 'test_summary.jsonlines')
    write_feature_file(train_path, ids, classes, features)
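To give a feel for what a .jsonlines feature file might hold, the sketch below builds a few dog/cat examples like the ones above and serializes them with the standard json module, one JSON object per line. The record layout used here, with "id", "y" for the label, and "x" for the feature dictionary, is an assumption for illustration and should be checked against the SKLL documentation. The snippet does not call SKLL at all.

```python
import io
import json
import random

# ASSUMPTION: one JSON object per line with "id", "y" (label), "x" (features).
rng = random.Random(0)
buffer = io.StringIO()
for i in range(4):
    label = "dog" if i % 2 == 0 else "cat"
    record = {
        "id": "{}{}".format(label, i),
        "y": label,
        "x": {"f1": rng.randint(1, 3),
              "f2": rng.randint(1, 3),
              "f3": rng.randint(1, 3)},
    }
    buffer.write(json.dumps(record) + "\n")

# Read the records back, one JSON object per line.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records[0]["id"], records[0]["y"], sorted(records[0]["x"]))
```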
