
Page 1: Simpler Machine Learning with SKLL

Simpler Machine Learning with SKLL

Dan Blanchard Educational Testing Service

[email protected]

PyData NYC 2013

Page 2: Simpler Machine Learning with SKLL
Page 3: Simpler Machine Learning with SKLL
Page 4: Simpler Machine Learning with SKLL
Page 5: Simpler Machine Learning with SKLL

Survived Perished

Page 6: Simpler Machine Learning with SKLL

Survived Perished

first class, female, 1 sibling, 35 years old

Page 7: Simpler Machine Learning with SKLL

Survived Perished

first class, female, 1 sibling, 35 years old
third class, female, 2 siblings, 18 years old

Page 8: Simpler Machine Learning with SKLL

Survived Perished

first class, female, 1 sibling, 35 years old
third class, female, 2 siblings, 18 years old
second class, male, 0 siblings, 50 years old

Page 9: Simpler Machine Learning with SKLL

Survived Perished

first class, female, 1 sibling, 35 years old
third class, female, 2 siblings, 18 years old
second class, male, 0 siblings, 50 years old

Can we predict survival from data?
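In machine-learning terms, each passenger becomes a set of features plus a label, and a classifier tries to learn the mapping from one to the other. A minimal sketch of that representation in Python (the feature names here are illustrative, not SKLL's actual column names):

# One training example: features describing a passenger, plus the outcome to predict
passenger = {"pclass": "first", "sex": "female", "siblings": 1, "age": 35}
label = "survived"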

Page 10: Simpler Machine Learning with SKLL

SciKit-Learn Laboratory

Page 11: Simpler Machine Learning with SKLL

SKLL

Page 12: Simpler Machine Learning with SKLL

SKLL

Page 13: Simpler Machine Learning with SKLL

SKLL

It's where the learning happens.

Page 14: Simpler Machine Learning with SKLL

Learning to Predict Survival
1. Split up given training set: train (80%) and dev (20%)

Page 15: Simpler Machine Learning with SKLL

Learning to Predict Survival
1. Split up given training set: train (80%) and dev (20%)

$ ./make_titanic_example_data.py
Creating titanic/train directory
Creating titanic/dev directory
Creating titanic/test directory
Loading train.csv............done
Loading test.csv........done
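make_titanic_example_data.py ships with SKLL's examples; conceptually, the 80/20 split it performs looks roughly like the sketch below (illustrative pandas code, not the actual script, which also splits the columns into the four feature files used later):

import pandas as pd

# Hold out a random 20% of the Kaggle training data as a dev set
full_train = pd.read_csv("train.csv")
dev = full_train.sample(frac=0.2, random_state=42)
train = full_train.drop(dev.index)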

Page 16: Simpler Machine Learning with SKLL

Learning to Predict Survival
2. Pick classifiers to try (see the sketch after this list):

1. Random forest

2. Support Vector Machine (SVM)

3. Naive Bayes
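These are the SKLL learner names; under the hood they correspond to scikit-learn estimators. A minimal illustrative sketch using scikit-learn directly (not SKLL's own code):

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# The SKLL learner names above map onto these scikit-learn classes
candidate_learners = {
    "RandomForestClassifier": RandomForestClassifier(),
    "SVC": SVC(),
    "MultinomialNB": MultinomialNB(),
}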

Page 17: Simpler Machine Learning with SKLL

Learning to Predict Survival
3. Create configuration file for SKLL

Page 18: Simpler Machine Learning with SKLL

Learning to Predict Survival
3. Create configuration file for SKLL

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
results = output
models = output
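The config is plain INI syntax, so it can also be generated programmatically. A minimal sketch using Python's configparser (the file name and the idea of generating it this way are my assumption; the contents mirror the config above):

import configparser

config = configparser.ConfigParser()
config["General"] = {"experiment_name": "Titanic_Evaluate", "task": "evaluate"}
config["Input"] = {
    "train_location": "train",
    "test_location": "dev",
    "featuresets": '[["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]',
    "learners": '["RandomForestClassifier", "SVC", "MultinomialNB"]',
    "label_col": "Survived",
}
config["Output"] = {"results": "output", "models": "output"}

# Write the experiment configuration to disk for run_experiment
with open("evaluate.cfg", "w") as cfg_file:
    config.write(cfg_file)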

Page 19: Simpler Machine Learning with SKLL

train_location = train → directory with feature files for training the learner

Page 20: Simpler Machine Learning with SKLL

test_location = dev → directory with feature files for evaluating performance

Page 21: Simpler Machine Learning with SKLL


Page 22: Simpler Machine Learning with SKLL

family.csv → # of siblings, spouses, parents, children

Page 23: Simpler Machine Learning with SKLL

misc.csv → departure port

Page 24: Simpler Machine Learning with SKLL

socioeconomic.csv → fare & passenger class

Page 25: Simpler Machine Learning with SKLL

vitals.csv → sex & age

Page 26: Simpler Machine Learning with SKLL


Page 27: Simpler Machine Learning with SKLL

results = output → directory to store evaluation results

Page 28: Simpler Machine Learning with SKLL

models = output → directory to store trained models

Page 29: Simpler Machine Learning with SKLL


Page 30: Simpler Machine Learning with SKLL

Learning to Predict Survival
4. Run the configuration file with run_experiment

$ run_experiment evaluate.cfg
Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
Loading dev/misc.csv.....done
Loading dev/socioeconomic.csv.....done
Loading dev/vitals.csv.....done
Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
...
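run_experiment is the command-line entry point installed with SKLL; if you prefer to launch it from Python rather than a shell, a minimal sketch (assuming evaluate.cfg is the config file created above):

import subprocess

# Shell out to SKLL's run_experiment script with our config file
subprocess.check_call(["run_experiment", "evaluate.cfg"])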

Page 31: Simpler Machine Learning with SKLL

Learning to Predict Survival
5. Examine results

Experiment Name: Titanic_Evaluate
Training Set: train
Test Set: dev
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Task: evaluate

+-------+------+------+-----------+--------+-----------+
|       | 0.0  | 1.0  | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [97] |  18  |   0.874   | 0.843  |   0.858   |
+-------+------+------+-----------+--------+-----------+
| 1.000 |  14  | [50] |   0.735   | 0.781  |   0.758   |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)

Accuracy = 0.8212290502793296
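The precision, recall, and F-measure columns follow directly from the confusion-matrix counts; a quick sanity check of the numbers above (row = reference, column = predicted):

# Counts from the RandomForestClassifier confusion matrix above
correct_0, wrong_0 = 97.0, 18.0   # reference class 0.0 predicted as 0.0 / 1.0
wrong_1, correct_1 = 14.0, 50.0   # reference class 1.0 predicted as 0.0 / 1.0

precision_0 = correct_0 / (correct_0 + wrong_1)                       # 97 / 111 ≈ 0.874
recall_0 = correct_0 / (correct_0 + wrong_0)                          # 97 / 115 ≈ 0.843
f_measure_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)   # ≈ 0.858
accuracy = (correct_0 + correct_1) / (correct_0 + wrong_0 + wrong_1 + correct_1)  # 147 / 179 ≈ 0.821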

Page 32: Simpler Machine Learning with SKLL

Aggregate Evaluation Results

Learner                   Dev. Accuracy
RandomForestClassifier    0.821
SVC                       0.771
MultinomialNB             0.709

Page 33: Simpler Machine Learning with SKLL

Tuning the Learner
• Can we do better than the default hyperparameters?

Page 34: Simpler Machine Learning with SKLL

Tuning the Learner
• Can we do better than the default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
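grid_search = true tells SKLL to tune each learner's hyperparameters with scikit-learn grid search, optimizing the chosen objective. Conceptually it is doing something like the sketch below (scikit-learn used directly, with a made-up parameter grid; SKLL ships its own default grids per learner):

from sklearn.model_selection import GridSearchCV   # sklearn.grid_search in 2013-era scikit-learn
from sklearn.svm import SVC

# A hypothetical grid over the SVC regularization parameter
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=3)
# search.fit(X_train, y_train) would then pick the C with the best
# cross-validated accuracy before evaluating on the dev set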

Page 35: Simpler Machine Learning with SKLL

Tuned Evaluation Results

Learner                   Untuned Accuracy    Tuned Accuracy
RandomForestClassifier    0.821               0.849
SVC                       0.771               0.737
MultinomialNB             0.709               0.709

Page 36: Simpler Machine Learning with SKLL

Tuned Evaluation Results

Learner                   Untuned Accuracy    Tuned Accuracy
RandomForestClassifier    0.821               0.849
SVC                       0.771               0.737
MultinomialNB             0.709               0.709

Page 37: Simpler Machine Learning with SKLL

Using All Available Data

Page 38: Simpler Machine Learning with SKLL

Using All Available Data
• Use training and dev to generate predictions on test

Page 39: Simpler Machine Learning with SKLL

Using All Available Data
• Use training and dev to generate predictions on test

[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_location = train+dev
test_location = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output

Page 40: Simpler Machine Learning with SKLL

Test Set Performance

Learner                   Untuned Accuracy   Tuned Accuracy   Untuned Accuracy   Tuned Accuracy
                          (Train only)       (Train only)     (Train + Dev)      (Train + Dev)
RandomForestClassifier    0.732              0.746            0.746              0.756
SVC                       0.608              0.617            0.612              0.641
MultinomialNB             0.627              0.623            0.622              0.622

Page 41: Simpler Machine Learning with SKLL

Advanced SKLL Features

Page 42: Simpler Machine Learning with SKLL

Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data

Page 43: Simpler Machine Learning with SKLL

Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors

Page 44: Simpler Machine Learning with SKLL

Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters

Page 45: Simpler Machine Learning with SKLL

Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments

Page 46: Simpler Machine Learning with SKLL

Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file

Page 47: Simpler Machine Learning with SKLL

Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data

Page 48: Simpler Machine Learning with SKLL

Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data
• Feature scaling

Page 49: Simpler Machine Learning with SKLL

Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data (see the sketch after this list)
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data
• Feature scaling
• Python API
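For example, the .jsonlines/.ndj format is simply one JSON object per line; a minimal sketch of writing such a file by hand (the "id"/"y"/"x" key names are my reading of the SKLL docs of the time, not confirmed by this talk — check the documentation for the exact schema):

import json

# Two toy examples: an id, a label ("y"), and a feature dict ("x") per line
examples = [
    {"id": "EXAMPLE_1", "y": "dog", "x": {"f1": 1, "f2": 3}},
    {"id": "EXAMPLE_2", "y": "cat", "x": {"f1": 2, "f2": 1}},
]
with open("toy.jsonlines", "w") as out_file:
    for example in examples:
        out_file.write(json.dumps(example) + "\n")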

Page 50: Simpler Machine Learning with SKLL

Currently Supported Learners

Classifiers                      Regressors
Linear Support Vector Machine    Elastic Net
Logistic Regression              Lasso
Multinomial Naive Bayes          Linear
Decision Tree
Gradient Boosting
Random Forest
Support Vector Machine

Page 51: Simpler Machine Learning with SKLL

Coming Soon

Classifiers Regressors

AdaBoost

K-Nearest Neighbors

Stochastic Gradient Descent

Page 52: Simpler Machine Learning with SKLL

Acknowledgements
• Mike Heilman

• Nitin Madnani

• Aoife Cahill

Page 53: Simpler Machine Learning with SKLL

References
• Dataset: kaggle.com/c/titanic-gettingStarted

• SKLL GitHub: github.com/EducationalTestingService/skll

• SKLL Docs: skll.readthedocs.org

• Titanic configs and data splitting script in examples dir on GitHub

@Dan_S_Blanchard

dan-blanchard

Page 54: Simpler Machine Learning with SKLL

Bonus Slides

Page 55: Simpler Machine Learning with SKLL

Cross-validation

[General]
experiment_name = Titanic_CV
task = cross_validate

[Input]
train_location = train+dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output

Page 56: Simpler Machine Learning with SKLL

Cross-validation Results

Learner                   Avg. CV Accuracy
RandomForestClassifier    0.815
SVC                       0.717
MultinomialNB             0.681

Page 57: Simpler Machine Learning with SKLL

SKLL API

Page 58: Simpler Machine Learning with SKLL

SKLL API

from skll import Learner, load_examples

Page 59: Simpler Machine Learning with SKLL

SKLL API

from skll import Learner, load_examples

# Load training examples
train_examples = load_examples('myexamples.megam')

Page 60: Simpler Machine Learning with SKLL

SKLL API

from skll import Learner, load_examples

# Load training examples
train_examples = load_examples('myexamples.megam')

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

Page 61: Simpler Machine Learning with SKLL

SKLL API

from skll import Learner, load_examples

# Load training examples
train_examples = load_examples('myexamples.megam')

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = load_examples('test.tsv')
(conf_matrix, accuracy, prf_dict, model_params, obj_score) = learner.evaluate(test_examples)

Page 62: Simpler Machine Learning with SKLL

conf_matrix → confusion matrix

Page 63: Simpler Machine Learning with SKLL


Page 64: Simpler Machine Learning with SKLL

prf_dict → precision, recall, f-score for each class

Page 65: Simpler Machine Learning with SKLL

model_params → tuned model parameters

Page 66: Simpler Machine Learning with SKLL

obj_score → objective function score on test set

Page 67: Simpler Machine Learning with SKLL


Page 68: Simpler Machine Learning with SKLL

# Generate predictions from trained model
predictions = learner.predict(test_examples)

Page 69: Simpler Machine Learning with SKLL

# Perform 10-fold cross-validation with a radial SVM
learner = Learner('SVC')
(fold_result_list, grid_search_scores) = learner.cross_validate(train_examples)

Page 70: Simpler Machine Learning with SKLL

fold_result_list → per-fold evaluation results

Page 71: Simpler Machine Learning with SKLL

grid_search_scores → per-fold training set objective scores

Page 72: Simpler Machine Learning with SKLL

SKLL API

import os

import numpy as np

from skll import write_feature_file

num_train_examples = 100  # example value; defined elsewhere in the original snippet
_my_dir = '.'             # example value; the original uses the test module's directory
os.makedirs(os.path.join(_my_dir, 'train'), exist_ok=True)  # added so the snippet runs standalone

# Create some training examples
classes = []
ids = []
features = []
for i in range(num_train_examples):
    y = "dog" if i % 2 == 0 else "cat"
    ex_id = "{}{}".format(y, i)
    x = {"f1": np.random.randint(1, 4),
         "f2": np.random.randint(1, 4),
         "f3": np.random.randint(1, 4)}
    classes.append(y)
    ids.append(ex_id)
    features.append(x)

# Write them to a file
train_path = os.path.join(_my_dir, 'train', 'test_summary.jsonlines')
write_feature_file(train_path, ids, classes, features)
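To close the loop, the file written above can be read back with the same load_examples call shown earlier and fed to a learner (a sketch; it assumes the snippet above has already run, so train_path exists):

from skll import Learner, load_examples

# Load the generated feature file and train a classifier on it
train_examples = load_examples(train_path)
learner = Learner('MultinomialNB')
learner.train(train_examples)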