Simpler Machine Learning with SKLL
Dan Blanchard Educational Testing Service
PyData NYC 2013
Survived / Perished
• first class, female, 1 sibling, 35 years old
• third class, female, 2 siblings, 18 years old
• second class, male, 0 siblings, 50 years old
Can we predict survival from data?
SciKit-Learn Laboratory
SKLL
It's where the learning happens.
Learning to Predict Survival
1. Split up given training set: train (80%) and dev (20%)

    $ ./make_titanic_example_data.py

    Creating titanic/train directory
    Creating titanic/dev directory
    Creating titanic/test directory
    Loading train.csv............done
    Loading test.csv........done
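The 80/20 split in step 1 can be sketched in plain Python. This is not the make_titanic_example_data.py script itself: the helper name split_train_dev, the fixed seed, and the toy rows are assumptions made for illustration only.

```python
import random


def split_train_dev(rows, train_fraction=0.8, seed=0):
    """Shuffle rows and split them into (train, dev) lists.

    Hypothetical helper: the name, defaults, and seed are illustrative
    assumptions, not taken from SKLL or the example script.
    """
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the rows of train.csv
rows = [{"PassengerId": i} for i in range(1, 11)]
train, dev = split_train_dev(rows)
print(len(train), len(dev))
```

Shuffling before cutting keeps the dev set from simply being the last rows of the file, which matters if the file happens to be sorted.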
2. Pick classifiers to try:
    1. Random forest
    2. Support Vector Machine (SVM)
    3. Naive Bayes
3. Create configuration file for SKLL

    [General]
    experiment_name = Titanic_Evaluate
    task = evaluate

    [Input]
    train_location = train
    test_location = dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Output]
    results = output
    models = output

• train_location: directory with feature files for training learner
• test_location: directory with feature files for evaluating performance
• family.csv: # of siblings, spouses, parents, children
• misc.csv: departure port
• socioeconomic.csv: fare & passenger class
• vitals.csv: sex & age
• results: directory to store evaluation results
• models: directory to store trained models
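To make the structure of the configuration file concrete, here is a minimal sketch that reads the same text with Python's standard configparser and json modules. It only illustrates the layout (sections, keys, JSON-style list values); how SKLL itself parses the file is not shown in the talk, so treat the job expansion at the end as an assumption that matches the one-result-per-learner output shown in the talk.

```python
import configparser
import json

CONFIG_TEXT = """
[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
results = output
models = output
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# The list-valued fields are valid JSON, so json.loads can decode them.
featuresets = json.loads(parser["Input"]["featuresets"])
learners = json.loads(parser["Input"]["learners"])

# Assumption: one job per (feature set, learner) pair.
jobs = [(fs, learner) for fs in featuresets for learner in learners]
for fs, learner in jobs:
    print(learner, "on", "+".join(fs))
```

With one feature set and three learners this yields three jobs; adding a second inner list to featuresets would double that.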
4. Run the configuration file with run_experiment

    $ run_experiment evaluate.cfg

    Loading train/family.csv...........done
    Loading train/misc.csv...........done
    Loading train/socioeconomic.csv...........done
    Loading train/vitals.csv...........done
    Loading dev/family.csv.....done
    Loading dev/misc.csv.....done
    Loading dev/socioeconomic.csv.....done
    Loading dev/vitals.csv.....done
    Loading train/family.csv...........done
    Loading train/misc.csv...........done
    Loading train/socioeconomic.csv...........done
    Loading train/vitals.csv...........done
    Loading dev/family.csv.....done
    ...
5. Examine results

    Experiment Name: Titanic_Evaluate
    Training Set: train
    Test Set: dev
    Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
    Learner: RandomForestClassifier
    Task: evaluate

    +-------+------+------+-----------+--------+-----------+
    |       |  0.0 |  1.0 | Precision | Recall | F-measure |
    +-------+------+------+-----------+--------+-----------+
    | 0.000 | [97] |   18 |     0.874 |  0.843 |     0.858 |
    +-------+------+------+-----------+--------+-----------+
    | 1.000 |   14 | [50] |     0.735 |  0.781 |     0.758 |
    +-------+------+------+-----------+--------+-----------+
    (row = reference; column = predicted)
    Accuracy = 0.8212290502793296
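The precision, recall, F-measure, and accuracy figures in the RandomForestClassifier report follow directly from its confusion matrix. The sketch below recomputes them in plain Python; the function name per_class_metrics is an illustrative assumption, not part of SKLL.

```python
# Confusion matrix from the report: rows = reference, columns = predicted
conf_matrix = [[97, 18],
               [14, 50]]
labels = [0.0, 1.0]


def per_class_metrics(matrix, index):
    """Return (precision, recall, F-measure) for the class at `index`."""
    true_pos = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column total
    actual = sum(matrix[index])                    # row total
    precision = true_pos / predicted
    recall = true_pos / actual
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


metrics = {label: per_class_metrics(conf_matrix, i)
           for i, label in enumerate(labels)}
correct = sum(conf_matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in conf_matrix)
accuracy = correct / total

for label, (p, r, f) in metrics.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
print(f"accuracy={accuracy:.3f}")
```

For class 0.0, precision is 97 / (97 + 14) and recall is 97 / (97 + 18), which round to the 0.874 and 0.843 shown in the table.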
Aggregate Evaluation Results

    Dev. Accuracy   Learner
    0.821           RandomForestClassifier
    0.771           SVC
    0.709           MultinomialNB
Tuning learner
• Can we do better than default hyperparameters?

    [General]
    experiment_name = Titanic_Evaluate
    task = evaluate

    [Input]
    train_location = train
    test_location = dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output
Tuned Evaluation Results

    Untuned Accuracy   Tuned Accuracy   Learner
    0.821              0.849            RandomForestClassifier
    0.771              0.737            SVC
    0.709              0.709            MultinomialNB
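The idea behind grid_search = true can be sketched as an exhaustive search over a parameter grid that keeps the combination with the best objective score. Everything here is hypothetical: grid_search, toy_objective, and the grid values are stand-ins, and the toy objective replaces the real step of training a learner and scoring it. Because tuning optimizes the objective on training data, the dev score is not guaranteed to improve, which is consistent with the SVC row in the table.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Made-up stand-in for 'train with params, return the objective score'."""
    return 1.0 - abs(params["C"] - 1.0) / 100 - abs(params["gamma"] - 0.1)


# Illustrative grid only; not the grid SKLL uses for any learner
grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_objective)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (nine evaluations here), which is one reason to keep per-learner grids small.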
Using All Available Data
• Use training and dev to generate predictions on test

    [General]
    experiment_name = Titanic_Predict
    task = predict

    [Input]
    train_location = train+dev
    test_location = test
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output
Test Set Performance

    Untuned Accuracy   Tuned Accuracy   Untuned Accuracy   Tuned Accuracy   Learner
    (Train only)       (Train only)     (Train + Dev)      (Train + Dev)
    0.732              0.746            0.746              0.756            RandomForestClassifier
    0.608              0.617            0.612              0.641            SVC
    0.627              0.623            0.622              0.622            MultinomialNB
Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data
• Feature scaling
• Python API
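As an illustration of what an ablation experiment involves, the sketch below builds the feature sets one would compare: the full set, plus every set with one feature file left out. The helper ablation_featuresets is a hypothetical name used for illustration; it is not SKLL's implementation, and the file names are simply the four from the Titanic example.

```python
from itertools import combinations


def ablation_featuresets(feature_files, num_removed=1):
    """Return the full set followed by each set with `num_removed` files left out."""
    kept = len(feature_files) - num_removed
    ablated = [list(combo) for combo in combinations(feature_files, kept)]
    return [list(feature_files)] + ablated


files = ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
for featureset in ablation_featuresets(files):
    missing = sorted(set(files) - set(featureset))
    print("without", missing or "nothing", "->", featureset)
```

Comparing each ablated score against the full-set score shows how much each feature file contributes.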
Currently Supported Learners
Classifiers Regressors
Linear Support Vector Machine Elastic Net
Logistic Regression Lasso
Multinomial Naive Bayes Linear
Decision Tree
Gradient Boosting
Random Forest
Support Vector Machine
Coming Soon
Classifiers Regressors
AdaBoost
K-Nearest Neighbors
Stochastic Gradient Descent
Acknowledgements
• Mike Heilman
• Nitin Madnani
• Aoife Cahill
References
• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and data splitting script in examples dir on GitHub
@Dan_S_Blanchard
dan-blanchard
Bonus Slides
Cross-validation

    [General]
    experiment_name = Titanic_CV
    task = cross_validate

    [Input]
    train_location = train+dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output
Cross-validation Results

    Avg. CV Accuracy   Learner
    0.815              RandomForestClassifier
    0.717              SVC
    0.681              MultinomialNB
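The averaged accuracies above come from repeatedly holding out one fold for testing and training on the rest. A minimal sketch of how 10 such folds could be laid out is given below. The function k_fold_indices is a hypothetical helper, and SKLL's actual fold assignment may differ (for example, it may shuffle or stratify by class).

```python
def k_fold_indices(num_examples, num_folds=10):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(num_examples))
    fold_sizes = [num_examples // num_folds] * num_folds
    for i in range(num_examples % num_folds):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(23, num_folds=10))
print([len(test) for _, test in folds])
```

Each example lands in exactly one test fold, so averaging the per-fold scores uses every example for evaluation once.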
SKLL API

    from skll import Learner, load_examples

    # Load training examples
    train_examples = load_examples('myexamples.megam')

    # Train a linear SVM
    learner = Learner('LinearSVC')
    learner.train(train_examples)

    # Load test examples and evaluate
    test_examples = load_examples('test.tsv')
    (conf_matrix, accuracy, prf_dict, model_params,
     obj_score) = learner.evaluate(test_examples)

    # Generate predictions from trained model
    predictions = learner.predict(test_examples)

    # Perform 10-fold cross-validation with a radial SVM
    learner = Learner('SVC')
    (fold_result_list,
     grid_search_scores) = learner.cross_validate(train_examples)

Return values:
• conf_matrix: confusion matrix
• prf_dict: precision, recall, f-score for each class
• model_params: tuned model parameters
• obj_score: objective function score on test set
• fold_result_list: per-fold evaluation results
• grid_search_scores: per-fold training set obj. scores
SKLL API

    import os

    import numpy as np
    from skll import write_feature_file

    # Assumed values: the slide does not define these two names
    num_train_examples = 100
    _my_dir = os.path.abspath(os.path.dirname(__file__))

    # Create some training examples
    classes = []
    ids = []
    features = []
    for i in range(num_train_examples):
        y = "dog" if i % 2 == 0 else "cat"
        ex_id = "{}{}".format(y, i)
        x = {"f1": np.random.randint(1, 4),
             "f2": np.random.randint(1, 4),
             "f3": np.random.randint(1, 4)}
        classes.append(y)
        ids.append(ex_id)
        features.append(x)

    # Write them to a file
    train_path = os.path.join(_my_dir, 'train', 'test_summary.jsonlines')
    write_feature_file(train_path, ids, classes, features)
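For readers unfamiliar with the .jsonlines extension, the sketch below shows the general idea of the format: one JSON object per line. It uses only the standard library and an in-memory buffer. The key names id, y, and x are an assumption about the layout and should be checked against the SKLL documentation; the values are made-up dog/cat examples in the spirit of the code above.

```python
import io
import json

# Assumed layout: "id" = example ID, "y" = class label, "x" = feature dict
examples = [
    {"id": "dog0", "y": "dog", "x": {"f1": 1, "f2": 3, "f3": 2}},
    {"id": "cat1", "y": "cat", "x": {"f1": 2, "f2": 1, "f3": 3}},
]

# Write one JSON object per line
buffer = io.StringIO()
for example in examples:
    buffer.write(json.dumps(example, sort_keys=True) + "\n")

# Read the lines back, skipping any blank ones
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded[0]["id"], loaded[0]["y"], loaded[0]["x"])
```

Because each line is independent, such files can be streamed or concatenated without parsing the whole file at once.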