Bayesian Knowledge Tracing and Other Predictive Models in Educational Data Mining
Zachary A. Pardos
PSLC Summer School 2011
Bayesian Knowledge Tracing & Other Models, PSLC Summer School 2011, Zach Pardos
Outline of Talk
• Introduction to Knowledge Tracing
– History
– Intuition
– Model
– Demo
– Variations (and other models)
– Evaluations (Baker work / KDD)
• Random Forests
– Description
– Evaluations (KDD)
• Time left?
– Vote on next topic
Intro to Knowledge Tracing
History
• Introduced in 1995 (Corbett & Anderson, UMUAI)
• Based on the ACT-R theory of skill knowledge (Anderson, 1993)
• Computations based on a variation of Bayesian calculations proposed in 1972 (Atkinson)
Intuition
• Based on the idea that practice on a skill leads to mastery of that skill
• Has four parameters used to describe student performance
• Relies on a KC model
• Tracks student knowledge over time
For some Skill K: given student Y's chronological response sequence 1 to n, predict response n+1
[0 = incorrect response, 1 = correct response]

Responses 1 … n: 0 0 0 1 1 1   Response n+1: ?
0 0 0 1 1 1 1
Track knowledge over time (model of learning)
Knowledge Tracing (KT) can be represented as a simple HMM

[Diagram: a chain of latent knowledge nodes K, each emitting an observed question node Q; P(L0) is the prior on the first K, transitions between K nodes are labeled P(T), and emissions are governed by P(G) and P(S)]

Node representations
K = Knowledge node (latent)
Q = Question node (observed)

Node states
K = two state (0 or 1)
Q = two state (0 or 1)

Model parameters
P(L0) = Probability of initial knowledge
P(T) = Probability of learning
P(G) = Probability of guess
P(S) = Probability of slip
Four parameters of the KT model:
P(L0) = Probability of initial knowledge
P(T) = Probability of learning
P(G) = Probability of guess
P(S) = Probability of slip

The probability of forgetting is assumed to be zero (fixed).
Formulas for inference and prediction
• Derivation (Reye, JAIED 2004)
• The formulas use Bayes' theorem to make inferences about the latent variable

If the response was correct:
P(Ln | correct) = P(Ln)(1 - P(S)) / [P(Ln)(1 - P(S)) + (1 - P(Ln))P(G)]   (1)

If the response was incorrect:
P(Ln | incorrect) = P(Ln)P(S) / [P(Ln)P(S) + (1 - P(Ln))(1 - P(G))]   (2)

Learning transition:
P(Ln+1) = P(Ln | evidence) + (1 - P(Ln | evidence)) P(T)   (3)
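The three update formulas can be sketched directly in Python (an illustrative implementation, not the talk's original MATLAB code):

```python
def posterior_knowledge(p_l, correct, p_g, p_s):
    """P(Ln | evidence): Bayes update of the knowledge estimate
    after observing a correct (1) or incorrect (0) response."""
    if correct:
        num = p_l * (1 - p_s)                 # knew it and did not slip
        den = num + (1 - p_l) * p_g           # or guessed without knowing
    else:
        num = p_l * p_s                       # knew it but slipped
        den = num + (1 - p_l) * (1 - p_g)     # or simply did not know
    return num / den

def advance_knowledge(p_l_given_obs, p_t):
    """P(Ln+1): apply the learning transition (no forgetting)."""
    return p_l_given_obs + (1 - p_l_given_obs) * p_t

def predict_correct(p_l, p_g, p_s):
    """Predicted P(correct) at the next opportunity."""
    return p_l * (1 - p_s) + (1 - p_l) * p_g
```

For example, with P(L) = 0.5, P(G) = 0.14, P(S) = 0.09, an incorrect response drops the estimate to 0.045 / 0.475 ≈ 0.095 before the learning transition is applied.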
Model Training Step
• Values of the parameters P(T), P(G), P(S) & P(L0) are used to predict student responses
• Ad-hoc values could be used but will likely not be the best fitting
• Goal: find a set of values for the parameters that minimizes prediction error

[Example: chronological response sequences for Students A, B, and C used to fit the parameters]
Model Tracing Step – Skill: Subtraction
Student's last three responses to Subtraction questions (in the Unit): 0 1 1

[Diagram: latent knowledge estimates P(K) across the three responses and the test set questions: 10%, 45%, 75%, 79%, 83%; predicted observable responses P(Q) for the test set questions: 71%, 74%]
Model Prediction: influence of parameter values

Estimate of knowledge for a student with response sequence: 0 1 1 1 1 1 1 1 1 1
P(L0): 0.50  P(T): 0.20  P(G): 0.14  P(S): 0.09
The student reached 95% probability of knowledge after the 4th opportunity
Estimate of knowledge for a student with the same response sequence: 0 1 1 1 1 1 1 1 1 1
P(L0): 0.50  P(T): 0.20  P(G): 0.64  P(S): 0.03
The student reached 95% probability of knowledge after the 8th opportunity
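The two parameter settings above can be replayed with a short script (my own illustrative code; the mastery criterion used here, the knowledge estimate entering each opportunity, reproduces the 4th/8th opportunity figures):

```python
def opportunities_to_mastery(responses, p_l0, p_t, p_g, p_s, threshold=0.95):
    """Return the 1-based opportunity at which the knowledge estimate
    (before seeing that opportunity's response) first reaches threshold."""
    p_l = p_l0
    for n, r in enumerate(responses, start=1):
        if p_l >= threshold:
            return n                       # estimate entering opportunity n
        if r:                              # Bayes update on the response
            post = p_l * (1 - p_s) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
        else:
            post = p_l * p_s / (p_l * p_s + (1 - p_l) * (1 - p_g))
        p_l = post + (1 - post) * p_t      # learning transition
    return None

seq = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(opportunities_to_mastery(seq, 0.50, 0.20, 0.14, 0.09))  # 4
print(opportunities_to_mastery(seq, 0.50, 0.20, 0.64, 0.03))  # 8
```

With the high guess rate (0.64), each correct answer is weaker evidence of knowledge, so mastery takes twice as many opportunities.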
( Demo )
Variations on Knowledge Tracing (and other models)
Prior Individualization Approach
Do all students enter a lesson with the same background knowledge?
Knowledge Tracing with Individualized P(L0)

[Diagram: an observed multi-state student node S sets the prior on the first knowledge node via P(L0|S); the rest of the chain is standard KT with P(T), P(G), P(S)]

Node representations
K = Knowledge node
Q = Question node
S = Student node (observed)

Node states
K = two state (0 or 1)
Q = two state (0 or 1)
S = multi state (1 to N)
Conditional Probability Table of the Student node and the Individualized Prior node

CPT of the Student node:

S value  P(S=value)
1        1/N
2        1/N
3        1/N
…        …
N        1/N

• The CPT of the observed student node is fixed
• It is possible to have an S value for every student ID
• This raises an initialization issue (where do these prior values come from?)
• The S value can represent a cluster or type of student instead of an ID
Prior Individualization Approach

CPT of the Individualized Prior node:

S value  P(L0|S)
1        0.05
2        0.30
3        0.95
…        …
N        0.92

• Individualized L0 values need to be seeded
• This CPT can be fixed or the values can be learned
• Fixing this CPT and seeding it with values based on a student's first response can be an effective strategy

This model, which individualizes only L0, is the Prior Per Student (PPS) model.
Prior Individualization Approach

Bootstrapping the prior:

S value  P(L0|S)
0        0.05
1        0.30

• If a student answers incorrectly on the first question, she gets a low prior
• If a student answers correctly on the first question, she gets a higher prior
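The bootstrapping idea reduces to a two-row lookup keyed on the first response. A minimal sketch (the 0.05 / 0.30 seeds are the ad-hoc values from the slide):

```python
# Two-row CPT: individualized prior keyed on the first response.
PRIOR_CPT = {0: 0.05,   # first response incorrect -> low prior
             1: 0.30}   # first response correct   -> higher prior

def individualized_prior(first_response):
    """P(L0 | S) under the Prior Per Student (PPS) bootstrapping scheme."""
    return PRIOR_CPT[first_response]
```

The returned value then replaces the single shared P(L0) when tracing that student's skill.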
Prior Individualization Approach

What values to use for the two priors?
1. Use ad-hoc values (e.g. P(L0|S=0) = 0.10, P(L0|S=1) = 0.85)
2. Learn the values (with EM)
3. Link them with the guess/slip CPT: P(L0|S=0) = Slip, P(L0|S=1) = 1 - Guess
Prior Individualization Approach
With ASSISTments, PPS (ad-hoc) achieved an R2 of 0.301 (0.176 with KT)
(Pardos & Heffernan, UMAP 2010)
Variations on Knowledge Tracing (and other models)
1. BKT-BF (Baker et al., 2010)
Learns values for the four parameters by performing a grid search (0.01 granularity) and chooses the set of parameters with the best squared error.
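A minimal sketch of the brute-force idea (my own illustrative Python, not the original code): sweep a grid over the four parameters and keep the set with the lowest sum of squared prediction errors. A 0.01 grid is roughly 10^8 combinations, so in practice you would use a coarser step or restrict the ranges; the procedure is otherwise the same.

```python
import itertools

def bkt_predictions(responses, p_l0, p_t, p_g, p_s):
    """Predicted P(correct) before each response in the sequence."""
    p_l, preds = p_l0, []
    for r in responses:
        preds.append(p_l * (1 - p_s) + (1 - p_l) * p_g)
        if r:
            post = p_l * (1 - p_s) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
        else:
            post = p_l * p_s / (p_l * p_s + (1 - p_l) * (1 - p_g))
        p_l = post + (1 - post) * p_t
    return preds

def brute_force_fit(sequences, step=0.05):
    """Grid-search all four parameters; return (best params, best SSE)."""
    grid = [round(step * i, 2) for i in range(1, int(1 / step))]
    best, best_sse = None, float("inf")
    for p_l0, p_t, p_g, p_s in itertools.product(grid, repeat=4):
        sse = sum((p - r) ** 2
                  for seq in sequences
                  for p, r in zip(bkt_predictions(seq, p_l0, p_t, p_g, p_s), seq))
        if sse < best_sse:
            best, best_sse = (p_l0, p_t, p_g, p_s), sse
    return best, best_sse
```

Because the search is exhaustive over the grid, it cannot get stuck in a local optimum the way EM can, at the cost of resolution and runtime.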
2. BKT-EM (Chang et al., 2006)
Learns values for the four parameters with Expectation Maximization (EM), maximizing the log likelihood fit to the data.
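EM maximizes the log likelihood of the response data. A minimal sketch (my own, for a single sequence) of how that likelihood is computed for the BKT HMM with the standard forward recursion; the E and M steps of full EM iterate expected counts on top of this and are omitted here:

```python
import math

def sequence_log_likelihood(responses, p_l0, p_t, p_g, p_s):
    """log P(responses | params) via the HMM forward recursion.
    State 'known' vs 'unknown'; no forgetting."""
    a_known, a_unknown = p_l0, 1 - p_l0        # forward probabilities
    log_lik = 0.0
    for r in responses:
        # Emission: P(response | state)
        e_known = (1 - p_s) if r else p_s
        e_unknown = p_g if r else (1 - p_g)
        a_known *= e_known
        a_unknown *= e_unknown
        # Normalize, accumulating the log of the scaling factor
        scale = a_known + a_unknown
        log_lik += math.log(scale)
        a_known, a_unknown = a_known / scale, a_unknown / scale
        # Transition: unknown -> known with probability P(T)
        a_known, a_unknown = a_known + a_unknown * p_t, a_unknown * (1 - p_t)
    return log_lik
```

EM picks the parameter set with the highest total log likelihood over all training sequences, whereas BKT-BF minimizes squared error; the two criteria can disagree.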
3. BKT-CGS (Baker, Corbett, & Aleven, 2008)
Guess and slip parameters are assessed contextually using a regression on features generated from student performance in the tutor.
4. BKT-CSlip (Baker, Corbett, & Aleven, 2008)
Uses the student's averaged contextual Slip parameter learned across all incorrect actions.
5. BKT-LessData (Nooraiei et al., 2011)
Limits each student's response sequence to the most recent 15 responses (max) during EM training.
6. BKT-PPS (Pardos & Heffernan, 2010)
Prior per student (PPS) model which individualizes the prior parameter. Students are assigned a prior based on their response to the first question.
7. CFAR (Yu et al., 2010)
Correct on First Attempt Rate (CFAR) calculates the student's percent correct on the current skill up until the question being predicted.

Student responses for Skill X: 0 1 0 1 0 1 _
Predicted next response would be 0.50
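CFAR is just a running mean, so the sketch is a one-liner (illustrative code, not the original):

```python
def cfar_predict(prior_responses):
    """Percent correct on the current skill so far = predicted next response."""
    return sum(prior_responses) / len(prior_responses)

print(cfar_predict([0, 1, 0, 1, 0, 1]))  # 0.5, as on the slide
```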
8. Tabling (Wang et al., 2011)
Uses the student's response sequence (max length 3) to predict the next response by looking up the average next response among students with the same sequence in the training set.

Training set
Student A: 0 1 1 0
Student B: 0 1 1 1
Student C: 0 1 1 1

Test set student: 0 0 1 _
Predicted next response would be 0.66

With the max table length set to 3, the table size was 2^0 + 2^1 + 2^2 + 2^3 = 15.
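A minimal sketch of building the table (my own illustrative code): for every position in every training sequence, record the next response under the key of the preceding (up to length-3) responses, then average.

```python
from collections import defaultdict

def build_table(training_sequences, max_len=3):
    """Map each observed preceding sequence (length 1..max_len)
    to the average next response in the training data."""
    table = defaultdict(list)
    for seq in training_sequences:
        for i in range(1, len(seq)):
            key = tuple(seq[max(0, i - max_len):i])
            table[key].append(seq[i])
    return {k: sum(v) / len(v) for k, v in table.items()}

train = [[0, 1, 1, 0], [0, 1, 1, 1], [0, 1, 1, 1]]
table = build_table(train)
print(table[(0, 1, 1)])  # 2/3, matching the slide's 0.66
```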
9. PFA (Pavlik et al., 2009)
Performance Factors Analysis (PFA). A logistic regression model which elaborates on the Rasch IRT model. Predicts performance based on the counts of the student's prior failures and successes on the current skill.
An overall difficulty parameter β is also fit for each skill or each item; in this study we use the variant of PFA that fits β for each skill. The PFA equation is:

m(i, j ∈ KCs, s, f) = Σ_j (β_j + γ_j s_{i,j} + ρ_j f_{i,j})
P(m) = 1 / (1 + e^(-m))
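A hedged sketch of the PFA prediction (my own code; the β, γ, ρ values used in the test are illustrative, not fitted ones):

```python
import math

def pfa_predict(counts, params):
    """counts: {skill: (successes, failures)} for the skills of the item.
    params: {skill: (beta, gamma, rho)} per-skill coefficients.
    Returns P(correct) via the logistic link."""
    m = sum(beta + gamma * s + rho * f
            for skill, (s, f) in counts.items()
            for beta, gamma, rho in [params[skill]])
    return 1 / (1 + math.exp(-m))
```

With no prior practice (s = f = 0) and β = 0, the prediction is 0.5; successes (γ > 0) push it up and failures (ρ < 0) push it down.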
Study: Dataset
• Cognitive Tutor for Genetics
– 76 CMU undergraduate students
– 9 skills (no multi-skill steps)
– 23,706 problem solving attempts
– 11,582 problem steps in the tutor
– 152 average problem steps completed per student (SD = 50)
– Pre- and post-tests were administered with this assignment
Study: Methodology (evaluation of model in-tutor prediction)
• Predictions were made by the 9 models using 5-fold cross-validation by student

[Table: per-response predictions from each model (BKT-BF, BKT-EM, …) next to the actual response, e.g. Student 1, Skill A, Resp 1: 0.10, 0.22, actual 0]

• Accuracy was calculated with A' for each student. Those values were then averaged across students to report the model's A' (higher is better)
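A' is equivalent to the area under the ROC curve: the probability that a randomly chosen correct response receives a higher prediction than a randomly chosen incorrect one (ties count as 1/2). A pure-Python sketch of the per-student averaging described above (illustrative, not the original evaluation code):

```python
def a_prime(predictions, actuals):
    """Rank-based A' (= AUC) for one student's predictions."""
    pos = [p for p, a in zip(predictions, actuals) if a == 1]
    neg = [p for p, a in zip(predictions, actuals) if a == 0]
    if not pos or not neg:
        return None                     # undefined if only one class observed
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_a_prime(per_student):
    """per_student: list of (predictions, actuals) pairs, one per student."""
    scores = [a_prime(p, a) for p, a in per_student]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)
```

Averaging per student (rather than pooling all actions) is why the study reports two sets of A' numbers later on.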
Study Results: in-tutor model prediction

Model          A'
BKT-PPS        0.7029
BKT-BF         0.6969
BKT-EM         0.6957
BKT-LessData   0.6839
PFA            0.6629
Tabling        0.6476
BKT-CSlip      0.6149
CFAR           0.5705
BKT-CGS        0.4857

A' results averaged across students
Significance (A' results averaged across students):
• No significant differences within the top group of BKT models
• Significant differences between those BKT models and PFA
Study: Methodology (ensemble in-tutor prediction)
• 5 ensemble methods were used, trained with the same 5-fold cross-validation folds
• The ensemble methods were trained using the 9 model predictions as the features and the actual response as the label

[Table: the same per-response prediction table as before; the model predictions are the features, the actual response is the label]
Ensemble methods used:
1. Linear regression with no feature selection (predictions bounded between {0,1})
2. Linear regression with feature selection (stepwise regression)
3. Linear regression with only BKT-PPS & BKT-EM
4. Linear regression with only BKT-PPS, BKT-EM & BKT-CSlip
5. Logistic regression
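A sketch of ensemble #1 under stated assumptions (NumPy least squares; a toy two-model feature matrix): the model predictions are the regressors, and the combined prediction is clipped to the [0,1] bounds mentioned above. The feature-selection and logistic variants differ only in the fitting step.

```python
import numpy as np

def fit_linreg_ensemble(features, labels):
    """features: (n_responses, n_models) model predictions;
    labels: 0/1 actual responses. Returns OLS weights with intercept."""
    X = np.column_stack([np.ones(len(features)), features])
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return w

def ensemble_predict(w, features):
    X = np.column_stack([np.ones(len(features)), features])
    return np.clip(X @ w, 0.0, 1.0)     # bound predictions between {0,1}
```

In the study the feature matrix would have 9 columns, one per model, with each row a single in-tutor response.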
Study Results: in-tutor ensemble prediction

Model                                               A'
Ensemble: LinReg with BKT-PPS, BKT-EM & BKT-CSlip   0.7028
Ensemble: LinReg with BKT-PPS & BKT-EM              0.6973
Ensemble: LinReg without feature selection          0.6945
Ensemble: LinReg with feature selection (stepwise)  0.6954
Ensemble: Logistic without feature selection        0.6854

A' results averaged across students
No significant difference between the ensembles
Study Results: in-tutor ensemble & model prediction

Model                                               A'
BKT-PPS                                             0.7029
Ensemble: LinReg with BKT-PPS, BKT-EM & BKT-CSlip   0.7028
Ensemble: LinReg with BKT-PPS & BKT-EM              0.6973
BKT-BF                                              0.6969
BKT-EM                                              0.6957
Ensemble: LinReg without feature selection          0.6945
Ensemble: LinReg with feature selection (stepwise)  0.6954
Ensemble: Logistic without feature selection        0.6854
BKT-LessData                                        0.6839
PFA                                                 0.6629
Tabling                                             0.6476
BKT-CSlip                                           0.6149
CFAR                                                0.5705
BKT-CGS                                             0.4857

A' results averaged across students
Study Results: in-tutor ensemble & model prediction

Model                                                     A'
Ensemble: LinReg with BKT-PPS, BKT-EM & BKT-CSlip         0.7451
Ensemble: LinReg without feature selection                0.7428
Ensemble: LinReg with feature selection (stepwise)        0.7423
Ensemble: Logistic regression without feature selection   0.7359
Ensemble: LinReg with BKT-PPS & BKT-EM                    0.7348
BKT-EM                                                    0.7348
BKT-BF                                                    0.7330
BKT-PPS                                                   0.7310
PFA                                                       0.7277
BKT-LessData                                              0.7220
CFAR                                                      0.6723
Tabling                                                   0.6712
Contextual Slip                                           0.6396
BKT-CGS                                                   0.4917

A' results calculated across all actions
In the KDD Cup
• Motivation for trying a non-KT approach: the Bayesian method only uses KC, opportunity count, and student as features. Much information is left unutilized, so another machine learning method is required.
• Strategy: engineer additional features from the dataset and use Random Forests to train a model.
Random Forests
Random Forests
• Strategy: create rich feature datasets that include features created from features not included in the test set

[Diagram: the raw training dataset rows are split into non-validation training rows (nvtrain), Validation set 1 (val1), and Validation set 2 (val2); feature-rich versions are built for each (frval1, frval2), plus a Feature Rich Test set (frtest) from the raw test dataset rows]
• Created by Leo Breiman
• The method trains T separate decision tree classifiers (50-800)
• Each decision tree selects a random 1/P portion of the available features (1/3)
• Each tree is grown until there are at least M observations in the leaf (1-100)
• When classifying unseen data, each tree votes on the class; the popular vote wins (or the votes are averaged, for regression)
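The configuration above can be sketched with scikit-learn's RandomForestRegressor (an assumption on my part: the talk used MATLAB's Statistics Toolbox, and the toy data and specific T/M values here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 6))                                    # toy feature matrix
y = (X[:, 0] + 0.1 * rng.random(200) > 0.5).astype(float)   # toy 0/1 target

rf = RandomForestRegressor(
    n_estimators=200,       # T trees (50-800 in the talk)
    max_features=1 / 3,     # random 1/P portion of features per split
    min_samples_leaf=5,     # grow until at least M observations per leaf
    random_state=0,
)
rf.fit(X, y)
preds = rf.predict(X)       # regression: average of the trees' votes
```

Because each leaf averages 0/1 targets, the regression output is already a probability-like value in [0, 1], which is convenient for RMSE-scored prediction tasks like the KDD Cup.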
Feature Importance: features extracted from the training set
• Student progress features (avg. importance: 1.67)
– Number of data points [today, since the start of unit]
– Number of correct responses out of the last [3, 5, 10]
– Z-score sum for step duration, hint requests, incorrects
– Skill-specific versions of all these features
• Percent correct features (avg. importance: 1.60)
– % correct of unit, section, problem, and step, and total for each skill and also for each student (10 features)
• Student Modeling Approach features (avg. importance: 1.32)
– The predicted probability of correct for the test row
– The number of data points used in training the parameters
– The final EM log likelihood fit of the parameters / data points
• Features of the user were more important in Bridge to Algebra than in Algebra
• Student progress / gaming-the-system features (Baker et al., UMUAI 2008) were important in both datasets
Algebra
Rank  Feature set          RMSE    Coverage
1     All features         0.2762  87%
2     Percent correct+     0.2824  96%
3     All features (fill)  0.2847  97%

Bridge to Algebra
Rank  Feature set          RMSE    Coverage
1     All features         0.2712  92%
2     All features (fill)  0.2791  99%
3     Percent correct+     0.2800  98%
• The best Bridge to Algebra RMSE on the Leaderboard was 0.2777
• The Random Forest RMSE of 0.2712 here is exceptional
• Skill data for a student was not always available for each test row
• Because of this, many skill-related feature sets only had 92% coverage
Conclusion from KDD
• Combining user features with skill features was very powerful in both modeling and classification approaches
• Model-tracing-based predictions performed formidably against pure machine learning techniques
• Random Forests also performed very well on this educational data set compared to other approaches such as Neural Networks and SVMs. This method could significantly boost accuracy in other EDM datasets.
Hardware/Software
• Software
– MATLAB used for all analysis
(Bayes Net Toolbox for the Bayesian Network models; Statistics Toolbox for the Random Forests classifier)
– Perl used for pre-processing
• Hardware
– Two Rocks clusters used for skill model training
(178 CPUs in total; training the KT models took ~48 hours when utilizing all CPUs)
– Two 32 GB RAM systems for Random Forests
(RF models took ~16 hours to train with 800 trees)
Time left? Choose the next topic:
• KT: slides 1-35
• Prediction: slides 36-67
• Evaluation: slides 47-77
• Sig tests: slides 69-77
• Regression/sig tests: slides 80-112
Individualize Everything?
Fully Individualized Model: the Student-Skill Interaction (SSI) Model
(Pardos & Heffernan, JMLR 2011)

Model Parameters
P(L0) = Probability of initial knowledge
P(L0|Q1) = Individual cold start P(L0)
P(T) = Probability of learning
P(T|S) = Students' individual P(T)
P(G) = Probability of guess
P(G|S) = Students' individual P(G)
P(S) = Probability of slip
P(S|S) = Students' individual P(S)
(Parameters in bold are learned from data while the others are fixed)

Node representations
K = Knowledge node, Q = Question node, S = Student node, Q1 = first response node, T = Learning node, G = Guessing node, S = Slipping node

Node states
K, Q, Q1, T, G, S = two state (0 or 1)
S (student) = multi state (1 to N, where N is the number of students in the training data)

[Diagram: the student node S conditions per-student learning (T), guess (G), and slip (S) nodes; the first response Q1 sets P(L0|Q1); the K chain uses P(T|S) transitions and P(G|S), P(S|S) emissions]
• S identifies the student
• T contains the CPT lookup table of individual student learn rates
• P(T) is trained for each skill, which gives a learn rate for P(T|T=1) [high learner] and P(T|T=0) [low learner]
SSI model results

[Chart: RMSE of PPS vs SSI on the Algebra and Bridge to Algebra datasets, y-axis roughly 0.279 to 0.286]

Dataset            New RMSE  Prev RMSE  Improvement
Algebra            0.2813    0.2835     0.0022
Bridge to Algebra  0.2824    0.2860     0.0036

The average improvement is the difference between 1st and 3rd place; it is also the difference between 3rd and 4th place.
The differences between PPS and SSI are significant in each dataset at the P < 0.01 level (t-test of squared errors).
(Pardos & Heffernan, JMLR 2011)