Predicting Student Performance / Classification / Recommendation System / Conclusion
Personalized Recommendation System for Students by Grade Prediction
Pavan Kotha
113050010
Master Thesis
Under the guidance of Prof. D. B. Phatak
Computer Science Department, IIT Bombay
Outline
Motivation
Problem Statement
Prediction of Endsem Marks and Grades
Theoretical Model for Personalized Recommendation System
Conclusion and Future Work
Motivation
Different learners have different modes of learning, educational backgrounds, previous experience, and intelligence.
Different people process knowledge in different ways, so different people will relate to a particular learning resource in different ways.
Human instructors can learn which style of presentation suits which learner and adjust their mode of presentation.
Different learners need to focus on different material to achieve the same learning objective.
Need to classify learners into different types (Achiever, Struggler, etc.).
Problem Statement
Proposed learning system: gives learning support based on the predicted grade of the student.
Different learning proposals are provided to students as feedback, according to the student's target class.
Need for Prediction / Regression
Need for Prediction
ML in education: study the data available in the educational field and bring out the hidden knowledge in it.
Instructors can identify weak students and help them score better marks in the endsem by giving additional practice questions.
Appropriate measures can be taken to improve their performance in the endsem, i.e. provide customized references so that a student can read those materials and improve his grade.
E.g. additional customized practice questions, different textbooks for different learners, etc.
Learning from Data
ML, a branch of AI, deals with the study of systems that can learn from data.
ML focuses on prediction based on known properties learned from the training data, e.g. classifying email as spam or non-spam.
So: learn from one offering of the CS101 course and predict the marks and grades of another offering.
Training Data
Statistical data is represented in terms of tuples.
The data consists of many attributes.
Training data D = {(x₁, y₁), …, (x_N, y_N)}
The target attribute y is to be predicted for the test data.
Training goal: learn a function f(x) → y s.t. the prediction error on unseen instances is small.
The learner extracts information about the underlying distribution.
Proposed System
Raw data → pre-processing → train data and test data. From the train data, the relationships between attributes are captured using linear approximation; the generated model predicts the endsem marks of the test data. Classification algorithms on the train data generate a model that predicts the grades of the test data, and the accuracy of the classifier is found.
Figure: Schematic diagram of system
Predicting Endsem Marks
Attributes taken: assignment, quiz, midsem. Project is excluded.
We don't have any information about the distribution of the data.
We can find the mean, variance, and correlation between attributes from the given training data.
There are relationships between the attributes, e.g. good midsem marks mostly go with good endsem marks.
We need to capture such relationships from the training data.
Linear Model in Single Variable
Linear function: Y = a + bX, which can best predict the target variable.
Y: endsem marks; X: assignment, quiz, or midsem marks.
How do we find the parameters a, b from the training data?
To obtain the best linear function, choose a and b s.t. the prediction error is minimised.
Optimization objective: minimise E[(Y − (a + bX))²]
Solution: a = E[Y] − b·E[X], b = ρ·σ_Y/σ_X
Linear Model in Single Variable (Example)
Figure: Best line fit for data points
Four (x, y) data points: (1, 6), (2, 5), (3, 7), and (4, 10)
Linear Model in Single Variable (Example) contd.
We need to find a line y = a + bx that best fits these four points (the training data).
Best approximation: minimise the sum of squared differences (i.e. the error) between the predicted values and the corresponding actual values.
Error function: S(a, b) = [6 − (a + b)]² + [5 − (a + 2b)]² + [7 − (a + 3b)]² + [10 − (a + 4b)]²
Equate the partial derivatives of S(a, b) with respect to a and b to zero. We get a = 3.5, b = 1.4.
The best-fit line is y = 3.5 + 1.4x.
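This calculation can be checked with a short pure-Python sketch over the same four points, using the closed-form least-squares solution for a single variable:

```python
def fit_line(points):
    """Least-squares fit of y = a + b*x over (x, y) pairs; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # b = Cov(X, Y) / Var(X); a = E[Y] - b*E[X]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line([(1, 6), (2, 5), (3, 7), (4, 10)])
print(round(a, 3), round(b, 3))  # 3.5 1.4
```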
Linear Model in Single Variable (Example) contd.
endsem = f(midsem)
endsem = f(assignment)
endsem = f(quiz)
The average of the three predicted endsem marks is taken.
Linear Model in Multiple Variables
endsem = f(quiz, assignment, midsem)
Mathematically: b₀ + Σ_{j=1}^{3} X_ij b_j = y_i, (i = 1, 2, …, m)
In matrix form Xb = y, where X is the m × 4 matrix whose i-th row is (1, X_i2, X_i3, X_i4), b = (b₀, b₁, b₂, b₃)ᵀ, and y = (y₁, y₂, …, y_m)ᵀ.
This is an overdetermined system with m linear equations and 4 unknowns (b₀, b₁, b₂, b₃).
Linear Model in Multiple Variables contd.
Goal: find the coefficients b which fit the equations best in the sense of minimising the mean square error (= E[(actual value − predicted value)²]).
b̂ = argmin_b S(b), where b̂ is the optimal value of b and the objective function is S(b) = Σ_{i=1}^{m} |y_i − (b₀ + Σ_{j=1}^{3} X_ij b_j)|² = ‖y − Xb‖².
Differentiating and setting to zero gives b̂ = (XᵀX)⁻¹Xᵀy.
endsem marks = b₀ + b₁·assignment + b₂·quiz + b₃·midsem
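A minimal pure-Python sketch of the normal equations b̂ = (XᵀX)⁻¹Xᵀy: the (assignment, quiz, midsem) rows and the generating coefficients 5, 0.5, 1.0, 1.5 are invented for illustration, and the square system XᵀXb = Xᵀy is solved by Gaussian elimination:

```python
def solve(A, v):
    """Solve the square system A x = v by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def least_squares(X, y):
    """b_hat = (X^T X)^-1 X^T y, computed by solving the normal equations."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    return solve(XtX, Xty)

# Hypothetical (assignment, quiz, midsem) rows, each with a leading 1 for b0;
# endsem is generated as 5 + 0.5*assignment + 1.0*quiz + 1.5*midsem.
X = [[1, a, q, m] for a, q, m in
     [(8, 6, 20), (5, 9, 15), (7, 7, 25), (9, 4, 18), (6, 8, 22), (4, 5, 10)]]
y = [5 + 0.5 * r[1] + 1.0 * r[2] + 1.5 * r[3] for r in X]
b = least_squares(X, y)
print([round(v, 6) for v in b])  # recovers [5.0, 0.5, 1.0, 1.5]
```

Because the synthetic marks fit a linear model exactly, least squares recovers the generating coefficients up to rounding.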
Root mean square error
Endsem Prediction | Root Mean Square Error
Linear Model in Single Variable | 5.09
Linear Model in Multiple Variables | 5.1
Table: Root Mean Square Error
A good value for the RMSE depends on the data we are trying to model or estimate.
Analysis result: 5% to 20% error is a good estimate.
Our error is about 5%, so we have achieved good estimates of endsem marks using our model.
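For reference, the RMSE itself is a one-liner (the marks below are hypothetical):

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error between two equal-length mark sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical endsem marks vs. model predictions.
print(round(rmse([70, 55, 80], [65, 60, 78]), 2))  # 4.24
```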
Decision Trees / Nearest Neighbour / LDA / SVM / Classifiers Combination
Classification
Classification: a type of prediction where the predicted variable is binary or categorical.
Classification methods (decision trees, nearest neighbour, LDA, SVM) are applied on educational data for predicting the grade in the examination.
Classifier: a mathematical function, implemented by a classification algorithm, that maps input data to a category.
No single classifier works best on all given problems.
Determining a suitable classifier for a given problem is more an art than a science.
The type of classifier to be chosen depends on the problem we are trying to solve.
Dataset
Figure: Training Dataset for playing tennis
[Tom Mitchell]
Play Tennis Dataset
The target attribute to be predicted is Play Tennis.
It is a classification problem because Play Tennis = {yes, no}.
The learned function is represented by a decision tree.
Learned trees can also be represented by if-then rules.
Decision Tree for Dataset
Figure: Learned Decision Tree
[Tom Mitchell]
Predict Target Variable
The decision tree classifies whether the morning of a day is suitable for playing tennis or not.
Attributes of the morning: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong.
Prediction: PlayTennis = No.
Given a new tuple, we predict whether tennis can be played by traversing the decision tree.
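A minimal sketch of this traversal, with the learned tree from the figure encoded as nested dictionaries:

```python
# Learned tree from the figure: internal nodes are {attribute: {value: subtree}};
# leaves are PlayTennis labels.
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(node, example):
    """Walk from the root, following the branch matching each tested attribute."""
    while isinstance(node, dict):
        attribute = next(iter(node))  # the attribute tested at this node
        node = node[attribute][example[attribute]]
    return node

day = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, day))  # No
```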
Statistical Test
A statistical test determines how well each attribute alone classifies the training examples.
The best attribute is selected as the test at the root node of the tree.
Training examples are split to the branch corresponding to each example's value for this attribute.
Information Gain
The attribute to be tested at a node for classifying examples depends on the information gain (a statistical property) of the attribute.
Information gain measures how well a given attribute separates the training examples according to their target classification.
Decision Tree Learning Algorithm
S: training examples; the target attribute takes c different values.
Entropy of S relative to this c-wise classification:
• Entropy(S) = Σ_{i=1}^{c} −p_i log₂ p_i, where p_i is the proportion of S belonging to class i.
Information gain of attribute A:
• Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) Entropy(S_v), where Values(A) is the set of all possible values for attribute A and S_v is the subset of S for which attribute A has value v.
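These two formulas can be sketched directly in Python; the counts used below are the play-tennis counts from the next slides ([9+, 5−] split by Wind):

```python
from math import log2

def entropy(counts):
    """Entropy(S) = sum_i -p_i * log2(p_i) over the class counts of S."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(s_counts, subset_counts):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    total = sum(s_counts)
    return entropy(s_counts) - sum(sum(sv) / total * entropy(sv)
                                   for sv in subset_counts)

# Play-tennis counts: S = [9+, 5-]; Wind splits S into [6+, 2-] and [3+, 3-].
print(round(gain([9, 5], [[6, 2], [3, 3]]), 3))  # 0.048
```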
Learning Algorithm
In the given dataset, S is a collection of 14 examples.
The attribute Wind has the values Weak or Strong.
There are 9 positive (i.e. play tennis = yes) and 5 negative examples, denoted [9+, 5−].
Of these 14 examples, 6 of the positive and 2 of the negative examples have Wind = Weak.
Learning Algorithm
The information gain from attribute Wind is calculated as follows:
Values(Wind) = {Weak, Strong}
S = [9+, 5−], S_Weak = [6+, 2−], S_Strong = [3+, 3−]
Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v|/|S|) Entropy(S_v)
= Entropy(S) − (8/14)·Entropy(S_Weak) − (6/14)·Entropy(S_Strong)
= 0.940 − (8/14)·0.811 − (6/14)·1.00
= 0.048
Information Gains of attributes
Information gain selects the best attribute at each step in constructing the tree.
Information gain values for all four attributes are:
• Gain(S, Outlook) = 0.246
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029
Constructing Tree
The Outlook attribute provides the best prediction of the target attribute.
Outlook is selected as the decision attribute for the root node.
Training examples are split to the branch corresponding to each example's value.
Every example with Outlook = Overcast is a positive example.
The Outlook = Sunny and Outlook = Rain branches do not contain all positive examples.
Repeat the procedure at each such branch using only the training examples associated with that node.
Partial Decision Tree
Figure: Partial Decision Tree
[Tom Mitchell]
Grades prediction using Decision Trees
Showing the marks of other students is not acceptable (it is not done in the US).
Build a decision tree using historical information (the model works per professor).
Attributes are assignment, quiz, midsem, project, and predicted endsem marks.
A student yet to write the endsem can know his grade approximately.
He can then improve his performance in the endsem and improve his grade.
Accuracy using Decision Tree Classifier
Classifier Name | Accurate Accuracy (Single Variable) | Approx. Accuracy (Single Variable) | Accurate Accuracy (Multi Variable) | Approx. Accuracy (Multi Variable)
Decision Tree | 65.32 | 98.24 | 65.14 | 98.42
Table: Accuracy using Decision Trees Classifier
Figure: Accuracy using Decision Trees
Nearest Neighbour Algorithm
A non-parametric method for classifying objects based on the closest training examples.
An object is assigned to the class of its nearest neighbour.
Neighbours are taken from a set of objects for which the correct classification is known.
Training examples: vectors in a multidimensional feature space, each with a class label.
A test point is classified by assigning it the label of the training example closest to that point (with Euclidean distance as the distance metric).
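A minimal 1-nearest-neighbour sketch; the feature vectors and grade labels below are hypothetical:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def nearest_neighbour(train, x):
    """Assign x the class label of the closest training vector."""
    _, label = min(train, key=lambda pair: dist(pair[0], x))
    return label

# Hypothetical (assignment, quiz, midsem) vectors with grade labels.
train = [((8, 9, 25), "AA"), ((5, 6, 15), "BB"), ((2, 3, 8), "DD")]
print(nearest_neighbour(train, (7, 8, 22)))  # AA
```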
Example for Nearest Neighbour Algorithm
Figure: Example for Nearest Neighbour
Accuracy achieved using Nearest Neighbour Algorithm
Classifier Name | Accurate Accuracy (Single Variable) | Approx. Accuracy (Single Variable) | Accurate Accuracy (Multi Variable) | Approx. Accuracy (Multi Variable)
Nearest Neighbour | 66.37 | 97.89 | 66.02 | 98.59
Table: Accuracy using Nearest Neighbour Classifier
Figure: Accuracy using Nearest Neighbour
Nearest Neighbour Algorithm with Clustering
Drawback: misclassification occurs when the class distribution is skewed.
The distribution of grades is not uniform; grades like BB and BC dominate.
Examples of a more frequent class tend to dominate the prediction of a new example.
We reduce the data set by replacing a cluster of similar grades, regardless of their density in the original training data, with a single point: its cluster centre.
Nearest neighbour is then applied to the reduced data set.
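One way to sketch this reduction, assuming one cluster per grade summarized by its mean vector (the data points here are made up):

```python
from math import dist

def class_centres(train):
    """Summarize each grade's cluster of points by a single centre (mean vector)."""
    groups = {}
    for x, label in train:
        groups.setdefault(label, []).append(x)
    return [(tuple(sum(col) / len(pts) for col in zip(*pts)), label)
            for label, pts in groups.items()]

# A skewed training set: the BB grade dominates.
train = [((5, 6), "BB"), ((6, 6), "BB"), ((5, 7), "BB"), ((6, 7), "BB"), ((9, 9), "AA")]
reduced = class_centres(train)  # one representative point per grade
predicted = min(reduced, key=lambda pair: dist(pair[0], (8, 9)))[1]
print(predicted)  # AA
```

After the reduction each class contributes exactly one point, so the frequent class can no longer outvote its neighbours.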
Accuracy using Nearest Neighbour Algorithm with Clustering
Classifier Name | Accurate Accuracy (Single Variable) | Approx. Accuracy (Single Variable) | Accurate Accuracy (Multi Variable) | Approx. Accuracy (Multi Variable)
Nearest Neighbour Cluster | 66.55 | 98.77 | 63.56 | 98.94
Table: Accuracy using Nearest Neighbour Cluster Classifier
Figure: Accuracy using Nearest Neighbour Cluster
Linear Discriminant Analysis
LDA finds a linear combination of features which separates two or more classes of objects.
Training set: attribute vectors x with known class labels y.
Classification problem: find a good predictor of the class y given only an observation x.
LDA assumes the class-conditional density P(x | y = 0) ~ N(μ₀, Σ), where μ₀ is the vector of means of the d attributes over the samples whose target label is 0.
Similarly P(x | y = 1) ~ N(μ₁, Σ), where μ₁ is the vector of means of the d attributes over the samples whose target label is 1.
Linear Discriminant Analysis (contd.)
Σ is the shared d × d covariance matrix, where d is the number of attributes.
To classify a new observation x, we assign it label y = 0 if P(y = 0 | x) > P(y = 1 | x).
Solving this inequality results in a linear decision boundary.
For multi-class LDA, the predicted grade is the class i maximising P(y = i | x), i = 1, …, c.
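A one-dimensional, two-class sketch of this decision rule, assuming a shared (pooled) variance; the marks below are hypothetical, and the posteriors are compared via their log scores:

```python
from math import log

def fit_lda_1d(xs, ys):
    """Two-class LDA on a single attribute with a shared (pooled) variance."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    mu0, mu1 = sum(x0) / len(x0), sum(x1) / len(x1)
    var = (sum((x - mu0) ** 2 for x in x0)
           + sum((x - mu1) ** 2 for x in x1)) / (len(xs) - 2)
    p0, p1 = len(x0) / len(xs), len(x1) / len(xs)
    return mu0, mu1, var, p0, p1

def predict(model, x):
    """Assign y = 0 iff P(y=0|x) > P(y=1|x); with a shared variance the
    log-posteriors reduce to linear discriminant scores in x."""
    mu0, mu1, var, p0, p1 = model
    score0 = x * mu0 / var - mu0 ** 2 / (2 * var) + log(p0)
    score1 = x * mu1 / var - mu1 ** 2 / (2 * var) + log(p1)
    return 0 if score0 > score1 else 1

# Hypothetical midsem marks: label 0 = low scorers, label 1 = high scorers.
model = fit_lda_1d([20, 22, 25, 40, 43, 45], [0, 0, 0, 1, 1, 1])
print(predict(model, 24), predict(model, 42))  # 0 1
```

With equal priors the boundary falls midway between the two class means, which is why the decision is linear in x.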
Accuracy using LDA

Classifier Name | Accurate Accuracy (Single Variable) | Approx. Accuracy (Single Variable) | Accurate Accuracy (Multi Variable) | Approx. Accuracy (Multi Variable)
LDAC | 63.73 | 98.06 | 64.26 | 98.59
Table: Accuracy using LDAC Classifier
Figure: Accuracy using LDAC
Support Vector Machines
SVM is based on the concept of decision planes that define decision boundaries.
A decision plane separates objects that have different class memberships.
Many hyperplanes exist that classify the data.
SVM tries to find the hyperplane with the maximum margin.
Most classification tasks need complex decision boundaries.
Support Vector Machines Example
Figure: Linearly Separable Data
Figure: Non-Linearly Separable Data
Support Vector Machines Example
Training data: D = {(x_i, y_i) | x_i ∈ ℝᵖ, y_i ∈ {−1, 1}}, i = 1, …, n.
Find a hyperplane that separates the two classes and maximises the margin.
The equation of a hyperplane (e.g. line 1) is of the form w · x − b = 0.
Hyperplane 2: w · x − b = 1 (assume the green objects have class label 1).
Hyperplane 3: w · x − b = −1 (assume the red objects have class label −1).
The distance between hyperplanes 2 and 3 is 2/‖w‖.
For maximum margin, minimise ‖w‖.
Support Vector Machines Example
As the training set is linearly separable, we add the constraints:
w · x_i − b ≥ 1 for x_i in the green class,
w · x_i − b ≤ −1 for x_i in the red class.
Combining the two, we get y_i(w · x_i − b) ≥ 1 for all 1 ≤ i ≤ n.
The optimization objective for SVM then becomes: minimise (in w, b) ½ wᵀw subject to the constraints y_i(w · x_i − b) ≥ 1 for all i = 1, …, n.
Working procedure of SVM
Figure: Linearly Separable in Feature Space
Support Vector Machines for linearly non-separable data
Introduce non-negative slack variables ξ_i ≥ 0, one slack variable for each training point.
ξ_i measures the degree of misclassification of the data point x_i.
The slack variable is 0 for data points that are on the margin or on its correct side.
ξ_i = |y_i − (w · x_i − b)| for the other points.
For a data point on the decision boundary, w · x_i − b = 0, hence ξ_i = 1.
Points with ξ_i > 1 are misclassified.
Support Vector Machines for linearly non-separable data (contd.)
The optimization objective then becomes: minimise ½ wᵀw + C Σ_{i=1}^{N} ξ_i subject to the constraints y_i(wᵀΦ(x_i) − b) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, …, N.
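A minimal sketch of training such a soft-margin linear SVM by subgradient descent, written in the equivalent regularised hinge-loss form (λ/2)‖w‖² + (1/N)·Σᵢ max(0, 1 − yᵢ(w·xᵢ − b)) with Φ taken as the identity; the toy 2-D data and the constants `lam`, `lr`, `epochs` are all assumptions for illustration:

```python
def train_linear_svm(data, lam=0.01, lr=0.01, epochs=200):
    """Subgradient descent on (lam/2)*||w||^2 + (1/N)*sum max(0, 1 - y*(w.x - b))."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * (w[0] * x[0] + w[1] * x[1] - b) < 1:
                # margin violated: hinge term contributes -y*x for w and +y for b
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b -= lr * y
            else:
                # only the regulariser contributes
                w = [wi - lr * lam * wi for wi in w]
    return w, b

# Toy linearly separable 2-D data with labels in {-1, +1}.
data = [((2, 2), 1), ((3, 1), 1), ((-2, -2), -1), ((-1, -3), -1)]
w, b = train_linear_svm(data)
predictions = [1 if w[0] * x[0] + w[1] * x[1] - b > 0 else -1 for x, _ in data]
print(predictions)  # [1, 1, -1, -1]
```

Here λ plays the role of 1/(NC): a larger C (smaller λ) penalises slack more heavily.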
Accuracy using SVM Classifier
Classifier Name | Accurate Accuracy (Single Variable) | Approx. Accuracy (Single Variable) | Accurate Accuracy (Multi Variable) | Approx. Accuracy (Multi Variable)
SVM Linear | 64.44 | 98.59 | 64.79 | 98.94
SVM RBF | 61.8 | 97.89 | 66.37 | 98.42
Table: Accuracy using SVM Classifier
Figure: Accuracy using Linear Kernel Function
Accuracy using SVM Classifier contd
Figure: Accuracy using Gaussian Kernel Function
Combination of Classifiers
The previous models were single classifiers: decision trees, LDA, nearest neighbour, SVM.
To improve accuracy, we used a combination of classifiers.
We used many classifiers because it is not possible to come up with a single classifier that gives good results everywhere.
The optimal classifier depends on the problem domain.
Combining multiple classifiers(CMC)
Method 1: choose the classifier which has the least error rate on the given dataset.
• Nearest neighbour with clustering of similar classes has the least error rate.
Method 2: choose the class getting the maximum votes.
If most classifiers predict that a student fails, then we assign the "fail" label to that student.
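Method 2 can be sketched in a few lines; the per-classifier votes below are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Method 2: each classifier votes; the class with the most votes wins."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical grade predictions for one student
# (e.g. from tree, NN, LDA, SVM-linear, SVM-rbf).
votes = ["BB", "BC", "BB", "BB", "CC"]
print(majority_vote(votes))  # BB
```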
Combination of Classifiers
Classifier Name | Accurate Accuracy (Single Variable) | Approx. Accuracy (Single Variable) | Accurate Accuracy (Multi Variable) | Approx. Accuracy (Multi Variable)
Classifiers Combination | 63.38 | 97.89 | 65.49 | 98.59
Table: Accuracy using Combination of Classifiers(Method2)
Figure: Accuracy using Classifiers Combination
Difficult Classification Problem
We tried various ways to improve accuracy, but always got 63 to 67%.
Analysis result: with p = 11 this is a difficult classification problem.
Even the best methods achieve around 40% error on test data, so we have achieved good accuracy. [18]
The only way to improve accuracy further is to reduce the number of class labels.
Target Classes Reduced
We grouped the grades into four classes:
"high" grades: AB, AA, AP
"middle" grades: CC, BC, BB
"low" grades: DD, CD
"fail": FR and FF
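The grouping is a simple lookup table:

```python
# Ten grade labels mapped to the four coarse classes above.
GROUPS = {"AP": "high", "AA": "high", "AB": "high",
          "BB": "middle", "BC": "middle", "CC": "middle",
          "CD": "low", "DD": "low",
          "FR": "fail", "FF": "fail"}

def reduce_labels(grades):
    """Map the ten-grade labels down to the four coarse target classes."""
    return [GROUPS[g] for g in grades]

print(reduce_labels(["AA", "BC", "DD", "FF"]))  # ['high', 'middle', 'low', 'fail']
```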
Accuracy when target classes reduced
Classifier Name | Accurate Accuracy (Single Variable)
Nearest Neighbour | 89.43
Table: Accuracy using Nearest Neighbour Classifier when Labels Reduced
Accuracy improved to 89%, compared to 66% when there are 10 class labels.
This clearly depicts that accuracy depends on the number of target classes.
Comparison Among Different Classifiers
Classifier Name | Accurate Accuracy (Single Variable) | Approx. Accuracy (Single Variable) | Accurate Accuracy (Multi Variable) | Approx. Accuracy (Multi Variable)
Decision Tree | 65.32 | 98.24 | 65.14 | 98.42
SVM Linear | 64.44 | 98.59 | 64.79 | 98.94
SVM RBF | 61.8 | 97.89 | 66.37 | 98.42
Nearest Neighbour | 66.37 | 97.89 | 66.02 | 98.59
Nearest Neighbour Cluster | 66.55 | 98.77 | 63.56 | 98.94
LDAC | 63.73 | 98.06 | 64.26 | 98.59
Classifiers Combination | 63.38 | 97.89 | 65.49 | 98.59
Table: Comparison Among Different Classifiers
Motivation for Recommendation System
No two students are identical: they learn at different rates and have different educational backgrounds, intellectual capabilities, and modes of learning.
We need to design a real-time recommendation system which captures the characteristics of each student.
We need to ensure that every student progresses through the course material so that he maximizes his learning.
Doing this manually is an immensely complex task; no personal care is given to individual students.
Theoretical Model for Recommendation System
Customized suggestions require knowing the student's mastery level on each topic.
Consider the question difficulty level, e.g. two students both got 5/10 marks.
Test questions do not have uniform characteristics.
Model question-level performance instead of aggregate performance.
Give a different internal score, which reflects the mastery level in the topic.
Theoretical Model for Recommendation System contd
Extract the information each question provides about a student.
The probability of a correct response to a test question is a mathematical function of parameters such as a person's abilities and the question's characteristics (such as difficulty, guessability, and topic specificity).
The model helps to better understand how a student's performance on tests relates to his ability.
The system learns about the user and updates its parameters about the user, so as to make proper suggestions and help him achieve his learning objectives.
Theoretical Model for Recommendation System contd
If a student struggles with a particular set of questions, our system should know where that student's weaknesses lie in relation to the concepts assessed by those questions.
The system should then deliver content to increase the student's proficiency on those concepts.
Personalized learning cannot be done just based on total marks.
Attributes for each question: difficulty level, time taken to answer, number of attempts to get the correct answer, marks secured, probability of guessing the answer.
Theoretical Example
(Graph nodes: C declarations and datatypes; control structures: for, while, decision, case; arrays; strings; structures; pointers; structures and arrays; functions and pointers)
Figure: Dependency Graph for C Language Course
Theoretical Example contd
(Graph nodes: difficulty of topic, intelligence level in prior topics, marks secured in topic, and guessed answer, all feeding into the internal score on the topic)
Figure: DAG for Internal Score
Theoretical Example contd
Calculate an internal score for the student in each topic, reflecting his mastery of the topic.
For each node in the dependency graph, the attributes are the mastery level on the topic, the intelligence level of the student, the difficulty level of the topic, etc.
Classification algorithms can be applied on this data to classify the student as weak, average, or intelligent at topic level, and accordingly suggest personalized learning resources.
Topic-level recommendations: new video tutorials on the topic, personalized materials, personalized problem sets, etc.
Conclusion
Predicting the endsem marks and grade of a student a priori, and providing customized learning materials, improves performance in the endsem and the grade.
Learners have different backgrounds and previous experience, so different learners need to focus on different material to achieve the same learning objective.
The proposed learning system gives learning support based on individual learning characteristics.
Different learning proposals are provided to students as feedback, according to the learner's learning rate.
Future Work
(System flow: chapter selection → main menu interface → problem-based exam learning → feedback with customized references)
Figure: Proposed System
Personalized resource recommendation at many granularities: problem level, topic level, course level.
Problem level: redirect the student to a short video-lecture segment.
Continuously learn from the system, not single-point learning.
Identify students struggling with particular questions, and how similar students improved.
References I
Anna Dollar et al., "Enhancing Traditional Classroom Instruction with Web-based Statics Course", 37th ASEE/IEEE Frontiers in Education Conference, 2007.
Evgeniya Georgieva et al., "Design of an e-Learning Content Visualization Module", 3rd E-Learning Conference, 2006.
Rodica Mihalca et al., "Overview of Tutorial Systems based on Information Technology", 3rd E-Learning Conference, 2006.
Herbert A. Simon et al., "A Game Changer: The Open Learning Initiative", from www.cmu.edu, 2012.
Libing Jiang et al., "Development of electronic teaching material of modern P.E. educational technology upon problem-based learning", International Conference on e-Education, Entertainment and e-Management, 2011.
Stephnaos Mavromoustakos et al., "Human and Technological Issues in the E-Learning Engineering Process", journal article on ResearchGate.
Jodi L. Davenport et al., "When do diagrams enhance learning? A framework for designing relevant representations", Proceedings of ICLS '08.
Tanja Arh et al., "A Case Study of Usability Testing: the SUMI Evaluation Approach of the EducaNext Portal", WSEAS Transactions on Information Science and Applications, February 2008.
Mirjam Kock et al., "Towards Intelligent Adaptive E-Learning Systems: Machine Learning for Learner Activity Classification", European Conference on Technology Enhanced Learning, 2010.
References II
Behrouz et al., "Predicting Student Performance: An Application of Data Mining Methods with an Educational Web-Based System", IEEE Frontiers in Education, 2003.
Mingming Zhou et al., "Data Mining and Student e-Learning Profiles", International Conference on E-Business and E-Government, 2010.
S. Anupama Kumar et al., "Efficiency of Decision Trees in Predicting Students' Academic Performance", CS and IT-CSCP, 2011.
R. K. Kabra et al., "Performance Prediction of Engineering Students using Decision Trees", International Journal of Computer Applications, December 2011.
http://www.meritnation.com/ (referred on 30th Aug 2012)
http://www.topperlearning.com/ (referred on 30th Aug 2012)
http://www.learnzillion.com/ (referred on 31st Aug 2012)
www.it.iitb.ac.in/ekshiksha/ (referred on 31st Aug 2012)
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2001, p. 85.