Nurissaidah Ulinnuha

A Study of Academic Performance using Random Forest, Artificial Neural Network, Naïve Bayesian and Logistic Regression


Introduction

Student academic performance (1990-2010)
• Logistic Regression
• Naïve Bayesian
• Artificial Neural Network

Student academic performance (2011)
• Random Forest Decision Tree

LITERATURE REVIEW

Artificial Neural Network

Superiority
• ANN is useful for applications in several areas, including pattern recognition, classification, forecasting, and process control.
• Robust to noisy datasets.

Limitation
• ANNs do not have parametric statistical properties (e.g., they do not have individual coefficient or model significance tests based on the t and F distributions).
• ANN may converge to local instead of global minima, thereby providing non-optimal data fits.
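The local-minimum limitation can be illustrated with a minimal sketch. It assumes a toy one-dimensional function with two minima (chosen here for illustration, not an actual ANN loss) and plain gradient descent; the function, learning rate, and starting points are all hypothetical.

```python
def f(x):
    # Toy one-dimensional "loss" with two minima; not an actual ANN loss.
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Analytical derivative of f.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    # Plain gradient descent from starting point x0.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_left = gradient_descent(-2.0)   # settles in the deeper (global) minimum
x_right = gradient_descent(2.0)   # settles in the shallower (local) minimum
print(round(x_left, 2), round(f(x_left), 2))
print(round(x_right, 2), round(f(x_right), 2))
```

Both runs stop where the gradient is zero, but only one of them reaches the lower value of f. Which minimum is found depends only on the starting point, which is the behaviour the limitation describes.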

Logistic Regression

Superiority
• LR provides information about the significance of each predictor.
• There is no assumption about the normality of the dataset.

Limitation
• Only able to work with a binary criterion variable.

Naïve Bayessian

Superiority
• Naïve Bayesian requires less training data than other classification methods.

Limitation
• The dataset should satisfy the independence assumption.
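The independence assumption can be stated compactly. In the sketch below, the notation is introduced here for illustration and does not come from the slides: C is the class (A or B) and x_1, ..., x_n are the predictor variables, assumed conditionally independent given the class.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

If predictors are strongly correlated, the product over-counts the shared evidence, which is why the assumption is listed as a limitation.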

Random Forest Decision Tree

Superiority
• Random Forest runs efficiently on large databases.
• Random Forest can handle thousands of input variables without variable deletion.
• Random Forest gives estimates of which variables are important in the classification.
• Random Forest has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
• Random Forest is able to do classification, clustering and outlier detection.

Limitation
• Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.
• Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret.

Prior Research

Mukta Paliwal and Usha Kumar

Title
Academic performance of business school graduates using neural network and statistical techniques.

Overview
This research compares ANN with several statistical techniques. Paliwal concludes that neural network techniques outperform regression analysis for prediction problems, whereas the performance of neural networks is comparable to logistic regression and discriminant analysis for classification problems.

J. Zimmerman

Title
Predicting graduate-level performance from undergraduate achievements

Result
This research predicts graduate-level performance using a random forest decision tree. It shows that random forest is able not only to classify but also to explain the significance of variables.

Data and Methods

Raw data

Graduation data of Informatics Engineering magister students, ITS (2008-2011)

| Student ID | Date of birth | Magister GPA | Marital status | Gender | Years of graduation | Scholar GPA | Scholar university | Work status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 51072010xx | 5/28/1972 | 3.88 | T | L | 99 | 2.72 | ITS | FTIF-ITS |
| 510720101x | 12/17/1977 | 3.80 | T | L | 100 | 3.09 | ITS | T.INFORMATIKA ITS |
| 510720101x | 8/3/1979 | 3.75 | T | P | 99 | 3.19 | ITS | PPNS-ITS |
| 510720101x | 1/16/1976 | 3.31 | T | L | 98 | 3.38 | Universitas Dr. Soetomo | Politeknik Negeri Samarinda |
| 510720101x | 8/28/1979 | 3.50 | T | L | 99 | 3.37 | Universitas Hasanuddin | Universitas Tadulako |
| 510720101x | 8/11/1973 | 3.46 | T | L | 99 | 3.15 | STIKOM | STIKOM |
| 5107201x0x | 7/16/1982 | 3.59 | T | L | 101 | 3 | Institut Africain de Managemen | Institut Africain de Managemen |

Preprocess (165 records)

• Filter out records with null values
• Convert all attributes to numeric values
• Convert the class attribute to a nominal value
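The three preprocessing steps can be sketched in plain Python. This is a minimal illustration, not the author's actual pipeline (the slides name Weka as the tool). The field names, the sample records, and the code mappings (including the "Y" code for married) are assumptions. The class threshold of 3.5 follows the class definition given for the dataset.

```python
# Hypothetical raw records; the field names and the code mappings below are
# assumptions for illustration, not the author's actual schema.
raw_records = [
    {"married": "T", "gender": "L", "scholar_gpa": 2.72, "magister_gpa": 3.88},
    {"married": "T", "gender": "P", "scholar_gpa": 3.19, "magister_gpa": 3.75},
    {"married": "T", "gender": "L", "scholar_gpa": None, "magister_gpa": 3.31},
]

# Assumed encodings: T -> 0 (not married), Y -> 1; P -> 0 (woman), L -> 1 (man).
MARRIED_CODES = {"T": 0, "Y": 1}
GENDER_CODES = {"P": 0, "L": 1}


def preprocess(records):
    cleaned = []
    for rec in records:
        # Step 1: drop records that contain a null value.
        if any(value is None for value in rec.values()):
            continue
        cleaned.append({
            # Step 2: convert attributes to numeric values.
            "married": MARRIED_CODES[rec["married"]],
            "gender": GENDER_CODES[rec["gender"]],
            "scholar_gpa": float(rec["scholar_gpa"]),
            # Step 3: convert the class attribute to a nominal label.
            "class": "A" if rec["magister_gpa"] > 3.5 else "B",
        })
    return cleaned


dataset = preprocess(raw_records)
print(dataset)
```

The third sample record is dropped because its scholar GPA is missing, which mirrors how filtering reduces the number of usable records.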

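Dataset

Graduation data of Informatics Engineering magister students, ITS (2008-2011)

| No | Marital Status | Gender | Scholar university | Period of study | Work status | Scholar GPA | Age (new student) | Class (Magister GPA) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 10 | 4 | 1 | 3.32 | 31 | B |
| 2 | 0 | 1 | 10 | 3 | 1 | 3.57 | 28 | B |
| 3 | 0 | 1 | 9 | 2 | 1 | 3.13 | 35 | A |
| 4 | 0 | 1 | 9 | 2 | 1 | 3.2 | 30 | A |
| 5 | 0 | 1 | 9 | 4 | 1 | 2.96 | 28 | A |
| 6 | 0 | 1 | 9 | 2 | 1 | 3.01 | 30 | A |
| 7 | 0 | 1 | 10 | 2 | 0 | 3.72 | 24 | A |
| 8 | 0 | 1 | 8 | 3 | 0 | 3.03 | 23 | A |
| 9 | 0 | 0 | 10 | 3 | 1 | 3.29 | 29 | A |
| 10 | 0 | 1 | 10 | 2 | 0 | 3.64 | 22 | A |
| 11 | 0 | 1 | 10 | 3 | 1 | 3.28 | 27 | A |
| 12 | 0 | 1 | 7 | 5 | 0 | 2.57 | 30 | B |
| 13 | 0 | 1 | 10 | 3 | 1 | 3.15 | 33 | B |
| 14 | 0 | 1 | 8 | 3 | 1 | 15.75 | 24 | B |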

| No | Variable Name | Information | Value |
| --- | --- | --- | --- |
| 1 | Marital Status | Marital status when entering the magister program | 0 = not married, 1 = married |
| 2 | Gender | Gender of the magister student | 0 = woman, 1 = man |
| 3 | Scholar University | University rating on a scale of 1-10, from the Webometrics survey | 10 = first group of 35 top-ranked universities, 9 = second group of 35, etc. |
| 4 | Period of Study | Time period for study | Numeric (2-4 years) |
| 5 | Work Status | Work status when entering the magister program | 0 = not working, 1 = working |
| 6 | Scholar GPA | GPA at scholar level | Numeric (0-4) |
| 7 | Age (new student) | Age when entering the magister program | Numeric |

Dataset information: 7 features and 104 records

Class
• A: GPA > 3.5
• B: GPA <= 3.5

Tools
Weka

Result

Multilayer Perceptron

| Metric | Using Training Set | Cross Validation | Percentage Split |
| --- | --- | --- | --- |
| Correctly Classified Instances | 84.62% | 73.08% | 71.43% |
| Incorrectly Classified Instances | 15.38% | 26.92% | 28.57% |
| Kappa Statistic | 0.5649 | 0.2877 | 0.3243 |
| Mean Absolute Error | 0.2701 | 0.3486 | 0.3971 |
| Root Mean Squared Error | 0.354 | 0.4764 | 0.5047 |
| Relative Absolute Error | 65.53% | 84.51% | 90.96% |

Logistic Regression

| Metric | Using Training Set | Cross Validation | Percentage Split |
| --- | --- | --- | --- |
| Correctly Classified Instances | 77.88% | 75.96% | 65.71% |
| Incorrectly Classified Instances | 22.12% | 24.04% | 34.29% |
| Kappa Statistic | 0.3197 | 0.2786 | 0.1286 |
| Mean Absolute Error | 0.3343 | 0.3658 | 0.3988 |
| Root Mean Squared Error | 0.408 | 0.4447 | 0.452 |
| Relative Absolute Error | 81.10% | 88.68% | 91.35% |

Naïve Bayes

| Metric | Using Training Set | Cross Validation | Percentage Split |
| --- | --- | --- | --- |
| Correctly Classified Instances | 76.92% | 75.96% | 68.57% |
| Incorrectly Classified Instances | 23.08% | 24.04% | 31.43% |
| Kappa Statistic | 0.2811 | 0.2415 | 0.2159 |
| Mean Absolute Error | 0.2585 | 0.2737 | 0.32 |
| Root Mean Squared Error | 0.4504 | 0.4611 | 0.4952 |
| Relative Absolute Error | 62.73% | 66.35% | 73.28% |

Random Forest

| Metric | Using Training Set | Cross Validation | Percentage Split |
| --- | --- | --- | --- |
| Correctly Classified Instances | 100% | 68.27% | 68.57% |
| Incorrectly Classified Instances | 0% | 42.31% | 31.43% |
| Kappa Statistic | 1 | -0.0732 | 0.2159 |
| Mean Absolute Error | 0.0933 | 0.3952 | 0.3686 |
| Root Mean Squared Error | 0.1431 | 0.5162 | 0.5235 |
| Relative Absolute Error | 22.63% | 95.82% | 84.41% |
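To make the accuracy and Kappa Statistic rows easier to read, the sketch below shows how both can be derived from a confusion matrix. The matrix is hypothetical and not taken from the study; the function name is also introduced here for illustration.

```python
def accuracy_and_kappa(confusion):
    # confusion[i][j] = number of instances of actual class i predicted as class j.
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, computed from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical confusion matrix, not taken from the study.
acc, kappa = accuracy_and_kappa([[40, 10], [5, 45]])
print(acc, kappa)
```

Kappa corrects accuracy for chance agreement. A value near zero or below, such as the cross-validation kappa reported for Random Forest, means the classifier does no better than guessing from the class proportions.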

Discussion and Future Work

Discussion

• The composition of the training data influences classifier performance.
• Random Forest overfits on some datasets.
• Random Forest is not more accurate than the other methods on datasets with few features.
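The overfitting observation can be checked directly against the reported accuracies. The sketch below assumes that the gap between training-set and cross-validation accuracy is a reasonable rough indicator of overfitting; the variable names are illustrative.

```python
# Accuracy (%) reported in the Result section: (training set, cross validation).
reported = {
    "Multilayer Perceptron": (84.62, 73.08),
    "Logistic Regression": (77.88, 75.96),
    "Naive Bayes": (76.92, 75.96),
    "Random Forest": (100.0, 68.27),
}

# Gap between training-set and cross-validation accuracy, in percentage points.
gaps = {name: round(train - cv, 2) for name, (train, cv) in reported.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap} points")
```

Random Forest shows the largest gap by far, while Logistic Regression and Naïve Bayes lose less than two points. This is consistent with the overfitting noted above.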

Future Works

• Discard unimportant dataset attributes using Principal Component Analysis.
• Find a method to solve the overfitting problem of the Random Forest Decision Tree.

Thank you
