A Study of Academic Performance using Random Forest, Artificial Neural Network, Naïve Bayesian and Logistic Regression

Nurissaidah Ulinnuha


Page 1: Nurissaidah Ulinnuha

A Study of Academic Performance using Random Forest, Artificial Neural Network, Naïve Bayesian and Logistic Regression

Nurissaidah Ulinnuha

Page 2: Nurissaidah Ulinnuha

Introduction

Student academic performance (1990-2010)
• Logistic Regression
• Naïve Bayesian
• Artificial Neural Network

Student academic performance (2011)
• Random Forest Decision Tree

Page 3: Nurissaidah Ulinnuha

LITERATURE REVIEW

Page 4: Nurissaidah Ulinnuha

Artificial Neural Network

Superiority

• ANNs are useful in several application areas, including pattern recognition, classification, forecasting, and process control.
• ANNs are robust to noisy datasets.

Page 5: Nurissaidah Ulinnuha

Limitation

• ANNs do not have parametric statistical properties (e.g. they do not have individual coefficient or model significance tests based on the t and F distributions).
• An ANN may converge to a local instead of the global minimum, thereby providing a non-optimal fit to the data.
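The local-minimum limitation can be illustrated without a neural network at all. The following is a minimal sketch, assuming a toy one-dimensional loss `f` chosen here only because it has two minima; the function, learning rate, and starting points are illustrative assumptions and are not taken from the study.

```python
def f(x):
    """Toy 1-D loss with two minima; stands in for a network's error surface."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Analytic derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_right = gradient_descent(2.0)    # settles in the shallower basin
x_left = gradient_descent(-2.0)    # settles in the deeper basin
print(round(x_right, 2), round(f(x_right), 2))
print(round(x_left, 2), round(f(x_left), 2))
```

Both runs use the same update rule; only the starting point differs, which is why the weight initialisation of an ANN can change the fit it ends up with.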

Page 6: Nurissaidah Ulinnuha

Logistic Regression

Superiority

• LR provides information about the significance of each predictor.
• LR makes no assumption about the normality of the data.

Page 7: Nurissaidah Ulinnuha

Limitation

• Standard (binary) LR only works with a binary criterion variable.
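The binary restriction follows from what the model estimates: the probability of one of two outcomes. A sketch in standard textbook notation, where the coefficients \beta_j and the predictor count p are symbols introduced here and do not appear in the slides:

```latex
P(y = 1 \mid x) = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_j\right)\right)},
\qquad
P(y = 0 \mid x) = 1 - P(y = 1 \mid x)
```

The significance information listed under Superiority comes from tests on the individual \beta_j.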

Page 8: Nurissaidah Ulinnuha

Naïve Bayesian

Superiority

• Naïve Bayesian requires less training data than other classification methods.

Limitation

• The predictors should satisfy the independence assumption.
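The independence assumption is what lets the classifier factor the joint likelihood into one term per predictor. In standard notation (the symbols C for the class and x_1, ..., x_p for the predictors are introduced here for illustration):

```latex
P(C \mid x_1, \dots, x_p) \;\propto\; P(C) \prod_{j=1}^{p} P(x_j \mid C)
```

Each factor P(x_j \mid C) is estimated separately, which is consistent with the smaller training-data requirement noted above.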

Page 9: Nurissaidah Ulinnuha

Random Forest Decision Tree

Superiority

• Random Forest runs efficiently on large databases.
• Random Forest can handle thousands of input variables without variable deletion.
• Random Forest gives estimates of which variables are important in the classification.
• Random Forest has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
• Random Forest can perform classification, clustering, and outlier detection.

Page 10: Nurissaidah Ulinnuha

Limitation

• Random Forests have been observed to overfit for some datasets with noisy classification/regression tasks.
• Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret.

Page 11: Nurissaidah Ulinnuha

Prior Research

Page 12: Nurissaidah Ulinnuha

Mukta Paliwal and Usha Kumar

Title

Academic performance of business school graduates using neural network and statistical techniques.

Overview

This research compares ANN with several statistical techniques. Paliwal and Kumar conclude that neural network techniques outperform regression analysis on prediction problems, whereas neural network performance is comparable to logistic regression and discriminant analysis on classification problems.

Page 13: Nurissaidah Ulinnuha

J. Zimmerman

Title

Predicting graduate-level performance from undergraduate achievements

Result

This research predicts graduate-level performance using a random forest decision tree. It shows that random forest can not only perform classification but also explain the significance of each variable.

Page 14: Nurissaidah Ulinnuha

Data and Methods

Page 15: Nurissaidah Ulinnuha

Raw data

Graduation data of Informatics Engineering magister students, ITS (2008-2011)

Student ID | Date of birth | Magister GPA | Married status | Gender | Years of graduation | Scholar GPA | Scholar university | Work status
51072010xx | 5/28/1972 | 3.88 | T | L | 99 | 2.72 | ITS | FTIF-ITS
510720101x | 12/17/1977 | 3.80 | T | L | 100 | 3.09 | ITS | T.INFORMATIKA ITS
510720101x | 8/3/1979 | 3.75 | T | P | 99 | 3.19 | ITS | PPNS-ITS
510720101x | 1/16/1976 | 3.31 | T | L | 98 | 3.38 | Universitas Dr. Soetomo | Politeknik Negeri Samarinda
510720101x | 8/28/1979 | 3.50 | T | L | 99 | 3.37 | universitas hasanuddin | universitas tadulako
510720101x | 8/11/1973 | 3.46 | T | L | 99 | 3.15 | STIKOM | STIKOM
5107201x0x | 7/16/1982 | 3.59 | T | L | 101 | 3 | Institut Africain de Managemen | Institut Africain de Managemen

Page 16: Nurissaidah Ulinnuha

Preprocess (165 records)

• Filter out records with null values.
• Convert all attributes to numeric values.
• Convert the class attribute to a nominal value.
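The three preprocessing steps can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure used in the study (which was carried out with Weka): the record layout, the field names, and the `GENDER_CODES` mapping are hypothetical, and the class rule (A when the magister GPA is above 3.5, otherwise B) is the one given on the Class slide.

```python
# Illustrative records only; field names and values are hypothetical.
raw_records = [
    {"gender": "L", "scholar_gpa": "2.72", "magister_gpa": "3.88"},
    {"gender": "P", "scholar_gpa": "3.19", "magister_gpa": "3.75"},
    {"gender": "L", "scholar_gpa": None, "magister_gpa": "3.46"},  # null value
    {"gender": "L", "scholar_gpa": "3.38", "magister_gpa": "3.31"},
]

GENDER_CODES = {"P": 0, "L": 1}  # assumed mapping: P = woman, L = man


def preprocess(records):
    """Drop records with nulls, encode attributes as numbers, label the class."""
    cleaned = []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue  # step 1: filter out records with null values
        cleaned.append({
            "gender": GENDER_CODES[rec["gender"]],      # step 2: numeric encoding
            "scholar_gpa": float(rec["scholar_gpa"]),
            # step 3: nominal class derived from the magister GPA
            "class": "A" if float(rec["magister_gpa"]) > 3.5 else "B",
        })
    return cleaned


dataset = preprocess(raw_records)
print(len(dataset), dataset[0])
```

The record with the missing GPA is dropped, so four input records yield three output records.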

Page 17: Nurissaidah Ulinnuha

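Dataset

Graduation data of Informatics Engineering magister students, ITS (2008-2011)

No | Marital Status | Gender | Scholar university | Period of study | Work status | Scholar GPA | Age (new student) | Class (Magister GPA)
1 | 0 | 1 | 10 | 4 | 1 | 3.32 | 31 | B
2 | 0 | 1 | 10 | 3 | 1 | 3.57 | 28 | B
3 | 0 | 1 | 9 | 2 | 1 | 3.13 | 35 | A
4 | 0 | 1 | 9 | 2 | 1 | 3.2 | 30 | A
5 | 0 | 1 | 9 | 4 | 1 | 2.96 | 28 | A
6 | 0 | 1 | 9 | 2 | 1 | 3.01 | 30 | A
7 | 0 | 1 | 10 | 2 | 0 | 3.72 | 24 | A
8 | 0 | 1 | 8 | 3 | 0 | 3.03 | 23 | A
9 | 0 | 0 | 10 | 3 | 1 | 3.29 | 29 | A
10 | 0 | 1 | 10 | 2 | 0 | 3.64 | 22 | A
11 | 0 | 1 | 10 | 3 | 1 | 3.28 | 27 | A
12 | 0 | 1 | 7 | 5 | 0 | 2.57 | 30 | B
13 | 0 | 1 | 10 | 3 | 1 | 3.15 | 33 | B
14 | 0 | 1 | 8 | 3 | 1 | 15.75 | 24 | B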

Page 18: Nurissaidah Ulinnuha

No | Variable Name | Information | Value
1 | Marital Status | Marital status when entering the magister program | 0 = not married, 1 = married
2 | Gender | Gender of the magister student | 0 = woman, 1 = man
3 | Scholar University | Rating of the undergraduate university on a scale of 1-10, from the Webometrics survey | 10 = first 35 in the ranking, 9 = next 35, etc.
4 | Period of Study | Duration of study | Numeric (2-4 years)
5 | Work Status | Work status when entering the magister program | 0 = not working, 1 = working
6 | Scholar GPA | GPA at the undergraduate (scholar) level | Numeric (0-4)
7 | Age (new student) | Age when entering the magister program | Numeric

Dataset information: 7 features and 104 records

Page 19: Nurissaidah Ulinnuha

Class

• A: GPA > 3.5
• B: GPA <= 3.5

Tools

• Weka

Page 20: Nurissaidah Ulinnuha

Result

Page 21: Nurissaidah Ulinnuha

Multilayer Perceptron | Using Training Set | Cross Validation | Percentage Split
Correctly Classified Instances | 84.62% | 73.08% | 71.43%
Incorrectly Classified Instances | 15.38% | 26.92% | 28.57%
Kappa Statistic | 0.5649 | 0.2877 | 0.3243
Mean Absolute Error | 0.2701 | 0.3486 | 0.3971
Root Mean Squared Error | 0.354 | 0.4764 | 0.5047
Relative Absolute Error | 65.53% | 84.51% | 90.96%

Logistic Regression | Using Training Set | Cross Validation | Percentage Split
Correctly Classified Instances | 77.88% | 75.96% | 65.71%
Incorrectly Classified Instances | 22.12% | 24.04% | 34.29%
Kappa Statistic | 0.3197 | 0.2786 | 0.1286
Mean Absolute Error | 0.3343 | 0.3658 | 0.3988
Root Mean Squared Error | 0.408 | 0.4447 | 0.452
Relative Absolute Error | 81.10% | 88.68% | 91.35%
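The Kappa Statistic rows report Cohen's kappa, which corrects the observed accuracy for the agreement expected by chance. A minimal sketch of the calculation, assuming a square confusion matrix with actual classes in rows and predicted classes in columns; the example matrix is hypothetical and is not taken from the study, whose confusion matrices are not shown in the slides.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows = actual, cols = predicted)."""
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement: fraction of instances on the diagonal.
    p_observed = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: sum over classes of (actual total * predicted total) / n^2.
    p_expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical matrix for 104 instances (classes A and B); not from the study.
example = [[60, 10],
           [15, 19]]
print(round(cohen_kappa(example), 4))
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.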

Page 22: Nurissaidah Ulinnuha

Naïve Bayes | Using Training Set | Cross Validation | Percentage Split
Correctly Classified Instances | 76.92% | 75.96% | 68.57%
Incorrectly Classified Instances | 23.08% | 24.04% | 31.43%
Kappa Statistic | 0.2811 | 0.2415 | 0.2159
Mean Absolute Error | 0.2585 | 0.2737 | 0.32
Root Mean Squared Error | 0.4504 | 0.4611 | 0.4952
Relative Absolute Error | 62.73% | 66.35% | 73.28%

Random Forest | Using Training Set | Cross Validation | Percentage Split
Correctly Classified Instances | 100% | 68.27% | 68.57%
Incorrectly Classified Instances | 0% | 42.31% | 31.43%
Kappa Statistic | 1 | -0.0732 | 0.2159
Mean Absolute Error | 0.0933 | 0.3952 | 0.3686
Root Mean Squared Error | 0.1431 | 0.5162 | 0.5235
Relative Absolute Error | 22.63% | 95.82% | 84.41%
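The percentages in the result tables can be converted back into instance counts and checked for internal consistency. The sketch below copies the correct and incorrect percentages from the four result tables and uses the 104 records stated on the dataset slide; the helper names `correct_count` and `inconsistent_cells` are introduced here for illustration.

```python
N_TOTAL = 104  # records in the dataset, as stated on the dataset slide

# (correct %, incorrect %) per evaluation mode, copied from the result tables.
MODES = ("training set", "cross validation", "percentage split")
results = {
    "Multilayer Perceptron": [(84.62, 15.38), (73.08, 26.92), (71.43, 28.57)],
    "Logistic Regression": [(77.88, 22.12), (75.96, 24.04), (65.71, 34.29)],
    "Naive Bayes": [(76.92, 23.08), (75.96, 24.04), (68.57, 31.43)],
    "Random Forest": [(100.0, 0.0), (68.27, 42.31), (68.57, 31.43)],
}


def correct_count(percent, n):
    """Number of correctly classified instances implied by a percentage."""
    return round(percent / 100 * n)


def inconsistent_cells(table, tol=0.01):
    """Return (classifier, mode) pairs whose two percentages do not sum to 100."""
    return [
        (name, MODES[i])
        for name, rows in table.items()
        for i, (ok, bad) in enumerate(rows)
        if abs(ok + bad - 100.0) > tol
    ]


print(correct_count(84.62, N_TOTAL))  # Multilayer Perceptron on the training set
print(inconsistent_cells(results))
```

On the figures as given, only the Random Forest cross-validation cell fails the check (68.27% and 42.31% do not sum to 100%); the slides do not indicate which of the two figures is the intended one.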

Page 23: Nurissaidah Ulinnuha

Discussion and Future Work

Page 24: Nurissaidah Ulinnuha

Discussion

• The composition of the training data influences classifier performance.
• Random Forest overfits on some datasets.
• Random Forest is no more accurate than the other methods on a dataset with few features.
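The first point refers to the three evaluation modes in the result tables, which differ only in which records are used for training and which for testing. A minimal sketch of the two held-out modes, assuming a 66% training fraction and 10 folds; the slides do not state either setting, and the seed and function names are illustrative.

```python
import random


def percentage_split(n, train_fraction=0.66, seed=1):
    """Shuffle indices 0..n-1 and split them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold(n, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train, test = percentage_split(104)
print(len(train), len(test))
print([len(test_fold) for _, test_fold in k_fold(104)])
```

With 104 records, a 66% training fraction leaves 35 test records, which is consistent with the percentage-split accuracies in the result tables (for example, 71.43% is approximately 25 out of 35).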

Page 25: Nurissaidah Ulinnuha

Future Works

• Discard unimportant dataset attributes using Principal Component Analysis.
• Find a method to address the overfitting problem of Random Forest Decision Tree.

Page 26: Nurissaidah Ulinnuha

Thank you