A Study of Academic Performance using Random Forest,
Artificial Neural Network, Naïve Bayesian and Logistic Regression
Nurissaidah Ulinnuha
Introduction
Student academic performance (1990-2010)
• Logistic Regression
• Naïve Bayesian
• Artificial Neural Network
Student academic performance (2011)
• Random Forest Decision Tree
Artificial Neural Network
Superiority
• ANN is useful in several application areas, including pattern recognition, classification, forecasting, process control, etc.
• Robust to noisy datasets
Limitation
• ANNs do not have parametric statistical properties (e.g. they do not have individual coefficient or model significance tests based on the t and F distributions).
• ANN may converge to local instead of global minima, thereby providing non-optimal data fits.
Logistic Regression
Superiority
• LR provides information about the significance of each predictor.
• It makes no assumption about the normality of the dataset.
Naïve Bayesian
Superiority
• Naïve Bayesian requires less training data than other classification methods.
Limitation
• The dataset should satisfy the independence assumption.
Random Forest Decision Tree
Superiority
• Random Forest runs efficiently on large databases.
• Random Forest can handle thousands of input variables without variable deletion.
• Random Forest gives estimates of which variables are important in the classification.
• Random Forest has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
• Random Forest can perform classification, clustering and outlier detection.
Limitation
• Random Forests have been observed to overfit for some datasets with noisy classification/regression tasks.
• Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret.
Mukta Paliwal and Usha Kumar
Title
Academic performance of business school graduates using neural network and statistical techniques.
Overview
This research compares ANN with several statistical techniques. Paliwal and Kumar conclude that neural networks outperform regression analysis on prediction problems, whereas their performance is comparable to logistic regression and discriminant analysis on classification problems.
J. Zimmerman
Title
Predicting graduate-level performance from undergraduate achievements
Result
This research predicts graduate-level performance using a random forest decision tree. It shows that a random forest can not only perform classification but also indicate the importance of each variable.
Raw data
GRADUATION DATA OF INFORMATICS ENGINEERING MAGISTER STUDENTS, ITS (2008-2011)
Student ID | Date of birth | Magister GPA | Married status | Gender | Years of graduation | Scholar GPA | Scholar university | Work status
51072010xx | 5/28/1972 | 3.88 | T | L | 99 | 2.72 | ITS | FTIF-ITS
510720101x | 12/17/1977 | 3.80 | T | L | 100 | 3.09 | ITS | T.INFORMATIKA ITS
510720101x | 8/3/1979 | 3.75 | T | P | 99 | 3.19 | ITS | PPNS-ITS
510720101x | 1/16/1976 | 3.31 | T | L | 98 | 3.38 | Universitas Dr. Soetomo | Politeknik Negeri Samarinda
510720101x | 8/28/1979 | 3.50 | T | L | 99 | 3.37 | Universitas Hasanuddin | Universitas Tadulako
510720101x | 8/11/1973 | 3.46 | T | L | 99 | 3.15 | STIKOM | STIKOM
5107201x0x | 7/16/1982 | 3.59 | T | L | 101 | 3 | Institut Africain de Managemen | Institut Africain de Managemen
Preprocessing (165 records)
• Filter out records with null values
• Convert all attributes to numeric values
• Convert the class attribute to a nominal value
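The three preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the author's actual pipeline: the field names, the meaning of the codes `T`, `Y`, `L` and `P`, and the GPA cutoff used to form the class label are all assumptions, since the slides do not state them.

```python
# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"married": "T", "gender": "L", "scholar_gpa": "2.72", "magister_gpa": "3.88"},
    {"married": "T", "gender": "P", "scholar_gpa": "3.19", "magister_gpa": "3.75"},
    {"married": "T", "gender": "L", "scholar_gpa": None, "magister_gpa": "3.31"},
]

# Assumed code meanings (not confirmed by the slides).
MARRIED = {"T": 0, "Y": 1}  # assumed: T = not married, Y = married
GENDER = {"P": 0, "L": 1}   # assumed: P = woman, L = man


def preprocess(rows, class_cutoff=3.5):
    """Drop rows with nulls, encode attributes as numbers, label the class.

    class_cutoff is a placeholder: the slides do not say how Magister GPA
    is mapped to the classes A and B.
    """
    cleaned = []
    for row in rows:
        # Step 1: filter out records that contain a null value.
        if any(value is None or value == "" for value in row.values()):
            continue
        cleaned.append({
            # Step 2: convert every attribute to a numeric value.
            "married": MARRIED[row["married"]],
            "gender": GENDER[row["gender"]],
            "scholar_gpa": float(row["scholar_gpa"]),
            # Step 3: convert the class attribute to a nominal value.
            "class": "A" if float(row["magister_gpa"]) >= class_cutoff else "B",
        })
    return cleaned


records = preprocess(RAW)
```

Here the third raw record is dropped because its `scholar_gpa` is missing, which mirrors how filtering reduces the number of usable records.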
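Dataset
GRADUATION DATA OF INFORMATICS ENGINEERING MAGISTER STUDENTS, ITS (2008-2011)
No | Marital Status | Gender | Scholar university | Period of study | Work status | Scholar GPA | Age (new student) | Class (Magister GPA)
1 | 0 | 1 | 10 | 4 | 1 | 3.32 | 31 | B
2 | 0 | 1 | 10 | 3 | 1 | 3.57 | 28 | B
3 | 0 | 1 | 9 | 2 | 1 | 3.13 | 35 | A
4 | 0 | 1 | 9 | 2 | 1 | 3.2 | 30 | A
5 | 0 | 1 | 9 | 4 | 1 | 2.96 | 28 | A
6 | 0 | 1 | 9 | 2 | 1 | 3.01 | 30 | A
7 | 0 | 1 | 10 | 2 | 0 | 3.72 | 24 | A
8 | 0 | 1 | 8 | 3 | 0 | 3.03 | 23 | A
9 | 0 | 0 | 10 | 3 | 1 | 3.29 | 29 | A
10 | 0 | 1 | 10 | 2 | 0 | 3.64 | 22 | A
11 | 0 | 1 | 10 | 3 | 1 | 3.28 | 27 | A
12 | 0 | 1 | 7 | 5 | 0 | 2.57 | 30 | B
13 | 0 | 1 | 10 | 3 | 1 | 3.15 | 33 | B
14 | 0 | 1 | 8 | 3 | 1 | 15.75 | 24 | B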
No | Variable Name | Information | Value
1 | Marital Status | Marital status when entering the magister program | 0 = not married, 1 = married
2 | Gender | Gender of the magister student | 0 = woman, 1 = man
3 | Scholar University | University rating on a scale of 1-10, based on the Webometrics survey | 10 = first 35 ranks, 9 = second 35 ranks, etc.
4 | Period of Study | Length of study | Numeric (2-4 years)
5 | Work Status | Work status when entering the magister program | 0 = not working, 1 = working
6 | Scholar GPA | GPA from the undergraduate (scholar) degree | Numeric (0-4)
7 | Age (new student) | Age when entering the magister program | Numeric
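The Scholar University rating can be read as a banding of the ranking position: the first 35 ranks map to 10, the next 35 to 9, and so on. A minimal sketch of that reading follows. The slides only give the first two bands and "etc.", so the continuation of the pattern and the floor of 1 for low-ranked universities are assumptions, and `university_rating` is a hypothetical helper name.

```python
def university_rating(rank, band_size=35, top=10, bottom=1):
    """Map a 1-based ranking position to a 1-10 rating.

    Ranks 1-35 give 10 and ranks 36-70 give 9, as stated in the slides.
    Extending the same band size further down, and clamping at `bottom`,
    is an assumption.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return max(bottom, top - (rank - 1) // band_size)


ratings = [university_rating(r) for r in (1, 35, 36, 70, 71)]
```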
Dataset information: 7 features and 104 records
Multilayer Perceptron
Metric | Using Training Set | Cross Validation | Percentage Split
Correctly Classified Instances | 84.62% | 73.08% | 71.43%
Incorrectly Classified Instances | 15.38% | 26.92% | 28.57%
Kappa Statistic | 0.5649 | 0.2877 | 0.3243
Mean Absolute Error | 0.2701 | 0.3486 | 0.3971
Root Mean Squared Error | 0.354 | 0.4764 | 0.5047
Relative Absolute Error | 65.53% | 84.51% | 90.96%

Logistic Regression
Metric | Using Training Set | Cross Validation | Percentage Split
Correctly Classified Instances | 77.88% | 75.96% | 65.71%
Incorrectly Classified Instances | 22.12% | 24.04% | 34.29%
Kappa Statistic | 0.3197 | 0.2786 | 0.1286
Mean Absolute Error | 0.3343 | 0.3658 | 0.3988
Root Mean Squared Error | 0.408 | 0.4447 | 0.452
Relative Absolute Error | 81.10% | 88.68% | 91.35%

Naïve Bayes
Metric | Using Training Set | Cross Validation | Percentage Split
Correctly Classified Instances | 76.92% | 75.96% | 68.57%
Incorrectly Classified Instances | 23.08% | 24.04% | 31.43%
Kappa Statistic | 0.2811 | 0.2415 | 0.2159
Mean Absolute Error | 0.2585 | 0.2737 | 0.32
Root Mean Squared Error | 0.4504 | 0.4611 | 0.4952
Relative Absolute Error | 62.73% | 66.35% | 73.28%

Random Forest
Metric | Using Training Set | Cross Validation | Percentage Split
Correctly Classified Instances | 100% | 68.27% | 68.57%
Incorrectly Classified Instances | 0% | 42.31% | 31.43%
Kappa Statistic | 1 | -0.0732 | 0.2159
Mean Absolute Error | 0.0933 | 0.3952 | 0.3686
Root Mean Squared Error | 0.1431 | 0.5162 | 0.5235
Relative Absolute Error | 22.63% | 95.82% | 84.41%
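For readers unfamiliar with the Kappa statistic reported in these results, the sketch below shows how accuracy and Cohen's kappa are derived from a confusion matrix. The matrix used here is hypothetical (the slides do not report confusion matrices); only the total of 104 records is taken from the dataset description.

```python
def accuracy(cm):
    """Fraction of correctly classified instances (diagonal / total)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's kappa: agreement beyond what chance alone would give."""
    total = sum(sum(row) for row in cm)
    observed = accuracy(cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(row[j] for row in cm) for j in range(len(cm))]
    # Agreement expected if predictions were independent of the true class.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    if expected == 1:
        return 0.0  # degenerate case: every instance falls in one cell
    return (observed - expected) / (1 - expected)


# Hypothetical 2x2 confusion matrix: rows = true class (A, B),
# columns = predicted class (A, B). Values are illustrative only.
example_cm = [[60, 8],
              [16, 20]]

example_accuracy = accuracy(example_cm)
example_kappa = cohen_kappa(example_cm)
```

A kappa near zero or below it, as in the Random Forest cross-validation column, means the classifier agrees with the true labels no better than chance, even when raw accuracy looks moderate because one class dominates.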
Discussion
• The composition of the training data influences the performance of each classifier.
• Random Forest overfits on some datasets.
• Random Forest is not more accurate than the other methods on a dataset with few features.
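The three result columns (Using Training Set, Cross Validation, Percentage Split) differ only in which records are used for training and which for testing. The sketch below illustrates the three partitioning schemes on record indices. The 10 folds, the 66% training fraction and the shuffling seed are assumptions, since the slides do not state them; a 66% split of 104 records does leave 35 test records, which is consistent with the reported percentage-split accuracies (for example 71.43% is 25 of 35).

```python
import random


def training_set_eval(n):
    """Train and test on the same records (optimistic estimate)."""
    indices = list(range(n))
    return indices, indices


def percentage_split(n, train_fraction=0.66, seed=0):
    """Shuffle once, then hold out the tail as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold(n, k=10, seed=0):
    """Yield (train, test) index lists; each record is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


N_RECORDS = 104  # dataset size given in the slides
split_train, split_test = percentage_split(N_RECORDS)
cv_folds = list(k_fold(N_RECORDS))
```

Evaluating on the training set lets a flexible model such as Random Forest score 100% by memorizing the records, which is why the held-out schemes give a more realistic picture.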
Future Work
• Discard unimportant attributes from the dataset using Principal Component Analysis.
• Find a method to address the overfitting problem of Random Forest Decision Tree.