8/2/2019 Data Mining Repport
1/19
Department of Computer Science and Electronic Engineering, University of Essex
Report of Data Mining
Data mining classification techniques implemented in Medical Decision Support
System
CE802 Machine Learning and Data Mining
Name: Bouchou Mohamed Oussama
E-Mail: [email protected]
Supervisor: Paul Scott
Date: 15 January 2012
Abstract

This paper presents the analysis and evaluation of several data mining classification algorithms that are mainly applied in modern medical decision support systems (MDSS). Medical institutions store very large quantities of data in order to extract relevant information, and this information might be precious for the future of science and medicine as well. To be able to extract the relevant knowledge, suitable algorithms are needed. Three algorithms were therefore chosen for this experiment; WEKA was the program used to conduct the analyses, and four medical datasets were employed. In fact it is hard to name a single algorithm as the most suitable for this kind of task, and there was no big gap between the results. However, the experiment reveals that Naïve Bayes is the best classifier for such medical data.
mailto:[email protected]:[email protected]:[email protected]:[email protected]8/2/2019 Data Mining Repport
2/19
Data Mining report
2
TABLE OF CONTENTS:

ABSTRACT ................................................................ 1
1- INTRODUCTION ......................................................... 3
2- THE OBJECTIVES OF THE EXPERIMENT ..................................... 3
3- DATA MINING ALGORITHM CLASSIFIERS .................................... 4
  3-1 DECISION TREE INDUCTION ........................................... 4
  3-2 MULTILAYER PERCEPTRON (NEURAL NETWORKS) ........................... 4
  3-3 NAÏVE BAYES ....................................................... 4
4- DESCRIPTION OF THE DATASETS .......................................... 5
  4-1 DATABASES DESCRIPTION ............................................. 5
    4-1-1 Breast cancer database ........................................ 5
    4-1-2 Hepatitis database ............................................ 6
    4-1-3 Heart disease database ........................................ 6
    4-1-4 Diabetes database ............................................. 7
5- THE EXPERIMENT ....................................................... 8
  5-1 WEKA .............................................................. 8
  5-2 METHOD FOR EVALUATING THE ALGORITHMS .............................. 9
  5-3 THE APPLICATION OF THE ALGORITHMS TO THE DATASETS ................ 10
    5-3-1 Breast cancer database ....................................... 10
    5-3-2 Hepatitis database ........................................... 10
    5-3-3 Heart disease database ....................................... 11
    5-3-4 Diabetes database ............................................ 11
  5-4 THE MEDICAL PREDICTION ........................................... 12
  5-5 THE COMPARISON OF THE DATA MINING ALGORITHMS APPLIED TO MEDICAL DATABASES ... 13
CONCLUSION ............................................................. 14
REFERENCES ............................................................. 15
APPENDIX ............................................................... 16
1- Introduction

Healthcare is one of the richest sectors when it comes to collecting information. Medical information, data and knowledge are increasing incredibly over time; it has been estimated that an acute care facility may generate five terabytes of data a year [1]. In fact, it is crucial, and far from trivial, to extract useful knowledge from this huge quantity of data. The collected data is stored in databases or even in data warehouses, and these database systems differ from one to another. In the last few decades a new type of medical system has emerged in the domain of medicine [2]. The data stored in these medical systems might contain precious knowledge hidden within it, so experiments are needed in order to process the data and retrieve useful knowledge.
It is undeniable that human decision making is almost always optimal when only a small amount of data has to be processed, but it becomes hard and inexact when a big amount of data has to be processed.
The situation described above has pushed computer science engineers and medical staff to contribute and work together. The objective of this collaboration is to develop the most suitable methods for data processing, which will allow us to discover non-trivial rules. This results in improving the process of diagnosis and treatment, in addition to reducing the risk of medical mistakes.
This experiment aims to identify and evaluate the most common data mining algorithms implemented in modern medical decision support systems (MDSS) [3]. A lot of experiments have been done in this field, but they were assessed using different measures on different datasets, which makes a direct comparison between algorithms almost impossible. This paper contrasts and compares three of the most common data mining classification algorithms (Naïve Bayes, multilayer perceptron, decision tree induction). All the conditions and configurations were prepared to make sure that the experiment was conducted under the same conditions for each algorithm.
2- The objectives of the experiment

The main objective of this experiment is to evaluate three selected data mining classification algorithms which are commonly implemented in medical decision support systems. After obtaining the results, it is important to compare the performances of the algorithms and to identify the most suitable and most powerful one for the extraction of knowledge from medical data.
The rest of this report will:
- give a partial definition of the algorithms used, in addition to a presentation of the datasets;
- conduct the experiment under WEKA;
- compare the performances of the algorithms and draw a conclusion.
3- Data mining algorithm classifiers

In medicine, data mining is one of the tools used to search for valuable hidden patterns, thereby giving a more precise diagnosis as well as saving precious time. This part describes the three selected algorithms, which are commonly used by experts in such an experiment.
3-1 Decision Tree Induction

Decision trees are one of the most frequently used techniques of data analysis. The advantages of this method are unquestionable: decision trees are, among other things, easy to visualize and understand, and resistant to noise in the data [4]. Commonly, decision trees are used to classify records into a specific class. Moreover, they are applicable in both regression and association tasks.
Decision trees have been applied successfully in medicine; for example, they have been used to classify prostate cancer or breast cancer.
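To make the idea concrete, the following is a minimal, hand-written sketch of decision-tree-style classification. It is not the C4.5 algorithm used later in the experiment, and the attribute names and thresholds are purely illustrative; a real tree would learn them from the data.

```python
# Minimal illustration of how a decision tree routes a record through
# attribute tests until a leaf class is reached. The thresholds below are
# hypothetical, NOT values learned by C4.5 from the breast cancer data.

def classify(record):
    # Each internal "node" tests one attribute; the leaves are classes.
    if record["clump_thickness"] > 6:
        return "malignant"
    if record["uniformity_of_cell_size"] > 4:
        return "malignant"
    return "benign"

print(classify({"clump_thickness": 8, "uniformity_of_cell_size": 1}))  # malignant
print(classify({"clump_thickness": 2, "uniformity_of_cell_size": 2}))  # benign
```

The appeal noted above — easy to visualize and understand — is visible here: the classifier is just a readable chain of if-tests.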
3-2 Multilayer perceptron (Neural networks)

A neural network is a type of artificial intelligence that attempts to imitate the way a human brain works. Rather than using a digital model, in which all computations manipulate zeros and ones, a neural network works by creating connections between processing elements, the computer equivalent of neurons. The organization and weights of the connections determine the output. Neural networks are particularly effective for predicting events when they have a large database of prior examples to draw on. Strictly speaking, a neural network implies a non-digital computer, but neural networks can be simulated on digital computers.
In medical diagnosis the inputs are the symptoms of the patient; the output is the prediction of different diseases [5].
This type of artificial neural network depends on the basic unit, the perceptron. The perceptron takes an input value vector and outputs 1 if the result is greater than a predefined threshold, or -1 otherwise. The calculation is based on a linear combination of the input values.
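The perceptron rule just described can be sketched in a few lines. The weights and threshold below are illustrative placeholders, not trained values:

```python
# Sketch of a single perceptron unit: output 1 when the linear combination
# of the inputs exceeds the threshold, -1 otherwise. Weights and threshold
# here are assumed for illustration, not learned.

def perceptron(inputs, weights, threshold):
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation > threshold else -1

# Example: two symptom values with equal weights and threshold 1.0.
print(perceptron([0.8, 0.9], [1.0, 1.0], 1.0))  # 1  (0.8 + 0.9 = 1.7 > 1.0)
print(perceptron([0.2, 0.3], [1.0, 1.0], 1.0))  # -1 (0.5 <= 1.0)
```

A multilayer perceptron stacks layers of such units, feeding each layer's outputs into the next, which is what WEKA's MultilayerPerceptron trains in the experiment.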
3-3 Naïve Bayes

Naïve Bayes is a simple probabilistic classifier. It is based on an assumption of mutual independence of the attributes (independent variable, independent feature model). Usually this assumption is far from being true, and this is the reason for the "naivety" of the method [5]. The probabilities applied in the Naïve Bayes algorithm are calculated according to Bayes' rule: the probability of a hypothesis H can be calculated on the basis of the hypothesis H and the evidence E about the hypothesis, according to the following formula:

P(H|E) = P(E|H) * P(H) / P(E)

In practice, Naïve Bayes has shown that it can work efficiently in real situations; for instance, in medicine it has been used in diagnosis for the treatment of pneumonia.
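A small worked example of Bayes' rule may help. The probabilities below are made up for illustration, with H = "patient has the disease" and E = "test is positive":

```python
# Worked example of Bayes' rule, P(H|E) = P(E|H) * P(H) / P(E),
# with illustrative (invented) probabilities.

p_h = 0.01              # prior probability of the disease, P(H)
p_e_given_h = 0.9       # probability of a positive test given the disease
p_e_given_not_h = 0.05  # false-positive rate, P(E | not H)

# Total probability of the evidence E.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 4))  # 0.1538
```

Even with a 90% sensitive test, the posterior stays low because the prior is low — exactly the kind of probabilistic weighting the Naïve Bayes classifier performs per attribute.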
4- Description of the datasets

To conduct the experiment, four databases were taken into consideration. This part discusses the source of the databases in addition to their details.
The databases were taken from the well-known UCI machine learning repository. The reason for this choice is that the databases vary from one to another, as they were taken from different clinics or hospitals; this permits us to compare and evaluate the performances of the algorithms under real conditions. Another reason for choosing UCI is to let other people reproduce the same experiment and perhaps compare the results.

4-1 Databases description

Here is the detailed description of the four databases chosen for the experiment, all taken from UCI, in addition to the details of the values of every database.
4-1-1 Breast cancer database
The following web link represents the source from where the data were taken:
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
This table contains the details of the breast cancer database:

Dataset       | Number of attributes | Number of symptoms | Number of instances | Number of classes | Type of attributes | Missing values
Breast cancer | 11                   | 9                  | 699                 | 2                 | Integer            | Yes

Table 4-1: Breast cancer database details
The following table summarises the values of the database cited above:

Symptom name                | Values of the attributes
Patient id number           | e.g. 1000025
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Diagnosis (class)           | 2: benign, 4: malignant

Table 4-2: Attributes and their possible values for the Breast cancer database
4-1-2 Hepatitis database
The following web link represents the source from where the data were taken
(http://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data)
This table contains the details of the hepatitis database:

Dataset   | Number of attributes | Number of symptoms | Number of instances | Number of classes | Type of attributes | Missing values
Hepatitis | 20                   | 19                 | 155                 | 2                 | Integer, Real      | Yes

Table 4-3: Hepatitis database details
The following table summarises the values of the database cited above:

Symptom name      | Values of the attributes
Age               | 10, 20, 30, 40, 50, ...
Sex               | 1: Male, 0: Female
Steroid           | 1: Yes, 0: No
Antivirals        | 1: Yes, 0: No
Fatigue           | 1: Yes, 0: No
Malaise           | 1: Yes, 0: No
Anorexia          | 1: Yes, 0: No
Liver big         | 1: Yes, 0: No
Liver firm        | 1: Yes, 0: No
Spleen palpable   | 1: Yes, 0: No
Spiders           | 1: Yes, 0: No
Ascites           | 1: Yes, 0: No
Varices           | 1: Yes, 0: No
Bilirubin         | 0.39, 0.40, 0.41, ..., 4
Alk Phosphate     | 30, 90, 130, ..., 220, 300
SGOT              | 13, 100, 200, 400, 500
Albumin           | 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
Protime           | 10, 20, 30, 40, 50, 60, 70, 80, 90
Histology         | 1: Yes, 0: No
Diagnosis (class) | 1: Die, 2: Live

Table 4-4: Attributes and their possible values for the Hepatitis database
4-1-3 Heart disease database
The following web link represents the source from where the data were taken:
http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
This table contains the details of the Heart disease database
Dataset       | Number of attributes | Number of symptoms | Number of instances | Number of classes | Type of attributes | Missing values
Heart disease | 14                   | 13                 | 303                 | 5                 | Integer, Real      | Yes

Table 4-5: Heart disease database details
The following table summarises the values of the database cited above:

Symptom name                                       | Values of the attributes
Age                                                | 10, 20, 30, 40, 50, ...
Sex                                                | 1: Male, 0: Female
Chest pain type                                    | 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic
Resting blood pressure (mm Hg)                     | 94 - 200
Serum cholesterol (mg/dl)                          | 126 - 564
Fasting blood sugar > 120 mg/dl                    | 1: Yes, 0: No
Resting electrocardiographic results               | 0, 1, 2
Maximum heart rate achieved                        | 70 - 200
Exercise induced angina                            | 1: Yes, 0: No
ST depression induced by exercise relative to rest | 0 - 6.2
The slope of the peak exercise ST segment          | 1, 2, 3
Number of major vessels colored by fluoroscopy     | 0, 1, 2, 3
Thal                                               | 3: Normal, 6: Fixed defect, 7: Reversible defect
Diagnosis (class)                                  | 0, 1, 2, 3, 4

Table 4-6: Attributes and their possible values for the Heart disease database
4-1-4 Diabetes database
The following web link represents the source from where the data were taken:
http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data
This table contains the details of the diabetes database
Dataset Number ofattribute
Number of
symptoms
Number of
instances
Number of
classes
Type of
attributes
Missing
values
Diabetes 9 8 768 2 Integer,
Real
Yes
Table 4-7: Diabetes database details
The following table summarises the values of the database cited above:

Symptom name                     | Values of the attributes
Age                              | 10, 20, 30, 40, 50, ...
Number of times pregnant         | 0, 1, 2, 3, 4, ...
Plasma glucose concentration     | 0 - 199
Diastolic blood pressure (mm Hg) | 24 - 122
Triceps skin fold thickness (mm) | 7 - 99
Serum insulin (mu U/ml)          | 14 - 850
Body mass index (kg/m2)          | 18 - 68
Diabetes pedigree function       | 0.01, 0.02, ..., 0.4, ...
Diagnosis (class)                | 0: non-diabetic, 1: diabetic

Table 4-8: Attributes and their possible values for the Diabetes database
5- The Experiment
5-1 WEKA
WEKA is a data mining program developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. WEKA is a workbench for developing machine learning (ML) techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks, and the algorithms can be applied directly to a dataset. WEKA implements algorithms for data pre-processing, classification, regression, clustering and association rules, in addition to visualization tools. New machine learning schemes can also be developed with this package. WEKA is open source software issued under the General Public License (GPL) [6].
The decision tree algorithm is the C4.5 algorithm, which in WEKA is called J48. It has a pruning feature, used to remove sections of the classifier that are based on noisy data; this reduces the phenomenon of overfitting.
Naïve Bayes is called NaiveBayes in WEKA. It is a simple algorithm, known for its good performance, especially on large datasets, because all the attributes are treated as independent (conditional independence).
The neural network algorithm used is called MultilayerPerceptron; it consists of layers of nodes (neurons), each connected to the nodes of the next layer. It generally classifies well, especially for numeric values.
5-2 Method for evaluating the algorithms
There are several methods for analysing and extracting knowledge from data. In our case, a medical decision support system, the rates of correct and incorrect diagnoses should be analysed. This classification is taken very seriously because of the important impact an error might have. In medical records the classification is not a certainty but rather a prediction, since we do not have the ability to say with certainty whether a person will die or not, and when, or whether a person has cancer.
It is crucial that the experiment follow certain conditions and measures, because one of the objectives is to determine which parameter settings yield the best model. Another aim of the experiment is to find an optimal setting in order to maximise the performances.
During the experiment, we used the default data splits and fold cross-validation. The whole training dataset was also kept, in order to compare the algorithms in a fair environment.
Regarding the split, the more data is used for training, the better the model is built; at the same time, the more data is used for testing, the more accurate the estimate of the result. So it is important to divide the dataset into two categories, training data and testing data; the question is what the proportion of each one should be. According to the literature, the best results are obtained when 66% of the data is used to build the model and the remaining 34% is used only for testing.
For the fold cross-validation, experiments have shown that the best results are obtained when 10 folds are used (this means splitting the dataset into 10 equal pieces).
Every single algorithm was analysed under the same data split and cross-validation parameters. In order to achieve a fairer result, all the algorithms were initialised with the same learning parameters during the whole experiment.
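The two evaluation set-ups just described can be sketched in index form. This is only an illustration of the bookkeeping (contiguous indices, no shuffling), not how WEKA partitions the data internally:

```python
# Sketch of the two evaluation set-ups: a 66%/34% train/test split and a
# 10-fold partition. Indices are assigned deterministically for clarity;
# a real evaluation would shuffle the data first.

def train_test_split_indices(n, train_fraction=0.66):
    cut = int(n * train_fraction)
    return list(range(cut)), list(range(cut, n))

def k_fold_indices(n, k=10):
    # Split indices 0..n-1 into k (nearly) equal folds; each fold serves
    # once as the test set while the remaining k-1 folds form the training set.
    return [list(range(i, n, k)) for i in range(k)]

# Example with the 155 hepatitis instances:
train, test = train_test_split_indices(155)
print(len(train), len(test))  # 102 53

folds = k_fold_indices(155, 10)
print(len(folds), sum(len(f) for f in folds))  # 10 155
```

Every instance lands in exactly one fold, so over the 10 rounds each instance is tested exactly once — which is why 10-fold cross-validation gives a more reliable estimate than a single split.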
Here are some of the selected values and results which were recorded during the experiment:
- Correctly classified instances
- Incorrectly classified instances
- Mean absolute error (MAE): this error is tightly related to the classification accuracy, so if the MAE increases the prediction accuracy decreases, and vice versa
- Root mean squared error (RMSE): similar to the MAE, but the RMSE is considered a better indicator of classification quality than the MAE
- Precision: the probability of being correct, given a decision
- Time of execution of the algorithm
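For readers unfamiliar with these measures, the following toy computation shows how each is obtained (the label vectors are invented for illustration, and predictions are treated as hard 0/1 values, which is a simplification of how WEKA computes MAE/RMSE from class probabilities):

```python
# Hedged sketch of the evaluation measures listed above, computed on a
# toy two-class problem with invented labels.
import math

actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)            # correctly classified instances

# Treating predictions as hard 0/1 "probabilities" for MAE/RMSE.
errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae = sum(errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# Precision for class 1: correct positive decisions / all positive decisions.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
precision = tp / (tp + fp)

print(accuracy, mae, rmse, precision)  # 0.75 0.25 0.5 0.75
```

With hard predictions MAE equals the error rate and RMSE its square root, which is why the two move together, as noted above.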
5-3 The application of the algorithms to the datasets

5-3-1 Breast cancer database

Each cell gives the result for: Training set / Test set (66%) / 10-fold cross-validation.

Measure                          | Naïve Bayes              | Multilayer Perceptron    | Decision Tree (C4.5)
Correctly classified instances   | 96.13% / 95.37% / 95.99% | 99.28% / 95.79% / 95.85% | 98.56% / 95.37% / 94.56%
Incorrectly classified instances | 3.86% / 4.62% / 4.00%    | 0.71% / 4.20% / 4.14%    | 1.43% / 4.62% / 5.43%
Mean absolute error (MAE)        | 3.85% / 4.73% / 4.03%    | 1.79% / 4.71% / 4.72%    | 2.83% / 6.48% / 6.91%
Root mean squared error (RMSE)   | 19.28% / 21.23% / 19.83% | 8.66% / 19.04% / 19.15%  | 11.79% / 21.14% / 22.28%
Precision                        | 96.3% / 95.4% / 96.2%    | 99.3% / 95.8% / 95.9%    | 98.6% / 95.4% / 94.6%
Time (seconds)                   | 0.13 / 0.03 / 0.03       | 13.7 / 13.54 / 13.42     | 0.23 / 0.11 / 0.09

Table 5-1: The performances of the three algorithms applied to the Breast cancer database
The table shows that the three algorithms performed well when applied to the breast cancer database; the lowest correct-classification percentage is recorded for the C4.5 algorithm. Overall this reflects good precision as well as a small margin of error: the highest precision reaches 99%, while the lowest is about 94%, matching exactly the results obtained for the classification. For the 10-fold cross-validation, the smallest root mean squared error belongs to the multilayer perceptron with 19.15%, followed by Naïve Bayes with 19.83%. Overall, the multilayer perceptron is slightly better than the others, followed by Naïve Bayes and then the C4.5 algorithm. However, when it comes to execution time, Naïve Bayes is largely the winner with 0.03 seconds.
5-3-2 Hepatitis database
Each cell gives the result for: Training set / Test set (66%) / 10-fold cross-validation.

Measure                          | Naïve Bayes              | Multilayer Perceptron    | Decision Tree (C4.5)
Correctly classified instances   | 70.96% / 71.69% / 70.96% | 96.77% / 66.03% / 56.77% | 82.58% / 69.81% / 58.06%
Incorrectly classified instances | 29.03% / 28.30% / 29.03% | 3.22% / 33.96% / 43.22%  | 17.41% / 30.18% / 41.93%
Mean absolute error (MAE)        | 28.19% / 30.22% / 29.86% | 6.12% / 36.86% / 42.97%  | 28.21% / 38.13% / 43.57%
Root mean squared error (RMSE)   | 48.65% / 51.26% / 50.61% | 16.51% / 54.41% / 60.44% | 36.59% / 46.6% / 55.1%
Precision                        | 71.1% / 74.7% / 71.2%    | 96.8% / 66.5% / 56.8%    | 82.8% / 72% / 57.7%
Time (seconds)                   | 0.01 / 0.01 / 0.02       | 7.47 / 7.02 / 7.74       | 0.08 / 0.08 / 0.08

Table 5-2: The performances of the three algorithms applied to the hepatitis database
The performances shown in this table are less convincing than the previous ones for the breast cancer database. One reason for this result might be that this database contains a larger number of missing values than the breast cancer database. Naïve Bayes is the best classifier for this database, but the results are still poor. The multilayer perceptron achieves the best precision, but only on the training set; it obtains the worst proportion for the 10-fold cross-validation, where Naïve Bayes is the best classifier, followed by the C4.5 algorithm. Regarding the time of
execution, Naïve Bayes again has the shortest time with 0.01 seconds. On the other hand, the multilayer perceptron takes about 7 seconds on average, which is much higher than Naïve Bayes and even the decision tree, with its average of 0.08 seconds.
5-3-3 Heart disease database
Each cell gives the result for: Training set / Test set (66%) / 10-fold cross-validation.

Measure                          | Naïve Bayes              | Multilayer Perceptron    | Decision Tree (C4.5)
Correctly classified instances   | 63.36% / 52.42% / 56.43% | 81.84% / 54.36% / 53.46% | 78.54% / 46.60% / 52.47%
Incorrectly classified instances | 36.63% / 47.57% / 43.56% | 18.15% / 45.63% / 46.53% | 21.45% / 53.39% / 47.52%
Mean absolute error (MAE)        | 16.38% / 19.8% / 18.39%  | 9.22% / 19.24% / 19.07%  | 12.47% / 21.79% / 21.05%
Root mean squared error (RMSE)   | 30.66% / 35.92% / 33.97% | 21.86% / 39.9% / 38.34%  | 24.92% / 42.42% / 40.11%
Precision                        | 62.3% / 53.3% / 54.9%    | 85% / 52.3% / 51.8%      | 77.5% / 48.3% / 47.5%
Time (seconds)                   | 0.02 / 0.02 / 0.02       | 13.85 / 13.01 / 12.22    | 0.13 / 0.11 / 0.11

Table 5-3: The performances of the three algorithms applied to the heart disease database
The algorithms were tested on the third database, heart disease, and the results confirm those obtained in the previous evaluations. As shown, the algorithms performed well on the breast cancer database and less well on hepatitis. The worst classification is actually that of the heart disease database, which shows the worst results since the start of the experiment: it has the highest proportion of incorrect classifications as well as the lowest percentages of correct classification. Concerning the timing, the rates remained stable and, as always, Naïve Bayes is the fastest algorithm, while the multilayer perceptron is far behind its two competitors and is the slowest.
5-3-4 Diabetes database
Each cell gives the result for: Training set / Test set (66%) / 10-fold cross-validation.

Measure                          | Naïve Bayes              | Multilayer Perceptron    | Decision Tree (C4.5)
Correctly classified instances   | 76.30% / 77.01% / 76.30% | 80.59% / 74.32% / 75.39% | 84.11% / 76.24% / 73.82%
Incorrectly classified instances | 23.69% / 22.98% / 23.69% | 19.40% / 25.67% / 24.60% | 15.88% / 23.75% / 26.17%
Mean absolute error (MAE)        | 28.11% / 26.6% / 28.41%  | 28.52% / 31.86% / 29.55% | 23.83% / 31.25% / 31.58%
Root mean squared error (RMSE)   | 41.33% / 38.22% / 41.68% | 38.15% / 44.45% / 42.15% | 34.52% / 40.59% / 44.63%
Precision                        | 75.9% / 76.7% / 75.9%    | 81.9% / 75.6% / 75%      | 84.2% / 75.6% / 73.5%
Time (seconds)                   | 0.02 / 0.02 / 0.03       | 12.12 / 12.96 / 11.76    | 0.11 / 0.11 / 0.14

Table 5-4: The performances of the three algorithms applied to the diabetes database
The result of this last experiment is quite satisfying, better than the two previous ones. Regarding the correctly classified instances, the average is about 75% and the lowest value is 73.8%, which is a good average, and the same is observed for the precision. The root mean squared error results confirm the classification percentages. Nothing has changed for the execution time; the algorithms keep approximately the same results, with Naïve Bayes the fastest, followed by the decision tree, and the multilayer perceptron the slowest.
5-4 The medical prediction
Scientists give great attention to medical prediction; this is why data mining is being used to improve medicine and the accuracy of diagnosis as well as prediction. The following table summarises a data mining prediction:

                   | Survival | Death
Predicted survival | TP       | FP
Predicted death    | FN       | TN

Table 5-5: TP, FN, FP and TN prediction table

An interpretation of the values may be useful. True Positive (TP) is the number of patients predicted to survive among the patients who survived. True Negative (TN) is the number of patients predicted to die among the patients who died. False Positive (FP) is the number of patients predicted to survive among the patients who died, and False Negative (FN) is the number of patients predicted to die among the patients who survived.
As the experiment was done on four different databases, it was preferable to choose one of them and give a small example of medical prediction. The hepatitis database was chosen for this medical prediction, and 10-fold cross-validation was chosen as the method because of its good precision in the experiments done before. The number of patients is 155, so TP + FN + FP + TN = 155. The following results were obtained from the confusion matrices (see appendix 2).
Naïve Bayes        | Survival    | Death
Predicted survival | 70 (45.16%) | 15 (9.67%)
Predicted death    | 30 (19.35%) | 40 (25.80%)

Table 5-6: Prediction for hepatitis by the Naïve Bayes algorithm

Multilayer Perceptron | Survival    | Death
Predicted survival    | 51 (32.90%) | 34 (21.93%)
Predicted death       | 33 (21.29%) | 37 (23.87%)

Table 5-7: Prediction for hepatitis by the multilayer perceptron algorithm

Decision Tree (C4.5) | Survival    | Death
Predicted survival   | 57 (36.77%) | 28 (18.06%)
Predicted death      | 37 (23.87%) | 33 (21.29%)

Table 5-8: Prediction for hepatitis by the decision tree algorithm
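The counts in these confusion matrices can be turned back into the overall measures reported earlier. A quick check using the Naïve Bayes counts from Table 5-6:

```python
# Recompute accuracy and precision from the Naïve Bayes confusion
# matrix in Table 5-6 (TP=70, FP=15, FN=30, TN=40).
tp, fp, fn, tn = 70, 15, 30, 40

total = tp + fp + fn + tn
accuracy = (tp + tn) / total         # correctly classified instances
precision_survival = tp / (tp + fp)  # precision for the "survival" class

print(total)                         # 155
print(round(accuracy * 100, 2))      # 70.97
print(round(precision_survival, 3))  # 0.824
```

The recomputed 70.97% accuracy matches, up to rounding, the 70.96% correctly classified instances reported for Naïve Bayes under 10-fold cross-validation on the hepatitis database, confirming the two tables are consistent.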
5-5 The Comparison of the data mining algorithms applied to medical databases
In this part we analyse and evaluate the results and performances of the algorithms reported in the previous tables. Each algorithm was evaluated individually, which makes it possible to select the most suitable one for the classification of medical data. In fact three methods were adopted, but two of them were taken into serious consideration: the testing set and 10-fold cross-validation. It is hard to name any single algorithm/test configuration as the best, but at the final stage 10-fold cross-validation was selected among all the other configurations. The reason for choosing 10-fold cross-validation is simply its high performance, which is claimed to be better than any single train/test split. Its popularity for classifying such databases is also one of the reasons we chose it over the other configurations.
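To make the protocol concrete, here is a minimal pure-Python sketch of how 10-fold cross-validation partitions a data set of 155 instances (the size of the hepatitis set). This is only an illustration of the idea; WEKA performs an equivalent (stratified) partitioning internally, so these are not WEKA's exact folds.

```python
# Illustrative 10-fold cross-validation split: each instance appears in
# exactly one test fold and in the training set of the other nine folds.
def k_fold_indices(n, k=10):
    """Partition range(n) into k disjoint, roughly equal test folds."""
    return [list(range(i, n, k)) for i in range(k)]

n = 155                     # size of the hepatitis data set
folds = k_fold_indices(n)

# Every instance is tested exactly once across the k folds.
covered = sorted(i for fold in folds for i in fold)
assert covered == list(range(n))

for test_ids in folds:
    train_ids = [i for i in range(n) if i not in set(test_ids)]
    # a classifier would be trained on train_ids and evaluated on test_ids
    assert len(train_ids) + len(test_ids) == n
```

Because every instance is tested exactly once, the pooled predictions give a less optimistic estimate than evaluating on the training set itself.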
A comparison can be made, relying on the results presented in the previous tables and in Appendix 1 (visualised statistics). The results obtained were almost all satisfactory, apart from the heart disease results, which were the worst among the four, although not disastrous. Hepatitis and diabetes come next, with roughly similar proportions. Finally, the best results belong to the breast cancer data set, which interacts perfectly with the algorithms; this excellent outcome suggests that the breast cancer data set would be a good source of training data. Regarding the algorithms, Naïve Bayes dominates, with the majority of its proportions superior to the other results, followed by the multilayer perceptron and then the decision tree algorithm in last place. On the training set the multilayer perceptron and the decision tree beat Naïve Bayes in every case, but training-set performance is not a good factor for judging an algorithm. On the testing set and under 10-fold cross-validation, however, Naïve Bayes produced excellent results, which leads us to award it first place in this test. The error values reported were quite high; the reason might be the heterogeneity of the medical data and the complexity of the values of each attribute, which can affect classification performance negatively. In our case the multilayer perceptron, and especially the decision tree algorithm, might have been overtrained, given the disappointing results obtained. Surprisingly, the results closest to those of the multilayer perceptron and Naïve Bayes were yielded by the decision tree algorithm on the hepatitis data set. This unique result makes us look carefully at the hepatitis data set, in which almost all attributes are binary; we conclude that a binomial data set is a good source of training data for decision tree algorithms. Regarding execution time, Naïve Bayes is again by far the winner, without exception, followed by the decision tree and then, far behind these two, the multilayer perceptron.

The final conclusion drawn from this comparison is that Naïve Bayes is the most suitable algorithm for classifying a medical data set, in terms of both timing and performance.
Conclusion
As mentioned in the introduction, a huge amount of data is gathered daily and stored in medical databases. These databases may contain non-trivial dependencies between symptoms and diagnoses. Processing this data can be carried out with medical systems, which make it easier to uncover unclear medical results. With better diagnostic and predictive knowledge, it is much easier for doctors to give an accurate diagnosis, and in a very short time, for future cases.
The objective of the experiment was to identify and evaluate the performance of the most suitable data mining algorithm implemented in a modern Medical Decision Support System. Three algorithms were chosen (Naïve Bayes, multilayer perceptron, decision tree induction) and four data sets were selected (breast cancer, hepatitis, heart disease, diabetes) to conduct the experiment. It was crucial to use, as far as possible, the same measures and configuration in order to obtain a correct classification, and this was achieved successfully.
While the average classification accuracy of Naïve Bayes is somewhat higher than that of the decision tree and the multilayer perceptron, it would not be fair to conclude that Naïve Bayes is a better classifier than the decision tree and the multilayer perceptron in general. It is suggested, however, that the Naïve Bayes classifier has the potential to significantly improve on conventional classification methods for use in the medical or bioinformatics sector.
Appendix
Appendix 1: Evaluation of the performance of the data mining algorithms on the medical databases under 10-fold cross-validation
[Bar charts: percentage of correctly classified instances, percentage of incorrectly classified instances, and mean absolute error (MAE), each plotted on a 0% to 100% scale for Naïve Bayes, the multilayer perceptron, and C4.5 across the four databases.]
[Bar charts: root mean squared error (RMSE) and precision, each plotted on a 0% to 100% scale for Naïve Bayes, the multilayer perceptron, and C4.5 across the four databases.]
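For reference, MAE and RMSE can be computed from a classifier's predicted class probabilities. The snippet below shows one common two-class formulation on made-up illustrative values (not taken from this report); WEKA's reported figures follow essentially this idea, averaged per class, so treat this as a sketch of the definitions rather than a reproduction of the appendix numbers.

```python
import math

# Illustrative labels (1 = survival) and predicted probabilities of class 1;
# these values are invented for the example, not taken from the report.
y_true = [1, 0, 1, 1, 0]
p_one = [0.9, 0.2, 0.6, 0.4, 0.1]

# For two classes, the per-instance error reduces to |p(class 1) - y|.
errors = [abs(p - t) for p, t in zip(p_one, y_true)]
mae = sum(errors) / len(errors)                              # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))   # root mean squared error
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
```

RMSE penalises confident wrong predictions more heavily than MAE, which is why the RMSE bars above sit higher than the MAE bars for every algorithm.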
Appendix 2: Statistics for predicting the diagnosis of the hepatitis data set (10-fold cross-validation)
Naïve Bayes
Total Number of Instances 155
Correctly Classified Instances 110 (70.96%)
Incorrectly Classified Instances 45 (29.03%)
Kappa statistic 0.4026
Mean absolute error 0.2986
Root mean squared error 0.5061
Relative absolute error 60.26%
Root relative squared error 101.69%
Time taken to build model 0.02 seconds
Precision 71.2%
Number of classes 2

Confusion matrix:
                     Survival       Death
Predicted survival   70 (45.16%)    15 (9.67%)
Predicted death      30 (19.35%)    40 (25.80%)
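The kappa statistic reported above can be verified directly from the confusion matrix using the standard Cohen's kappa formula (agreement beyond chance). This Python sketch is an independent check, not part of the WEKA run:

```python
# Confusion-matrix counts from the Naive Bayes table above
tp, fp = 70, 15    # predicted survival: actually survived / actually died
fn, tn = 30, 40    # predicted death:    actually survived / actually died
n = tp + fp + fn + tn                       # 155 instances

p_observed = (tp + tn) / n                  # observed agreement (accuracy)
# chance agreement computed from the row and column marginals
p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"kappa = {kappa:.4f}")   # 0.4026, matching the WEKA output above
```

A kappa of about 0.40 indicates moderate agreement: the classifier does noticeably better than chance, but far from perfectly.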
Multilayer perceptron
Total Number of Instances 155
Correctly Classified Instances 88 (56.77%)
Incorrectly Classified Instances 67 (43.22%)
Kappa statistic 0.1248
Mean absolute error 0.4297
Root mean squared error 0.6044
Relative absolute error 86.72%
Root relative squared error 121.11%
Time taken to build model 7.44 seconds
Precision 56.8%
Number of classes 2

Confusion matrix:
                     Survival       Death
Predicted survival   51 (32.9%)     34 (21.93%)
Predicted death      33 (21.29%)    37 (23.87%)
Decision tree induction (C4.5)
Total Number of Instances 155
Correctly Classified Instances 90 (58.06%)
Incorrectly Classified Instances 65 (41.93%)
Kappa statistic 0.1436
Mean absolute error 0.551
Root mean squared error 0.6044
Relative absolute error 87.93%
Root relative squared error 110.71%
Time taken to build model 0.08 seconds
Precision 57.7%
Number of classes 2

Confusion matrix:
                     Survival       Death
Predicted survival   57 (36.77%)    28 (18.06%)
Predicted death      37 (23.87%)    33 (21.29%)