
    Department of Computer Science and Electronic Engineering, University of Essex

    Report of Data Mining

Data mining classification techniques implemented in Medical Decision Support System

    CE802 Machine Learning and Data Mining

    Name: Bouchou Mohamed Oussama

E-Mail: [email protected]

    Supervisor: Paul Scott

    Date: 15 January 2012

Abstract

This paper presents the analysis and evaluation of some data mining classification algorithms that are commonly applied in modern medical decision support systems (MDSS). Medical institutions store very large quantities of data, and the relevant information hidden in them could be precious for the future of science and medicine. Algorithms are needed to extract this relevant knowledge, so three algorithms were chosen for this experiment; WEKA was the program used to conduct the analyses, and four medical datasets were employed. It is hard to name a single algorithm as the most suitable for this kind of task, and there was no large gap between the results. However, the experiment suggests that Naïve Bayes is the best classifier for such medical data.


TABLE OF CONTENTS:

Abstract
1- Introduction
2- The objectives of the experiment
3- Data mining classifier algorithms
    3-1 Decision tree induction
    3-2 Multilayer perceptron (neural networks)
    3-3 Naïve Bayes
4- Description of the datasets
    4-1 Databases description
        4-1-1 Breast cancer database
        4-1-2 Hepatitis database
        4-1-3 Heart disease database
        4-1-4 Diabetes database
5- The Experiment
    5-1 WEKA
    5-2 Method for evaluating the algorithms
    5-3 The application of the algorithms to the data sets
        5-3-1 Breast cancer database
        5-3-2 Hepatitis database
        5-3-3 Heart disease database
        5-3-4 Diabetes database
    5-4 The medical prediction
    5-5 The comparison of the data mining algorithms applied to medical databases
Conclusion
References
Appendix


1- Introduction

Healthcare is one of the richest sectors when it comes to collecting information. Medical information, data and knowledge are increasing incredibly over time; it has been estimated that an acute care facility may generate five terabytes of data a year [1]. It is crucial, and far from trivial, to extract useful knowledge from this huge quantity of data. The collected data is stored in databases or even data warehouses, and these database systems differ from one institution to another. In the last few decades a new type of medical system has emerged in the domain of medicine [2]. The data stored in these medical systems may contain precious hidden knowledge, so experiments are needed in order to process the data and retrieve useful knowledge.

Human decision making is usually close to optimal when only a small amount of data has to be processed, but it becomes hard and inexact when a large amount of data is involved.

This situation has pushed computer science engineers and medical staff to work together. The objective of this collaboration is to develop the most suitable methods for data processing, which allow nontrivial rules to be discovered. The result is an improved process of diagnosis and treatment, in addition to a reduced risk of medical mistakes.

This experiment aims to identify and evaluate the most common data mining algorithms implemented in modern medical decision support systems (MDSS) [3]. Many experiments have been done in this field, but they were assessed using different measures and different datasets, which makes comparisons between algorithms almost impossible. This paper contrasts and compares three of the common data mining classification algorithms (Naïve Bayes, multilayer perceptron, decision tree induction). All conditions and configurations were prepared to make sure that the experiment was conducted under the same conditions for each algorithm.

2- The objectives of the experiment

The main objective of this experiment is to evaluate three selected data mining classification algorithms which are commonly implemented in medical decision support systems. After obtaining the results, it is important to compare the performances of the algorithms and try to identify the most suitable and powerful one for the extraction of knowledge from medical data.

The rest of this report covers:

- a brief definition of the algorithms used, together with a presentation of the datasets;
- the experiment conducted under WEKA;


- a comparison between the performances of the algorithms, followed by a conclusion.

3- Data mining classifier algorithms

In medicine, data mining is one of the tools used to search for valuable hidden patterns, thereby giving a more precise diagnosis as well as saving precious time. This part describes the three selected algorithms, which are commonly used by experts in this kind of experiment.

3-1 Decision Tree Induction

Decision trees are one of the most frequently used techniques of data analysis. The advantages of this method are unquestionable: decision trees are, among other things, easy to visualize and understand, and resistant to noise in the data [4]. Commonly, decision trees are used to classify records into a specific class; variants of the technique are applicable to regression tasks as well.

Decision trees have been applied successfully in medicine, for example to classify prostate cancer or breast cancer cases.

    3-2 Multilayers perceptron (Neural networks)

    Neural network is a type ofartificial intelligence that attempts to imitate the way a human

    brain works. Rather than using a digital model, in which all computations manipulate zeros

    and ones, a neural networkworks by creating connections between processing elements, the

    computer equivalent of neurons. The organization and weights of the connections determine

    the output. Neural networks are particularly effective for predicting events when the networks

    have a large database of prior examples to draw on. Strictly speaking, a neural networkimplies a non-digital computer, but neural networks can be simulated on digital computers.

    In medicine diagnosis the input is the symptoms of the patient. The output however, is the

    prediction of different diseases [5].

    A type of an Artificial Neural Network depends on the basic unitperceptron. The

    Perceptron takes an input value vector and outputs 1 if the results are greater than a

    predefined threshold or 1 otherwise. The calculation is based on a linear combination of the

    input values.
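As a minimal illustration of this decision rule (not part of the original experiment; the weights, threshold and inputs below are hypothetical), a single perceptron output could be computed as follows:

    // Minimal perceptron decision rule: sign of a weighted sum compared to a threshold.
    public class PerceptronSketch {

        // Returns 1 if the linear combination of inputs exceeds the threshold, -1 otherwise.
        static int classify(double[] inputs, double[] weights, double threshold) {
            double sum = 0.0;
            for (int i = 0; i < inputs.length; i++) {
                sum += inputs[i] * weights[i];   // linear combination of the input values
            }
            return sum > threshold ? 1 : -1;
        }

        public static void main(String[] args) {
            double[] symptoms = {1.0, 0.0, 3.5};   // hypothetical encoded symptoms
            double[] weights  = {0.4, -0.2, 0.1};  // hypothetical learned weights
            System.out.println(classify(symptoms, weights, 0.5));  // prints 1, since 0.75 > 0.5
        }
    }

In a multilayer perceptron, many such units are arranged in layers and trained together; the sketch above only shows the decision rule of one unit.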

3-3 Naïve Bayes

Naïve Bayes is a simple probabilistic classifier. It is based on an assumption of mutual independence of the attributes (independent variable, independent feature model). Usually this assumption is far from true, and this is the reason for the "naivety" of the method [5]. The probabilities applied in the Naïve Bayes algorithm are calculated according to the Bayes rule: the probability of a hypothesis H can be calculated on the basis of the prior probability of H and the evidence E about the hypothesis, according to the following formula:

    P(H|E) = P(E|H) * P(H) / P(E)
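As a small worked example with invented numbers (not taken from the datasets used in this report): if a disease has prior probability P(H) = 0.01, the symptom is observed in 90% of sick patients, P(E|H) = 0.9, and in 8% of all patients, P(E) = 0.08, then P(H|E) = 0.9 * 0.01 / 0.08 ≈ 0.11; the evidence raises the estimated probability of the disease from 1% to about 11%.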

In practice, Naïve Bayes has shown that it can work efficiently in real situations; for instance, in medicine it has been used in the diagnosis and treatment of pneumonia.


4- Description of the datasets

To conduct the experiment, four databases were taken into consideration. This part discusses the source of the databases in addition to their details.

The databases were taken from the well-known UCI medical data repository. The reason for this choice is that the databases vary from one to another and were collected by different clinics or hospitals, which permits us to compare and evaluate the performances of the algorithms under realistic conditions. Another reason for choosing UCI is to let other people run the same experiment and compare their results.

4-1 Databases description

Here is a detailed description of the four databases chosen for the experiment, all taken from UCI, together with the values of every database.

4-1-1 Breast cancer database

The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

The following table contains the details of the breast cancer database.

    Dataset         Attributes   Symptoms   Instances   Classes   Attribute type   Missing values
    Breast cancer   11           9          699         2         Integer          Yes

Table 4-1: Breast cancer database details

The following table summarises the values of the database cited above.

    Symptom name                   Values of the attributes
    Patient id number              e.g. 1000025
    Clump Thickness                1 - 10
    Uniformity of Cell Size        1 - 10
    Uniformity of Cell Shape       1 - 10
    Marginal Adhesion              1 - 10
    Single Epithelial Cell Size    1 - 10
    Bare Nuclei                    1 - 10
    Bland Chromatin                1 - 10
    Normal Nucleoli                1 - 10
    Mitoses                        1 - 10
    Diagnosis (classes)            2: malignant, 4: benign

Table 4-2: Attributes and their possible values for the Breast cancer database


4-1-2 Hepatitis database

The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data

The following table contains the details of the hepatitis database.

    Dataset     Attributes   Symptoms   Instances   Classes   Attribute type   Missing values
    Hepatitis   20           19         155         2         Integer, Real    Yes

Table 4-3: Hepatitis database details

The following table summarises the values of the database cited above.

    Symptom name           Values of the attributes
    Age                    10, 20, 30, 40, 50, ...
    Sex                    1: male, 0: female
    Antivirals             1: yes, 0: no
    Fatigue                1: yes, 0: no
    Anorexia               1: yes, 0: no
    Liver big              1: yes, 0: no
    Liver firm             1: yes, 0: no
    Malaise                1: yes, 0: no
    Steroid                1: yes, 0: no
    Spleen palpable        1: yes, 0: no
    Spiders                1: yes, 0: no
    Ascites                1: yes, 0: no
    Histology              1: yes, 0: no
    Varices                1: yes, 0: no
    Bilirubin              0.39, 0.40, 0.41, ..., 4
    Alk Phosphate          30, 90, 130, ..., 220, 300
    SGOT                   13, 100, 200, 400, 500
    Albumin                2.1, 3.0, 3.8, 4.5, 5.0, 6.0
    Protime                10, 20, 30, 40, 50, 60, 70, 80, 90
    Diagnosis (classes)    1: alive, 2: die

Table 4-4: Attributes and their possible values for the Hepatitis database

4-1-3 Heart disease database

The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data

The following table contains the details of the heart disease database.


    Dataset         Attributes   Symptoms   Instances   Classes   Attribute type   Missing values
    Heart disease   14           13         303         5         Integer, Real    Yes

Table 4-5: Heart disease database details

The following table summarises the values of the database cited above.

    Symptom name                                          Values of the attributes
    Age                                                   10, 20, 30, 40, 50, ...
    Sex                                                   1: male, 0: female
    Chest pain type                                       1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic
    Resting blood pressure (mm Hg)                        94 - 200
    Serum cholesterol (mg/dl)                             126 - 654
    Fasting blood sugar > 120 mg/dl                       1: yes, 0: no
    Resting electrocardiographic results                  0, 1, 2
    Maximum heart rate achieved                           70 - 200
    Exercise induced angina                               1: yes, 0: no
    ST depression induced by exercise relative to rest    0, 0.1, ..., 6.01
    Slope of the peak exercise ST segment                 1, 2, 3
    Number of major vessels colored by fluoroscopy        1, 2, 3
    Thal                                                  3: normal, 6: fixed defect, 7: reversible defect
    Diagnosis (classes)                                   0, 1, 2, 3, 4

Table 4-6: Attributes and their possible values for the Heart disease database

4-1-4 Diabetes database

The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

The following table contains the details of the diabetes database.

    Dataset    Attributes   Symptoms   Instances   Classes   Attribute type   Missing values
    Diabetes   9            8          768         2         Integer, Real    Yes

Table 4-7: Diabetes database details


The following table summarises the values of the database cited above.

    Symptom name                        Values of the attributes
    Age                                 10, 20, 30, 40, 50, ...
    Number of times pregnant            0, 1, 2, 3, 4, ...
    Plasma glucose concentration        0 - 199
    Diastolic blood pressure (mm Hg)    24 - 122
    Triceps skin fold thickness (mm)    7 - 99
    Serum insulin (mu U/ml)             14 - 850
    Body mass index (kg/m2)             18 - 68
    Diabetes pedigree function          0.01, 0.02, ..., 0.4, ...
    Diagnosis (classes)                 0: non-diabetic, 1: diabetic

Table 4-8: Attributes and their possible values for the Diabetes database

5- The Experiment

5-1 WEKA

WEKA is a data mining program developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. WEKA is a facility for developing machine learning (ML) techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks, and the algorithms can be applied directly to a dataset. WEKA implements algorithms for data pre-processing, classification, regression, clustering and association rules, in addition to visualization tools. New machine learning schemes can also be developed with this package. WEKA is open source software issued under the General Public License (GPL) [6].

The decision tree algorithm used is C4.5, which in WEKA is called J48. It has a pruning feature that removes sections of the classifier which are based on noisy data, thereby reducing overfitting.

The Naïve Bayes classifier is simply called NaiveBayes in WEKA. It is a simple algorithm, known for its good performance especially on large data sets, because it treats all attributes as conditionally independent.

The neural network algorithm used is the multilayer perceptron; it consists of layers of nodes connected to one another, where every node except those in the input layer is a neuron. It generally classifies well, especially on numeric attributes.
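As a rough sketch of how such an experiment can be set up with WEKA's Java API (assuming a recent WEKA distribution on the classpath and an ARFF copy of one of the datasets, here hypothetically named breast-cancer-wisconsin.arff with the class as the last attribute), the three classifiers can be instantiated and trained as follows:

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MdssExperiment {
        public static void main(String[] args) throws Exception {
            // Load one of the medical datasets (hypothetical ARFF file name).
            Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);   // the class is the last attribute

            // The three classifiers compared in this report, with default parameters.
            Classifier[] classifiers = { new NaiveBayes(), new MultilayerPerceptron(), new J48() };
            for (Classifier c : classifiers) {
                long start = System.currentTimeMillis();
                c.buildClassifier(data);                    // train on the full training set
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(c.getClass().getSimpleName() + " built in " + elapsed + " ms");
            }
        }
    }

The same three classes are what the WEKA Explorer uses behind the scenes when J48, NaiveBayes and MultilayerPerceptron are selected in the graphical interface.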


5-2 Method for evaluating the algorithms

There are several methods for analysing and extracting knowledge from data. In our case, a medical decision support system, the rates of correct and incorrect diagnoses should be analysed. This classification is taken seriously into consideration because of the impact a wrong decision might have. In medical records a classification is not a certainty but a prediction: we cannot state with certainty whether or when a person will die, or whether a person has cancer.

It is crucial that the experiment follow fixed conditions and measures, because one of its objectives is to determine which parameter settings yield the best model. Another aim is to find an optimal setting in order to maximise performance.

During the experiment we used the default data splits and k-fold cross-validation. The whole training data set was also kept in order to compare the algorithms in a fair environment. Regarding the split, the more data is used for training, the better the model is built; at the same time, the more data is used for testing, the more accurate the estimate of the result. So it is important to divide the data set into two parts, training data and testing data, but what should the proportion of each be? It is commonly reported that good results are obtained when 66% of the data is used to build the model and the remaining 34% is used only for testing. For cross-validation, experience suggests that the best results are obtained with 10 folds (that is, the data set is split into 10 equal pieces).

Every single algorithm was analysed under the same data split and cross-validation parameters. In order to achieve a fairer result, all the algorithms were initialised with the same learning parameters throughout the experiment.

The following measures were recorded during the experiment (a minimal evaluation sketch in WEKA's Java API is given after this list):

- Correctly classified instances
- Incorrectly classified instances
- Mean absolute error (MAE): this error is tightly related to classification accuracy, so if the MAE increases the classification accuracy decreases, and vice versa
- Root mean squared error (RMSE): almost the same idea as the MAE, but the RMSE is considered a better indicator of classification quality than the MAE
- Precision: the probability of being correct, given a decision
- Time of execution of the algorithm
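As a hedged sketch of how these measures can be obtained programmatically (the ARFF file name and random seed below are illustrative assumptions, not values taken from the report), the 66%/34% split and the 10-fold cross-validation could be run as follows:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EvaluationSketch {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("hepatitis.arff").getDataSet();  // hypothetical ARFF name
            data.setClassIndex(data.numAttributes() - 1);

            // 66% / 34% percentage split.
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(train);
            Evaluation splitEval = new Evaluation(train);
            splitEval.evaluateModel(nb, test);
            System.out.printf("Split  : %.2f%% correct, MAE %.4f, RMSE %.4f%n",
                    splitEval.pctCorrect(), splitEval.meanAbsoluteError(), splitEval.rootMeanSquaredError());

            // 10-fold cross-validation on a fresh, untrained classifier.
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.printf("10-fold: %.2f%% correct, precision(class 0) %.3f%n",
                    cvEval.pctCorrect(), cvEval.precision(0));
        }
    }

The exact figures produced by such a run depend on the random seed and the WEKA version, so they will not necessarily match the tables below digit for digit.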


5-3 The application of the algorithms to the data sets

5-3-1 Breast cancer database

                               Naïve Bayes                       Multilayer Perceptron             Decision Tree (C4.5)
    Method                     Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV
    Correctly classified       96.13%   95.37%     95.99%        99.28%   95.79%     95.85%        98.56%   95.37%     94.56%
    Incorrectly classified      3.86%    4.62%      4.00%         0.71%    4.20%      4.14%         1.43%    4.62%      5.43%
    Mean absolute error         3.85%    4.73%      4.03%         1.79%    4.71%      4.72%         2.83%    6.48%      6.91%
    Root mean squared error    19.28%   21.23%     19.83%         8.66%   19.04%     19.15%        11.79%   21.14%     22.28%
    Precision                  96.3%    95.4%      96.2%         99.3%    95.8%      95.9%         98.6%    95.4%      94.6%
    Time (seconds)              0.13     0.03       0.03         13.7     13.54      13.42          0.23     0.11       0.09

Table 5-1: The performances of the three algorithms applied to the Breast cancer database

The table shows that the three algorithms performed well on the breast cancer database; the lowest rate of correct classification is recorded for the C4.5 algorithm. This reflects good precision as well as a small margin of error. The highest precision reaches 99%, while the lowest is about 94%, which matches exactly the results obtained for classification. The smallest root mean squared error under cross-validation belongs to the multilayer perceptron with 19.15%; Naïve Bayes comes next with 19.83%. Overall, the multilayer perceptron is slightly better than the others in terms of results, followed by Naïve Bayes and then the C4.5 algorithm. However, when it comes to execution time, Naïve Bayes is by far the winner with 0.03 seconds.
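For reference, results of this kind can also be reproduced from the WEKA command line; a hedged example (the ARFF file name is an assumption, and the option set may vary slightly between WEKA versions) for the J48 tree with 10-fold cross-validation would look like:

    java -cp weka.jar weka.classifiers.trees.J48 -t breast-cancer-wisconsin.arff -x 10 -i

Here -t names the training file, -x the number of cross-validation folds, and -i requests per-class precision and recall in the output.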

5-3-2 Hepatitis database

                               Naïve Bayes                       Multilayer Perceptron             Decision Tree (C4.5)
    Method                     Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV
    Correctly classified       70.96%   71.69%     70.96%        96.77%   66.03%     56.77%        82.58%   69.81%     58.06%
    Incorrectly classified     29.03%   28.30%     29.03%         3.22%   33.96%     43.22%        17.41%   30.18%     41.93%
    Mean absolute error        28.19%   30.22%     29.86%         6.12%   36.86%     42.97%        28.21%   38.13%     43.57%
    Root mean squared error    48.65%   51.26%     50.61%        16.51%   54.41%     60.44%        36.59%   46.6%      55.1%
    Precision                  71.1%    74.7%      71.2%         96.8%    66.5%      56.8%         82.8%    72%        57.7%
    Time (seconds)              0.01     0.01       0.02          7.47     7.02       7.74          0.08     0.08       0.08

Table 5-2: The performances of the three algorithms applied to the Hepatitis database

The performances shown in this table are less convincing than those obtained for breast cancer. One reason may be that this database contains a larger proportion of missing values than the breast cancer database. Naïve Bayes is the best classifier for this database, but the results are still poor. The multilayer perceptron obtains the best precision, but only on the training set; under 10-fold cross-validation it gets the worst score, and in that case Naïve Bayes is the best classifier, followed by the C4.5 algorithm.


Regarding execution time, Naïve Bayes again has the shortest time with 0.01 seconds. On the other hand, the multilayer perceptron takes about 7 seconds on average, which is much higher than Naïve Bayes and even the decision tree with its average of 0.08 seconds.

5-3-3 Heart disease database

                               Naïve Bayes                       Multilayer Perceptron             Decision Tree (C4.5)
    Method                     Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV
    Correctly classified       63.36%   52.42%     56.43%        81.84%   54.36%     53.46%        78.54%   46.60%     52.47%
    Incorrectly classified     36.63%   47.57%     43.56%        18.15%   45.63%     46.53%        21.45%   53.39%     47.52%
    Mean absolute error        16.38%   19.8%      18.39%         9.22%   19.24%     19.07%        12.47%   21.79%     21.05%
    Root mean squared error    30.66%   35.92%     33.97%        21.86%   39.9%      38.34%        24.92%   42.42%     40.11%
    Precision                  62.3%    53.3%      54.9%         85%      52.3%      51.8%         77.5%    48.3%      47.5%
    Time (seconds)              0.02     0.02       0.02         13.85    13.01      12.22          0.13     0.11       0.11

Table 5-3: The performances of the three algorithms applied to the Heart disease database

The algorithms were then tested on the third database, heart disease, and the outcome confirms the results of the previous evaluations. As shown above, the algorithms performed well on the breast cancer database and less well on hepatitis. The classification of heart disease is actually the worst result obtained since the start of the experiment: it has the highest proportion of incorrect classifications as well as the lowest percentages of correct classification. Concerning timing, the rates remain stable and, as always, Naïve Bayes is the fastest algorithm. On the other hand, the multilayer perceptron's timing is far behind its two competitors and it is considered the slowest.

5-3-4 Diabetes database

                               Naïve Bayes                       Multilayer Perceptron             Decision Tree (C4.5)
    Method                     Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV
    Correctly classified       76.30%   77.01%     76.30%        80.59%   74.32%     75.39%        84.11%   76.24%     73.82%
    Incorrectly classified     23.69%   22.98%     23.69%        19.40%   25.67%     24.60%        15.88%   23.75%     26.17%
    Mean absolute error        28.11%   26.6%      28.41%        28.52%   31.86%     29.55%        23.83%   31.25%     31.58%
    Root mean squared error    41.33%   38.22%     41.68%        38.15%   44.45%     42.15%        34.52%   40.59%     44.63%
    Precision                  75.9%    76.7%      75.9%         81.9%    75.6%      75%           84.2%    75.6%      73.5%
    Time (seconds)              0.02     0.02       0.03         12.12    12.96      11.76          0.11     0.11       0.14

Table 5-4: The performances of the three algorithms applied to the Diabetes database


The results of this last experiment are quite satisfying, better than the previous two. Regarding correctly classified instances, the average is around 75% and the lowest value is 73.8%, which is a good level, and the same is observed for precision. The root mean squared error results confirm the classification percentages. Nothing has changed for execution time, which remains similar to the earlier runs: Naïve Bayes comes first, followed by the decision tree, with the multilayer perceptron last.

5-4 The medical prediction

Scientists pay great attention to medical prediction; this is why data mining is being used to improve medicine and the accuracy of diagnosis as well as prediction. The following table summarises a data mining prediction.

                           Survival   Death
    Predicted survival     TP         FP
    Predicted death        FN         TN

Table 5-5: TP, FN, FP and TN prediction table

An interpretation of these values is useful. True Positives (TP) are the patients predicted to survive among those who survived. True Negatives (TN) are the patients predicted to die among those who died. False Positives (FP) are the patients predicted to survive among those who died, and False Negatives (FN) are the patients predicted to die among those who survived.
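From these four counts the evaluation measures used earlier can be derived directly. A minimal sketch follows, using the Naïve Bayes hepatitis counts reported below; note that the accuracy matches the roughly 71% correctly classified instances reported for Naïve Bayes, while the per-class precision shown here differs from the report's precision figure, which WEKA averages over both classes.

    public class ConfusionMatrixSketch {
        public static void main(String[] args) {
            // Counts taken from the Naïve Bayes hepatitis confusion matrix reported below.
            int tp = 70, fp = 15, fn = 30, tn = 40;
            int total = tp + fp + fn + tn;                       // 155 patients

            double accuracy = (double) (tp + tn) / total;        // rate of correctly classified instances
            double survivalPrecision = (double) tp / (tp + fp);  // precision for the "survival" class only

            System.out.printf("Accuracy: %.2f%%%n", 100 * accuracy);
            System.out.printf("Precision (survival class): %.2f%%%n", 100 * survivalPrecision);
        }
    }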

Since the experiment was done on four different databases, one of them was chosen to give a small example of medical prediction. The hepatitis database was selected, and 10-fold cross-validation was used as the evaluation method because of its good precision in the experiments above. The number of patients is 155, so TP + FN + FP + TN = 155. The following results were obtained from the confusion matrices (see Appendix 2).

    Naïve Bayes            Survival       Death
    Predicted survival     70 (45.16%)    15 (9.67%)
    Predicted death        30 (19.35%)    40 (25.80%)

Table 5-6: Prediction for hepatitis by the Naïve Bayes algorithm

    Multilayer Perceptron  Survival       Death
    Predicted survival     51 (32.9%)     34 (21.93%)
    Predicted death        33 (21.29%)    37 (23.87%)

Table 5-7: Prediction for hepatitis by the multilayer perceptron algorithm

    Decision Tree (C4.5)   Survival       Death
    Predicted survival     57 (36.77%)    28 (18.06%)
    Predicted death        37 (23.87%)    33 (21.29%)

Table 5-8: Prediction for hepatitis by the decision tree algorithm


5-5 The comparison of the data mining algorithms applied to medical databases

In this part we analyse and evaluate the results and performances of the algorithms reported in the previous tables. Each algorithm was evaluated individually, which makes it possible to select the most suitable one for the classification of medical data. Three evaluation methods were used, but two of them are of particular interest: the testing set and 10-fold cross-validation. It is hard to declare one test configuration the best, but in the end 10-fold cross-validation was selected among all the configurations. The reason for choosing it is simply its high performance, which is generally claimed to be better than any single split; its popularity for classifying such databases is another reason for preferring it over other configurations.

A comparison can be made by relying on the results presented in the previous tables and in Appendix 1 (visualisation statistics). The results obtained were almost all satisfying, apart from the heart disease results, which are not a disaster but are the worst of the set. Next come hepatitis and diabetes with roughly similar proportions. Finally, the best results belong to breast cancer, which interacts perfectly with the algorithms; this excellent result allows us to deduce that this data set is well suited for training. Regarding the algorithms, Naïve Bayes leads, with the majority of its scores superior to the others, followed by the multilayer perceptron and then the decision tree algorithm in last place. On the training set, the multilayer perceptron and the decision tree beat Naïve Bayes in all cases, but training-set performance is not a good factor for judging the algorithms. On the testing set and under 10-fold cross-validation, however, Naïve Bayes produced excellent results, which leads us to award it first place in this test. The errors reported were quite high; the reason might be the heterogeneity of the medical data and the complexity of the values of each attribute, which can affect classification performance negatively. In our case the multilayer perceptron, and especially the decision tree algorithm, might have been overtrained, given the disappointing results obtained. Surprisingly, the results closest to those of the multilayer perceptron and Naïve Bayes were produced by the decision tree algorithm when training on the hepatitis data set. This result makes us look carefully at the hepatitis data set, whose attributes are almost all binary; we conclude that a binomial data set is a good source of training data for decision tree algorithms. When it comes to execution time, Naïve Bayes again is by far the winner without exception, followed by the decision tree and then, far behind these two, the multilayer perceptron.

The final conclusion drawn from this comparison is that Naïve Bayes is the most suitable algorithm for medical data sets in terms of classification, in both timing and performance.


Conclusion

As mentioned in the introduction, a huge amount of data is gathered daily and stored in medical databases. These databases may contain nontrivial dependencies between symptoms and diagnoses. The processing of this data can be carried out by medical systems, which make it easier to uncover unclear medical results. With better diagnostic and predictive knowledge, it becomes much easier for doctors to give an accurate diagnosis, and in a very short time, for future cases.

The objective of the experiment was to identify and evaluate the performance of the most suitable data mining algorithms implemented in a modern Medical Decision Support System. Three algorithms were chosen (Naïve Bayes, multilayer perceptron, decision tree induction) and four datasets were selected (breast cancer, hepatitis, heart disease, diabetes) to conduct the experiment. It was crucial to use the same measures and configuration as far as possible in order to obtain a fair classification, and this was realised successfully.

While the average classification accuracy of Naïve Bayes is somewhat higher than that of the decision tree and the multilayer perceptron, it would not be fair to conclude outright that Naïve Bayes is a better classifier than the decision tree and the multilayer perceptron. It is suggested, however, that the Naïve Bayes classifier has the potential to significantly improve on conventional classification methods in the medical and bioinformatics sectors.


References

[1] Huang, H. et al., "Business rule extraction from legacy code", Proceedings of the 20th International Conference on Computer Software and Applications (IEEE COMPSAC'96), 1996, pp. 162-167.

[2] Chae, Y. M., Kim, H. S., Tark, K. C., Park, H. J., Ho, S. H., "Analysis of healthcare quality indicator using data mining and decision support system", Expert Systems with Applications, 2003, pp. 167-172.

[3] Duch, W., Grabczewski, K., Adamczak, R., Grudzinski, K., Hippe, Z. S., "Rules for melanoma skin cancer diagnosis", 2001, http://www.phys.uni.torun.pl/publications/kmk/, retrieved on 4.04.2007.

[4] Witten, I. H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Elsevier, 2005.

[5] Kamila, A., "Evaluation of selected data mining algorithms implemented in Medical Decision Support Systems", September 2007.

[6] WEKA, available online at http://www.cs.waikato.ac.nz/~ml/weka


Appendix

Appendix 1: Evaluation of the performance of the data mining algorithms on the medical databases (10-fold cross-validation)

[Bar charts, omitted here, compare Naïve Bayes, the multilayer perceptron and C4.5 across the four databases on: correctly classified instances, incorrectly classified instances, mean absolute error (MAE), root mean squared error (RMSE) and precision, each on a 0-100% scale.]



Appendix 2: Statistics for predicting the diagnosis of the hepatitis data set (10-fold cross-validation)

Naïve Bayes
    Total number of instances           155
    Correctly classified instances      110 (70.96%)
    Incorrectly classified instances    45 (29.03%)
    Kappa statistic                     0.4026
    Mean absolute error                 0.2986
    Root mean squared error             0.5061
    Relative absolute error             60.26%
    Root relative squared error         101.69%
    Time taken to build model           0.02 seconds
    Precision                           71.2%
    Number of classes                   2

    Confusion matrix       Survival       Death
    Predicted survival     70 (45.16%)    15 (9.67%)
    Predicted death        30 (19.35%)    40 (25.80%)

Multilayer perceptron
    Total number of instances           155
    Correctly classified instances      88 (56.77%)
    Incorrectly classified instances    67 (43.22%)
    Kappa statistic                     0.1248
    Mean absolute error                 0.4297
    Root mean squared error             0.6044
    Relative absolute error             86.72%
    Root relative squared error         121.11%
    Time taken to build model           7.44 seconds
    Precision                           56.8%
    Number of classes                   2

    Confusion matrix       Survival       Death
    Predicted survival     51 (32.9%)     34 (21.93%)
    Predicted death        33 (21.29%)    37 (23.87%)


Decision tree induction (C4.5)
    Total number of instances           155
    Correctly classified instances      90 (58.06%)
    Incorrectly classified instances    65 (41.93%)
    Kappa statistic                     0.1436
    Mean absolute error                 0.551
    Root mean squared error             0.6044
    Relative absolute error             87.93%
    Root relative squared error         110.71%
    Time taken to build model           0.08 seconds
    Precision                           57.7%
    Number of classes                   2

    Confusion matrix       Survival       Death
    Predicted survival     57 (36.77%)    28 (18.06%)
    Predicted death        37 (23.87%)    33 (21.29%)