
    Department of Computer Science and Electronic Engineering, University of Essex

    Report of Data Mining

Data mining classification techniques implemented in Medical Decision Support System

    CE802 Machine Learning and Data Mining

    Name: Bouchou Mohamed Oussama

E-Mail: [email protected]

    Supervisor: Paul Scott

    Date: 15 January 2012

Abstract

This paper presents the analysis and evaluation of some data mining classification algorithms that are commonly applied in modern medical decision support systems (MDSS). Medical institutions store very large quantities of data, and the relevant information hidden in them could be precious for the future of science and medicine. Algorithms are needed to extract this relevant knowledge, so three algorithms were chosen for this experiment; WEKA was the program used to conduct the analyses, and four medical datasets were employed. It is hard to name a single algorithm as the most suitable for this kind of task, and there was no large gap between the results. However, the experiment suggests that Naïve Bayes is the best classifier for such medical data.


TABLE OF CONTENTS:

Abstract
1- Introduction
2- The objectives of the experiment
3- Data mining classifier algorithms
    3-1 Decision tree induction
    3-2 Multilayer perceptron (neural networks)
    3-3 Naïve Bayes
4- Description of the datasets
    4-1 Databases description
        4-1-1 Breast cancer database
        4-1-2 Hepatitis database
        4-1-3 Heart disease database
        4-1-4 Diabetes database
5- The Experiment
    5-1 WEKA
    5-2 Method for evaluating the algorithms
    5-3 The application of the algorithms to the data sets
        5-3-1 Breast cancer database
        5-3-2 Hepatitis database
        5-3-3 Heart disease database
        5-3-4 Diabetes database
    5-4 The medical prediction
    5-5 The comparison of the data mining algorithms applied to medical databases
Conclusion
References
Appendix


1- Introduction

Healthcare is one of the richest sectors when it comes to collecting information. Medical information, data and knowledge are increasing incredibly over time; it has been estimated that an acute care facility may generate five terabytes of data a year [1]. It is crucial, and far from trivial, to extract useful knowledge from this huge quantity of data. The collected data is stored in databases or even data warehouses, and these database systems differ from one institution to another. In the last few decades a new type of medical system has emerged in the domain of medicine [2]. The data stored in these medical systems may contain precious hidden knowledge, so experiments are needed in order to process the data and retrieve useful knowledge.

Human decision making is usually close to optimal when only a small amount of data has to be processed, but it becomes hard and inexact when a large amount of data is involved.

This situation has pushed computer science engineers and medical staff to work together. The objective of this collaboration is to develop the most suitable methods for data processing, which allow nontrivial rules to be discovered. The result is an improved process of diagnosis and treatment, in addition to a reduced risk of medical mistakes.

This experiment aims to identify and evaluate the most common data mining algorithms implemented in modern medical decision support systems (MDSS) [3]. Many experiments have been done in this field, but they were assessed using different measures and different datasets, which makes comparisons between algorithms almost impossible. This paper contrasts and compares three of the common data mining classification algorithms (Naïve Bayes, multilayer perceptron, decision tree induction). All conditions and configurations were prepared to make sure that the experiment was conducted under the same conditions for each algorithm.

2- The objectives of the experiment

The main objective of this experiment is to evaluate three selected data mining classification algorithms which are commonly implemented in medical decision support systems. After obtaining the results, it is important to compare the performances of the algorithms and try to identify the most suitable and powerful one for the extraction of knowledge from medical data.

The rest of this report covers:

- a brief definition of the algorithms used, together with a presentation of the datasets;
- the experiment conducted under WEKA;


- a comparison between the performances of the algorithms, followed by a conclusion.

3- Data mining classifier algorithms

In medicine, data mining is one of the tools used to search for valuable hidden patterns, thereby giving a more precise diagnosis as well as saving precious time. This part describes the three selected algorithms, which are commonly used by experts in this kind of experiment.

3-1 Decision Tree Induction

Decision trees are one of the most frequently used techniques of data analysis. The advantages of this method are unquestionable: decision trees are, among other things, easy to visualize and understand, and resistant to noise in the data [4]. Commonly, decision trees are used to classify records into a specific class; variants of the technique are applicable to regression tasks as well.

Decision trees have been applied successfully in medicine, for example to classify prostate cancer or breast cancer cases.

    3-2 Multilayers perceptron (Neural networks)

    Neural network is a type ofartificial intelligence that attempts to imitate the way a human

    brain works. Rather than using a digital model, in which all computations manipulate zeros

    and ones, a neural networkworks by creating connections between processing elements, the

    computer equivalent of neurons. The organization and weights of the connections determine

    the output. Neural networks are particularly effective for predicting events when the networks

    have a large database of prior examples to draw on. Strictly speaking, a neural networkimplies a non-digital computer, but neural networks can be simulated on digital computers.

    In medicine diagnosis the input is the symptoms of the patient. The output however, is the

    prediction of different diseases [5].

    A type of an Artificial Neural Network depends on the basic unitperceptron. The

    Perceptron takes an input value vector and outputs 1 if the results are greater than a

    predefined threshold or 1 otherwise. The calculation is based on a linear combination of the

    input values.
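As a minimal illustration of this decision rule (not part of the original experiment; the weights, threshold and inputs below are hypothetical), a single perceptron output could be computed as follows:

    // Minimal perceptron decision rule: sign of a weighted sum compared to a threshold.
    public class PerceptronSketch {

        // Returns 1 if the linear combination of inputs exceeds the threshold, -1 otherwise.
        static int classify(double[] inputs, double[] weights, double threshold) {
            double sum = 0.0;
            for (int i = 0; i < inputs.length; i++) {
                sum += inputs[i] * weights[i];   // linear combination of the input values
            }
            return sum > threshold ? 1 : -1;
        }

        public static void main(String[] args) {
            double[] symptoms = {1.0, 0.0, 3.5};   // hypothetical encoded symptoms
            double[] weights  = {0.4, -0.2, 0.1};  // hypothetical learned weights
            System.out.println(classify(symptoms, weights, 0.5));  // prints 1, since 0.75 > 0.5
        }
    }

In a multilayer perceptron, many such units are arranged in layers and trained together; the sketch above only shows the decision rule of one unit.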

3-3 Naïve Bayes

Naïve Bayes is a simple probabilistic classifier. It is based on an assumption of mutual independence of the attributes (independent variable, independent feature model). Usually this assumption is far from true, and this is the reason for the "naivety" of the method [5]. The probabilities applied in the Naïve Bayes algorithm are calculated according to the Bayes rule: the probability of a hypothesis H can be calculated on the basis of the prior probability of H and the evidence E about the hypothesis, according to the following formula:

    P(H|E) = P(E|H) * P(H) / P(E)
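As a small worked example with invented numbers (not taken from the datasets used in this report): if a disease has prior probability P(H) = 0.01, the symptom is observed in 90% of sick patients, P(E|H) = 0.9, and in 8% of all patients, P(E) = 0.08, then P(H|E) = 0.9 * 0.01 / 0.08 ≈ 0.11; the evidence raises the estimated probability of the disease from 1% to about 11%.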

In practice, Naïve Bayes has shown that it can work efficiently in real situations; for instance, in medicine it has been used in the diagnosis and treatment of pneumonia.


4- Description of the datasets

To conduct the experiment, four databases were taken into consideration. This part discusses the source of the databases in addition to their details.

The databases were taken from the well-known UCI medical data repository. The reason for this choice is that the databases vary from one to another and were collected by different clinics or hospitals, which permits us to compare and evaluate the performances of the algorithms under realistic conditions. Another reason for choosing UCI is to let other people run the same experiment and compare their results.

4-1 Databases description

Here is a detailed description of the four databases chosen for the experiment, all taken from UCI, together with the values of every database.

4-1-1 Breast cancer database

The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

The following table contains the details of the breast cancer database.

    Dataset         Attributes   Symptoms   Instances   Classes   Attribute type   Missing values
    Breast cancer   11           9          699         2         Integer          Yes

Table 4-1: Breast cancer database details

The following table summarises the values of the database cited above.

    Symptom name                   Values of the attributes
    Patient id number              e.g. 1000025
    Clump Thickness                1 - 10
    Uniformity of Cell Size        1 - 10
    Uniformity of Cell Shape       1 - 10
    Marginal Adhesion              1 - 10
    Single Epithelial Cell Size    1 - 10
    Bare Nuclei                    1 - 10
    Bland Chromatin                1 - 10
    Normal Nucleoli                1 - 10
    Mitoses                        1 - 10
    Diagnosis (classes)            2: malignant, 4: benign

Table 4-2: Attributes and their possible values for the Breast cancer database


4-1-2 Hepatitis database

The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data

The following table contains the details of the hepatitis database.

    Dataset     Attributes   Symptoms   Instances   Classes   Attribute type   Missing values
    Hepatitis   20           19         155         2         Integer, Real    Yes

Table 4-3: Hepatitis database details

The following table summarises the values of the database cited above.

    Symptom name           Values of the attributes
    Age                    10, 20, 30, 40, 50, ...
    Sex                    1: male, 0: female
    Antivirals             1: yes, 0: no
    Fatigue                1: yes, 0: no
    Anorexia               1: yes, 0: no
    Liver big              1: yes, 0: no
    Liver firm             1: yes, 0: no
    Malaise                1: yes, 0: no
    Steroid                1: yes, 0: no
    Spleen palpable        1: yes, 0: no
    Spiders                1: yes, 0: no
    Ascites                1: yes, 0: no
    Histology              1: yes, 0: no
    Varices                1: yes, 0: no
    Bilirubin              0.39, 0.40, 0.41, ..., 4
    Alk Phosphate          30, 90, 130, ..., 220, 300
    SGOT                   13, 100, 200, 400, 500
    Albumin                2.1, 3.0, 3.8, 4.5, 5.0, 6.0
    Protime                10, 20, 30, 40, 50, 60, 70, 80, 90
    Diagnosis (classes)    1: alive, 2: die

Table 4-4: Attributes and their possible values for the Hepatitis database

4-1-3 Heart disease database

The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data

The following table contains the details of the heart disease database.


    Dataset         Attributes   Symptoms   Instances   Classes   Attribute type   Missing values
    Heart disease   14           13         303         5         Integer, Real    Yes

Table 4-5: Heart disease database details

The following table summarises the values of the database cited above.

    Symptom name                                          Values of the attributes
    Age                                                   10, 20, 30, 40, 50, ...
    Sex                                                   1: male, 0: female
    Chest pain type                                       1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic
    Resting blood pressure (mm Hg)                        94 - 200
    Serum cholesterol (mg/dl)                             126 - 654
    Fasting blood sugar > 120 mg/dl                       1: yes, 0: no
    Resting electrocardiographic results                  0, 1, 2
    Maximum heart rate achieved                           70 - 200
    Exercise induced angina                               1: yes, 0: no
    ST depression induced by exercise relative to rest    0, 0.1, ..., 6.01
    Slope of the peak exercise ST segment                 1, 2, 3
    Number of major vessels colored by fluoroscopy        1, 2, 3
    Thal                                                  3: normal, 6: fixed defect, 7: reversible defect
    Diagnosis (classes)                                   0, 1, 2, 3, 4

Table 4-6: Attributes and their possible values for the Heart disease database

4-1-4 Diabetes database

The data were taken from the following source:
http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

The following table contains the details of the diabetes database.

    Dataset    Attributes   Symptoms   Instances   Classes   Attribute type   Missing values
    Diabetes   9            8          768         2         Integer, Real    Yes

Table 4-7: Diabetes database details


The following table summarises the values of the database cited above.

    Symptom name                        Values of the attributes
    Age                                 10, 20, 30, 40, 50, ...
    Number of times pregnant            0, 1, 2, 3, 4, ...
    Plasma glucose concentration        0 - 199
    Diastolic blood pressure (mm Hg)    24 - 122
    Triceps skin fold thickness (mm)    7 - 99
    Serum insulin (mu U/ml)             14 - 850
    Body mass index (kg/m2)             18 - 68
    Diabetes pedigree function          0.01, 0.02, ..., 0.4, ...
    Diagnosis (classes)                 0: non-diabetic, 1: diabetic

Table 4-8: Attributes and their possible values for the Diabetes database

5- The Experiment

5-1 WEKA

WEKA is a data mining program developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. WEKA is a facility for developing machine learning (ML) techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks, and the algorithms can be applied directly to a dataset. WEKA implements algorithms for data pre-processing, classification, regression, clustering and association rules, in addition to visualization tools. New machine learning schemes can also be developed with this package. WEKA is open source software issued under the General Public License (GPL) [6].

The decision tree algorithm used is C4.5, which in WEKA is called J48. It has a pruning feature that removes sections of the classifier which are based on noisy data, thereby reducing overfitting.

The Naïve Bayes classifier is simply called NaiveBayes in WEKA. It is a simple algorithm, known for its good performance especially on large data sets, because it treats all attributes as conditionally independent.

The neural network algorithm used is the multilayer perceptron; it consists of layers of nodes connected to one another, where every node except those in the input layer is a neuron. It generally classifies well, especially on numeric attributes.
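As a rough sketch of how such an experiment can be set up with WEKA's Java API (assuming a recent WEKA distribution on the classpath and an ARFF copy of one of the datasets, here hypothetically named breast-cancer-wisconsin.arff with the class as the last attribute), the three classifiers can be instantiated and trained as follows:

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MdssExperiment {
        public static void main(String[] args) throws Exception {
            // Load one of the medical datasets (hypothetical ARFF file name).
            Instances data = new DataSource("breast-cancer-wisconsin.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);   // the class is the last attribute

            // The three classifiers compared in this report, with default parameters.
            Classifier[] classifiers = { new NaiveBayes(), new MultilayerPerceptron(), new J48() };
            for (Classifier c : classifiers) {
                long start = System.currentTimeMillis();
                c.buildClassifier(data);                    // train on the full training set
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(c.getClass().getSimpleName() + " built in " + elapsed + " ms");
            }
        }
    }

The same three classes are what the WEKA Explorer uses behind the scenes when J48, NaiveBayes and MultilayerPerceptron are selected in the graphical interface.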


5-2 Method for evaluating the algorithms

There are several methods for analysing and extracting knowledge from data. In our case, a medical decision support system, the rates of correct and incorrect diagnoses should be analysed. This classification is taken seriously into consideration because of the impact a wrong decision might have. In medical records a classification is not a certainty but a prediction: we cannot state with certainty whether or when a person will die, or whether a person has cancer.

It is crucial that the experiment follow fixed conditions and measures, because one of its objectives is to determine which parameter settings yield the best model. Another aim is to find an optimal setting in order to maximise performance.

During the experiment we used the default data splits and k-fold cross-validation. The whole training data set was also kept in order to compare the algorithms in a fair environment. Regarding the split, the more data is used for training, the better the model is built; at the same time, the more data is used for testing, the more accurate the estimate of the result. So it is important to divide the data set into two parts, training data and testing data, but what should the proportion of each be? It is commonly reported that good results are obtained when 66% of the data is used to build the model and the remaining 34% is used only for testing. For cross-validation, experience suggests that the best results are obtained with 10 folds (that is, the data set is split into 10 equal pieces).

Every single algorithm was analysed under the same data split and cross-validation parameters. In order to achieve a fairer result, all the algorithms were initialised with the same learning parameters throughout the experiment.

The following measures were recorded during the experiment (a minimal evaluation sketch in WEKA's Java API is given after this list):

- Correctly classified instances
- Incorrectly classified instances
- Mean absolute error (MAE): this error is tightly related to classification accuracy, so if the MAE increases the classification accuracy decreases, and vice versa
- Root mean squared error (RMSE): almost the same idea as the MAE, but the RMSE is considered a better indicator of classification quality than the MAE
- Precision: the probability of being correct, given a decision
- Time of execution of the algorithm
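As a hedged sketch of how these measures can be obtained programmatically (the ARFF file name and random seed below are illustrative assumptions, not values taken from the report), the 66%/34% split and the 10-fold cross-validation could be run as follows:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EvaluationSketch {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("hepatitis.arff").getDataSet();  // hypothetical ARFF name
            data.setClassIndex(data.numAttributes() - 1);

            // 66% / 34% percentage split.
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(train);
            Evaluation splitEval = new Evaluation(train);
            splitEval.evaluateModel(nb, test);
            System.out.printf("Split  : %.2f%% correct, MAE %.4f, RMSE %.4f%n",
                    splitEval.pctCorrect(), splitEval.meanAbsoluteError(), splitEval.rootMeanSquaredError());

            // 10-fold cross-validation on a fresh, untrained classifier.
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.printf("10-fold: %.2f%% correct, precision(class 0) %.3f%n",
                    cvEval.pctCorrect(), cvEval.precision(0));
        }
    }

The exact figures produced by such a run depend on the random seed and the WEKA version, so they will not necessarily match the tables below digit for digit.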


5-3 The application of the algorithms to the data sets

5-3-1 Breast cancer database

                               Naïve Bayes                       Multilayer Perceptron             Decision Tree (C4.5)
    Method                     Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV
    Correctly classified       96.13%   95.37%     95.99%        99.28%   95.79%     95.85%        98.56%   95.37%     94.56%
    Incorrectly classified      3.86%    4.62%      4.00%         0.71%    4.20%      4.14%         1.43%    4.62%      5.43%
    Mean absolute error         3.85%    4.73%      4.03%         1.79%    4.71%      4.72%         2.83%    6.48%      6.91%
    Root mean squared error    19.28%   21.23%     19.83%         8.66%   19.04%     19.15%        11.79%   21.14%     22.28%
    Precision                  96.3%    95.4%      96.2%         99.3%    95.8%      95.9%         98.6%    95.4%      94.6%
    Time (seconds)              0.13     0.03       0.03         13.7     13.54      13.42          0.23     0.11       0.09

Table 5-1: The performances of the three algorithms applied to the Breast cancer database

The table shows that the three algorithms performed well on the breast cancer database; the lowest rate of correct classification is recorded for the C4.5 algorithm. This reflects good precision as well as a small margin of error. The highest precision reaches 99%, while the lowest is about 94%, which matches exactly the results obtained for classification. The smallest root mean squared error under cross-validation belongs to the multilayer perceptron with 19.15%; Naïve Bayes comes next with 19.83%. Overall, the multilayer perceptron is slightly better than the others in terms of results, followed by Naïve Bayes and then the C4.5 algorithm. However, when it comes to execution time, Naïve Bayes is by far the winner with 0.03 seconds.
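For reference, results of this kind can also be reproduced from the WEKA command line; a hedged example (the ARFF file name is an assumption, and the option set may vary slightly between WEKA versions) for the J48 tree with 10-fold cross-validation would look like:

    java -cp weka.jar weka.classifiers.trees.J48 -t breast-cancer-wisconsin.arff -x 10 -i

Here -t names the training file, -x the number of cross-validation folds, and -i requests per-class precision and recall in the output.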

5-3-2 Hepatitis database

                               Naïve Bayes                       Multilayer Perceptron             Decision Tree (C4.5)
    Method                     Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV
    Correctly classified       70.96%   71.69%     70.96%        96.77%   66.03%     56.77%        82.58%   69.81%     58.06%
    Incorrectly classified     29.03%   28.30%     29.03%         3.22%   33.96%     43.22%        17.41%   30.18%     41.93%
    Mean absolute error        28.19%   30.22%     29.86%         6.12%   36.86%     42.97%        28.21%   38.13%     43.57%
    Root mean squared error    48.65%   51.26%     50.61%        16.51%   54.41%     60.44%        36.59%   46.6%      55.1%
    Precision                  71.1%    74.7%      71.2%         96.8%    66.5%      56.8%         82.8%    72%        57.7%
    Time (seconds)              0.01     0.01       0.02          7.47     7.02       7.74          0.08     0.08       0.08

Table 5-2: The performances of the three algorithms applied to the Hepatitis database

The performances shown in this table are less convincing than those obtained for breast cancer. One reason may be that this database contains a larger proportion of missing values than the breast cancer database. Naïve Bayes is the best classifier for this database, but the results are still poor. The multilayer perceptron obtains the best precision, but only on the training set; under 10-fold cross-validation it gets the worst score, and in that case Naïve Bayes is the best classifier, followed by the C4.5 algorithm.


Regarding execution time, Naïve Bayes again has the shortest time with 0.01 seconds. On the other hand, the multilayer perceptron takes about 7 seconds on average, which is much higher than Naïve Bayes and even the decision tree with its average of 0.08 seconds.

5-3-3 Heart disease database

                               Naïve Bayes                       Multilayer Perceptron             Decision Tree (C4.5)
    Method                     Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV
    Correctly classified       63.36%   52.42%     56.43%        81.84%   54.36%     53.46%        78.54%   46.60%     52.47%
    Incorrectly classified     36.63%   47.57%     43.56%        18.15%   45.63%     46.53%        21.45%   53.39%     47.52%
    Mean absolute error        16.38%   19.8%      18.39%         9.22%   19.24%     19.07%        12.47%   21.79%     21.05%
    Root mean squared error    30.66%   35.92%     33.97%        21.86%   39.9%      38.34%        24.92%   42.42%     40.11%
    Precision                  62.3%    53.3%      54.9%         85%      52.3%      51.8%         77.5%    48.3%      47.5%
    Time (seconds)              0.02     0.02       0.02         13.85    13.01      12.22          0.13     0.11       0.11

Table 5-3: The performances of the three algorithms applied to the Heart disease database

The algorithms were then tested on the third database, heart disease, and the outcome confirms the results of the previous evaluations. As shown above, the algorithms performed well on the breast cancer database and less well on hepatitis. The classification of heart disease is actually the worst result obtained since the start of the experiment: it has the highest proportion of incorrect classifications as well as the lowest percentages of correct classification. Concerning timing, the rates remain stable and, as always, Naïve Bayes is the fastest algorithm. On the other hand, the multilayer perceptron's timing is far behind its two competitors and it is considered the slowest.

5-3-4 Diabetes database

                               Naïve Bayes                       Multilayer Perceptron             Decision Tree (C4.5)
    Method                     Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV    Train    Test 66%   10-fold CV
    Correctly classified       76.30%   77.01%     76.30%        80.59%   74.32%     75.39%        84.11%   76.24%     73.82%
    Incorrectly classified     23.69%   22.98%     23.69%        19.40%   25.67%     24.60%        15.88%   23.75%     26.17%
    Mean absolute error        28.11%   26.6%      28.41%        28.52%   31.86%     29.55%        23.83%   31.25%     31.58%
    Root mean squared error    41.33%   38.22%     41.68%        38.15%   44.45%     42.15%        34.52%   40.59%     44.63%
    Precision                  75.9%    76.7%      75.9%         81.9%    75.6%      75%           84.2%    75.6%      73.5%
    Time (seconds)              0.02     0.02       0.03         12.12    12.96      11.76          0.11     0.11       0.14

Table 5-4: The performances of the three algorithms applied to the Diabetes database


The results of this last experiment are quite satisfying, better than the previous two. Regarding correctly classified instances, the average is around 75% and the lowest value is 73.8%, which is a good level, and the same is observed for precision. The root mean squared error results confirm the classification percentages. Nothing has changed for execution time, which remains similar to the earlier runs: Naïve Bayes comes first, followed by the decision tree, with the multilayer perceptron last.

5-4 The medical prediction

Scientists pay great attention to medical prediction; this is why data mining is being used to improve medicine and the accuracy of diagnosis as well as prediction. The following table summarises a data mining prediction.

                           Survival   Death
    Predicted survival     TP         FP
    Predicted death        FN         TN

Table 5-5: TP, FN, FP and TN prediction table

An interpretation of these values is useful. True Positives (TP) are the patients predicted to survive among those who survived. True Negatives (TN) are the patients predicted to die among those who died. False Positives (FP) are the patients predicted to survive among those who died, and False Negatives (FN) are the patients predicted to die among those who survived.
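From these four counts the evaluation measures used earlier can be derived directly. A minimal sketch follows, using the Naïve Bayes hepatitis counts reported below; note that the accuracy matches the roughly 71% correctly classified instances reported for Naïve Bayes, while the per-class precision shown here differs from the report's precision figure, which WEKA averages over both classes.

    public class ConfusionMatrixSketch {
        public static void main(String[] args) {
            // Counts taken from the Naïve Bayes hepatitis confusion matrix reported below.
            int tp = 70, fp = 15, fn = 30, tn = 40;
            int total = tp + fp + fn + tn;                       // 155 patients

            double accuracy = (double) (tp + tn) / total;        // rate of correctly classified instances
            double survivalPrecision = (double) tp / (tp + fp);  // precision for the "survival" class only

            System.out.printf("Accuracy: %.2f%%%n", 100 * accuracy);
            System.out.printf("Precision (survival class): %.2f%%%n", 100 * survivalPrecision);
        }
    }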

Since the experiment was done on four different databases, one of them was chosen to give a small example of medical prediction. The hepatitis database was selected, and 10-fold cross-validation was used as the evaluation method because of its good precision in the experiments above. The number of patients is 155, so TP + FN + FP + TN = 155. The following results were obtained from the confusion matrices (see Appendix 2).

    Naïve Bayes            Survival       Death
    Predicted survival     70 (45.16%)    15 (9.67%)
    Predicted death        30 (19.35%)    40 (25.80%)

Table 5-6: Prediction for hepatitis by the Naïve Bayes algorithm

    Multilayer Perceptron  Survival       Death
    Predicted survival     51 (32.9%)     34 (21.93%)
    Predicted death        33 (21.29%)    37 (23.87%)

Table 5-7: Prediction for hepatitis by the multilayer perceptron algorithm

    Decision Tree (C4.5)   Survival       Death
    Predicted survival     57 (36.77%)    28 (18.06%)
    Predicted death        37 (23.87%)    33 (21.29%)

Table 5-8: Prediction for hepatitis by the decision tree algorithm


5-5 The comparison of the data mining algorithms applied to medical databases

In this part we analyse and evaluate the results and performances of the algorithms reported in the previous tables. Each algorithm was evaluated individually, which makes it possible to select the most suitable one for the classification of medical data. Three evaluation methods were used, but two of them are of particular interest: the testing set and 10-fold cross-validation. It is hard to declare one test configuration the best, but in the end 10-fold cross-validation was selected among all the configurations. The reason for choosing it is simply its high performance, which is generally claimed to be better than any single split; its popularity for classifying such databases is another reason for preferring it over other configurations.

A comparison can be made by relying on the results presented in the previous tables and in Appendix 1 (visualisation statistics). The results obtained were almost all satisfying, apart from the heart disease results, which are not a disaster but are the worst of the set. Next come hepatitis and diabetes with roughly similar proportions. Finally, the best results belong to breast cancer, which interacts perfectly with the algorithms; this excellent result allows us to deduce that this data set is well suited for training. Regarding the algorithms, Naïve Bayes leads, with the majority of its scores superior to the others, followed by the multilayer perceptron and then the decision tree algorithm in last place. On the training set, the multilayer perceptron and the decision tree beat Naïve Bayes in all cases, but training-set performance is not a good factor for judging the algorithms. On the testing set and under 10-fold cross-validation, however, Naïve Bayes produced excellent results, which leads us to award it first place in this test. The errors reported were quite high; the reason might be the heterogeneity of the medical data and the complexity of the values of each attribute, which can affect classification performance negatively. In our case the multilayer perceptron, and especially the decision tree algorithm, might have been overtrained, given the disappointing results obtained. Surprisingly, the results closest to those of the multilayer perceptron and Naïve Bayes were produced by the decision tree algorithm when training on the hepatitis data set. This result makes us look carefully at the hepatitis data set, whose attributes are almost all binary; we conclude that a binomial data set is a good source of training data for decision tree algorithms. When it comes to execution time, Naïve Bayes again is by far the winner without exception, followed by the decision tree and then, far behind these two, the multilayer perceptron.

The final conclusion drawn from this comparison is that Naïve Bayes is the most suitable algorithm for medical data sets in terms of classification, in both timing and performance.


Conclusion

As mentioned in the introduction, a huge amount of data is gathered daily and stored in medical databases. These databases may contain nontrivial dependencies between symptoms and diagnoses. The processing of this data can be carried out by medical systems, which make it easier to uncover unclear medical results. With better diagnostic and predictive knowledge, it becomes much easier for doctors to give an accurate diagnosis, and in a very short time, for future cases.

The objective of the experiment was to identify and evaluate the performance of the most suitable data mining algorithms implemented in a modern Medical Decision Support System. Three algorithms were chosen (Naïve Bayes, multilayer perceptron, decision tree induction) and four datasets were selected (breast cancer, hepatitis, heart disease, diabetes) to conduct the experiment. It was crucial to use the same measures and configuration as far as possible in order to obtain a fair classification, and this was realised successfully.

While the average classification accuracy of Naïve Bayes is somewhat higher than that of the decision tree and the multilayer perceptron, it would not be fair to conclude outright that Naïve Bayes is a better classifier than the decision tree and the multilayer perceptron. It is suggested, however, that the Naïve Bayes classifier has the potential to significantly improve on conventional classification methods in the medical and bioinformatics sectors.


References

[1] Huang, H. et al., "Business rule extraction from legacy code", Proceedings of the 20th International Conference on Computer Software and Applications (IEEE COMPSAC'96), 1996, pp. 162-167.

[2] Chae, Y. M., Kim, H. S., Tark, K. C., Park, H. J., Ho, S. H., "Analysis of healthcare quality indicator using data mining and decision support system", Expert Systems with Applications, 2003, pp. 167-172.

[3] Duch, W., Grabczewski, K., Adamczak, R., Grudzinski, K., Hippe, Z. S., "Rules for melanoma skin cancer diagnosis", 2001, http://www.phys.uni.torun.pl/publications/kmk/, retrieved on 4.04.2007.

[4] Witten, I. H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Elsevier, 2005.

[5] Kamila, A., "Evaluation of selected data mining algorithms implemented in Medical Decision Support Systems", September 2007.

[6] WEKA, available online at http://www.cs.waikato.ac.nz/~ml/weka


Appendix

Appendix 1: Evaluation of the performance of the data mining algorithms on the medical databases (10-fold cross-validation)

[Bar charts, omitted here, compare Naïve Bayes, the multilayer perceptron and C4.5 across the four databases on: correctly classified instances, incorrectly classified instances, mean absolute error (MAE), root mean squared error (RMSE) and precision, each on a 0-100% scale.]



Appendix 2: Statistics for predicting the diagnosis of the hepatitis data set (10-fold cross-validation)

Naïve Bayes
    Total number of instances           155
    Correctly classified instances      110 (70.96%)
    Incorrectly classified instances    45 (29.03%)
    Kappa statistic                     0.4026
    Mean absolute error                 0.2986
    Root mean squared error             0.5061
    Relative absolute error             60.26%
    Root relative squared error         101.69%
    Time taken to build model           0.02 seconds
    Precision                           71.2%
    Number of classes                   2

    Confusion matrix       Survival       Death
    Predicted survival     70 (45.16%)    15 (9.67%)
    Predicted death        30 (19.35%)    40 (25.80%)

Multilayer perceptron
    Total number of instances           155
    Correctly classified instances      88 (56.77%)
    Incorrectly classified instances    67 (43.22%)
    Kappa statistic                     0.1248
    Mean absolute error                 0.4297
    Root mean squared error             0.6044
    Relative absolute error             86.72%
    Root relative squared error         121.11%
    Time taken to build model           7.44 seconds
    Precision                           56.8%
    Number of classes                   2

    Confusion matrix       Survival       Death
    Predicted survival     51 (32.9%)     34 (21.93%)
    Predicted death        33 (21.29%)    37 (23.87%)


Decision tree induction (C4.5)
    Total number of instances           155
    Correctly classified instances      90 (58.06%)
    Incorrectly classified instances    65 (41.93%)
    Kappa statistic                     0.1436
    Mean absolute error                 0.551
    Root mean squared error             0.6044
    Relative absolute error             87.93%
    Root relative squared error         110.71%
    Time taken to build model           0.08 seconds
    Precision                           57.7%
    Number of classes                   2

    Confusion matrix       Survival       Death
    Predicted survival     57 (36.77%)    28 (18.06%)
    Predicted death        37 (23.87%)    33 (21.29%)