Tuberculosis Detection and Localization from Chest X-Ray Images
using Deep Convolutional Neural Networks
By
Ruihua Guo
A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science (MSc) in Computational Sciences
The Faculty of Graduate Studies
Laurentian University
Sudbury, Ontario, Canada
© Ruihua Guo, 2019
THESIS DEFENCE COMMITTEE/COMITÉ DE SOUTENANCE DE THÈSE
Laurentian Université/Université Laurentienne
Faculty of Graduate Studies/Faculté des études supérieures
Title of Thesis
Titre de la thèse Tuberculosis Detection and Localization from Chest X-Ray Images using Deep
Convolutional Neural Networks
Name of Candidate
Nom du candidat Guo, Ruihua
Degree
Diplôme Master of Science
Department/Program
Département/Programme Computational Sciences
Date of Defence
Date de la soutenance October 02, 2019
APPROVED/APPROUVÉ
Thesis Examiners/Examinateurs de thèse:
Dr. Kalpdrum Passi
(Supervisor/Directeur(trice) de thèse)
Dr. Ratvinder Grewal
(Committee member/Membre du comité)
Dr. Meysar Zeinali
(Committee member/Membre du comité)
Approved for the Faculty of Graduate Studies
Approuvé pour la Faculté des études supérieures
Dr. David Lesbarrères
Monsieur David Lesbarrères
Dean, Faculty of Graduate Studies/Doyen, Faculté des études supérieures
Dr. Pradeep Atrey
(External Examiner/Examinateur externe)
ACCESSIBILITY CLAUSE AND PERMISSION TO USE
I, Ruihua Guo, hereby grant to Laurentian University and/or its agents the non-exclusive license to archive and make
accessible my thesis, dissertation, or project report in whole or in part in all forms of media, now or for the duration
of my copyright ownership. I retain all other ownership rights to the copyright of the thesis, dissertation or project
report. I also reserve the right to use in future works (such as articles or books) all or part of this thesis, dissertation,
or project report. I further agree that permission for copying of this thesis in any manner, in whole or in part, for
scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their
absence, by the Head of the Department in which my thesis work was done. It is understood that any copying or
publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written
permission. It is also understood that this copy is being made available in this form by the authority of the copyright
owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted
by the copyright laws without written authority from the copyright owner.
Abstract
Tuberculosis (TB) is a highly contagious lung disease that has remained a major cause of
death worldwide over the past few decades. As the most efficient and cost-effective imaging
method for medical purposes, chest X-Rays (CXRs) have been widely used as the preliminary tool
for diagnosing TB. The automatic detection of TB and the localization of suspected areas which
contain the disease manifestations with high accuracy will greatly improve the general quality of
the diagnosis processes. This thesis discusses and introduces some methods to improve the
accuracy and stability of different deep convolutional neural network (CNN) models (VGG16,
VGG19, Inception V3, ResNet34, ResNet50 and ResNet101) that are used for TB detection. The
proposed method includes three major processes: modifications on CNN model structures, model
fine-tuning via artificial bee colony algorithm, and the implementation of the ensemble CNN
model. Comparisons of the overall performance are made for all three stages among various CNN
models on three CXR datasets (Montgomery County Chest X-Ray dataset, Shenzhen Hospital
Chest X-Ray dataset and NIH Chest X-Ray8 dataset). The tested performance includes the
detection of abnormalities in CXRs and the diagnosis of different manifestations of TB. Moreover,
class activation mapping is employed to visualize the localization of the detected manifestations
on CXRs and make the diagnosis results visually convincing. The proposed methods can assist
doctors and radiologists in making well-informed decisions during the detection of TB.
Keywords
Tuberculosis, chest X-ray, automatic detection, deep CNN model, artificial bee colony algorithm,
ensemble CNN model, class activation mapping
Acknowledgments
First of all, I would like to express my sincere thanks to my supervisor, Dr. Kalpdrum Passi, for
the time, advice, and resources he has generously shared with me. In the early stages of my master's
study, he provided me with insightful guidance on the direction of my research. Throughout my
studies, his great patience and consistent support helped me overcome many difficulties
step by step. Without him, I would not have been able to complete this work.
Secondly, I want to thank my fantastic lab mate, Stefan Klaassen. Thank you for spending
time brainstorming ideas that enriched my research and for making the lab an enjoyable
place to work throughout the years. I would also like to give my special thanks to my boyfriend,
Yukun Shi, who guided me through the problems I encountered and comforted
me with great patience when I was feeling low.
Finally, I want to express my appreciation to my parents. With their understanding, support, and
consistent encouragement, both emotional and financial, I found the courage and
determination to pursue my second master's degree in Canada. There is no me without you.
Table of Contents
Abstract……………………………………………............................. iii
Acknowledgments…………………………………………………… v
Table of Contents…………………………………………………….. vi
List of Figures………………………………………………………... ix
List of Tables…………………………………………………………. xi
1 Introduction…………………………………………………………. 1
1.1 Research Background and Motivations…………………………... 1
1.2 Computer-aided Detection of Medical Images………………........ 2
1.3 Medical Image Preprocessing…………………………………….. 4
1.4 Image Classification and Disease Localization Techniques……… 4
1.5 Thesis Objectives and Outlines…………………………………... 6
2 Literature Review…………………………………………………... 8
3 Chest X-Ray Datasets and Image Enhancement Methods………. 13
3.1 Dataset Selection………………………………………………… 13
3.1.1 Montgomery County Chest X-Ray Dataset……………. 13
3.1.2 Shenzhen Hospital Chest X-Ray Dataset………………. 14
3.1.3 NIH Chest X-Ray8 Dataset…………………………….. 15
3.2 Image Enhancement Methods……………………………………. 18
3.2.1 Histogram Equalization (HE)…………………………... 18
3.2.2 Contrast Limited Adaptive Histogram Equalization (CLAHE)… 19
4 Deep Learning Models For Image Classification…………………. 22
4.1 Deep Learning……………………………………………………. 22
4.2 Convolutional Neural Networks (CNN)………………………….. 23
4.2.1 Basic CNN Structure……………………………………. 24
4.2.2 CNN Working Mechanism……………………………… 28
4.2.3 Explicit Training of CNN……………………………….. 30
4.3 Significance of Applying CNN in Medical Image Recognition…... 31
4.4 Advanced CNN Models Used in the Experiment…………………. 32
4.4.1 VGGNet………………………………………………… 32
4.4.2 GoogLeNet Inception Model…………………………… 33
4.4.3 ResNet…………………………………………………... 34
5 TB Detection via Improved CNN Models and Artificial Bee
Colony Fine-Tuning………………………………………………… 36
5.1 Transfer Learning…………………................................................ 36
5.2 Modifications of Advanced CNN Models………………………... 37
5.2.1 Modifications on CNN Architecture……………………. 37
5.2.2 Model Division with Different Learning Rate…………... 38
5.3 Fine-Tuning the CNN Model via Artificial Bee Colony………….. 39
5.3.1 Artificial Bee Colony…………………………………… 39
5.3.2 CNN Model Fine-Tuning via Artificial Bee Colony……. 43
5.4 Experiment Settings……………………………………………… 46
5.4.1 Experiment Description………………………………… 46
5.4.2 Ratio Comparison………………………………………. 47
5.4.3 Parameter Setting………………………………………. 48
5.5 Results and Discussion…………………………………………… 50
5.5.1 CNN Modification and Division with Different Learning Rates… 50
5.5.2 Fine-Tuning the Modified CNN Models via ABC……… 61
5.5.3 Discussion and Conclusion……………………………… 66
6 Increasing Accuracy of TB Detection via Ensemble Model………. 67
6.1 Ensemble Learning……………………………………………… 67
6.2 Ensemble Combination Methods Used for TB Detection……….. 69
6.2.1 Linear Averaged Based Ensemble……………………… 69
6.2.2 Voting Based Ensemble………………………………… 70
6.3 Experiment Descriptions…………………………………………. 71
6.4 Evaluation Metrics……………………………………………….. 71
6.5 Results and Discussion………..………………………………….. 72
6.5.1 Lung Abnormality Detection……………………………. 72
6.5.2 TB Related Disease Diagnosis………………………….. 85
6.6 Conclusion……………………………………………………….. 89
7 Disease Localization via Class Activation Mapping………………. 90
7.1 Class Activation Mapping………………………………………... 91
7.2 Experiment Setup………………………………………………… 93
7.3 Results and Analysis……………………………………………… 93
7.4 Conclusion……………………………………………………….. 102
8 Conclusion and Future Work……………………………………… 103
8.1 Conclusions………………………………………………………. 103
8.2 Future Work………………………………………………………. 106
References…………………………………………………………… 107
Appendix…………………………………………………………….. 113
Appendix A: Public Chest X-Ray Dataset…………………………… 113
Appendix B: Thesis Source Code…………………………………….. 113
List of Figures
Figure 1-1 TB Diagnosis Pipeline……………………………………….. 6
Figure 3-1 Sample Raw Images in Montgomery County CXR Dataset………….. 13
Figure 3-2 Sample Clinical Readings for CXR Images…………………………. 14
Figure 3-3 Sample Raw Images in Shenzhen Hospital CXR Dataset……………. 14
Figure 3-4 Sample Raw Images in NIH ChestX-Ray8 Dataset………………….. 15
Figure 3-5 Sample images with bad quality in NIH ChestX-Ray8 dataset………. 16
Figure 3-6 Sample CXR and its enhanced results together with the
corresponding histogram…………………………………………….. 20
Figure 4-1 CNN structural evolution map………………………………………. 24
Figure 4-2 CNN structure based on LeNet-5……………………………………. 25
Figure 4-3 Process of convolution………………………………………………. 25
Figure 4-4 Classic pooling working principles………………………………….. 27
Figure 4-5 Fully connected layer neuron schematic diagram……………………. 28
Figure 4-6 Inception model with dimension reduction………………………….. 34
Figure 4-7 Shortcut connection of the residual block……………………………. 35
Figure 5-1 Modifications on CNN architecture…………………………………. 37
Figure 5-2 CNN division with different learning rates…………………………... 38
Figure 5-3 Flowchart of artificial bee colony algorithm………………………… 42
Figure 5-4 Fine-tune the trained CNN model via artificial bee colony algorithm.. 45
Figure 5-5 Averaged Accuracy Comparison on Montgomery County CXR
Dataset for Abnormality Detection………………………………….. 53
Figure 5-6 Averaged Accuracy Comparison on Shenzhen Hospital CXR Dataset
for Abnormality Detection…………………………………………... 55
Figure 5-7 Averaged Accuracy Comparison on NIH Chest X-Ray8 Dataset for
Abnormality Detection……………………………………………… 58
Figure 5-8 Averaged Accuracy Comparison on NIH Chest X-Ray8 Dataset for
TB Related Disease Detection……………………………………….. 60
Figure 6-1 Ensemble Model Structure…………………………………………... 68
Figure 6-2 Overfitted model and linear averaged model………………………… 70
Figure 7-1 Ensemble model structure…………………………………………… 92
Figure 7-2 Diagnosis and localization of consolidation…………………………. 94
Figure 7-3 Diagnosis and localization of effusion………………………………. 95
Figure 7-4 Diagnosis and localization of fibrosis……………………………….. 96
Figure 7-5 Diagnosis and localization of infiltration……………………………. 97
Figure 7-6 Diagnosis and localization of mass………………………………….. 98
Figure 7-7 Diagnosis and localization of nodule………………………………… 99
Figure 7-8 Diagnosis and localization of pleural thickening…………………….. 100
List of Tables
Table 5-1 Hardware Deployments……………………………………………... 48
Table 5-2 Image Separations for Abnormality Diagnosis……………………… 48
Table 5-3 Original CXR Distribution with Specific TB Manifestations in Chest
X-Ray8………………………………………………………………. 49
Table 5-4 Augmented CXR Distribution in Chest X-Ray8 for TB Manifestations
Diagnosis……………………………………………. 51
Table 5-5 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=7:3 on Montgomery County CXR Dataset for Abnormality
Detection…………………………………………………………….. 51
Table 5-6 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=8:2 on Montgomery County CXR Dataset for Abnormality
Detection…………………………………………………………… 52
Table 5-7 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=9:1 on Montgomery County CXR Dataset for Abnormality
Detection…………………………………………………………….. 52
Table 5-8 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=7:3 on Shenzhen Hospital CXR Dataset for Abnormality
Detection……………………………………………………………... 54
Table 5-9 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=8:2 on Shenzhen Hospital CXR Dataset for Abnormality
Detection…………………………………………………………….. 54
Table 5-10 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=9:1 on Shenzhen Hospital CXR Dataset for Abnormality
Detection…………………………………………………………….. 55
Table 5-11 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=7:3 on NIH Chest X-Ray8 Dataset for Abnormality
Detection…………………………………………………………….. 56
Table 5-12 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=8:2 on NIH Chest X-Ray8 Dataset for Abnormality
Detection…........................................................................................... 57
Table 5-13 Valid Accuracy for Each Epoch During Training with Train/Valid
Ratio=9:1 on NIH Chest X-Ray8 Dataset…………………………... 57
Table 5-14 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=7:3 on NIH Chest X-Ray8 Dataset for TB Related Disease
Detection…………………………………………………………….. 59
Table 5-15 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=8:2 on NIH Chest X-Ray8 Dataset for TB Related Disease
Detection…………………………………………………………….. 59
Table 5-16 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=9:1 on NIH Chest X-Ray8 Dataset for TB Related Disease
Detection…………………………………………………………….. 60
Table 5-17 Ratio Validation and Testing Accuracy Results on Montgomery
County Chest X-Ray Dataset for Abnormality Detection……………. 62
Table 5-18 Ratio Validation and Testing Accuracy Results on Shenzhen Hospital
Chest X-Ray Dataset for Abnormality Detection……………. 63
Table 5-19 Ratio Validation and Testing Accuracy Results on NIH Chest X-Ray8
Dataset for Abnormality Detection…………………………………... 64
Table 5-20 Ratio Validation and Testing Accuracy Results on NIH Chest X-Ray8
Dataset for TB Related Disease Detection…………………………… 65
Table 6-1 Ratio Validation and Testing Accuracy Results on Montgomery Chest
X-Ray Dataset……………………………………………………….. 73
Table 6-2 Statistical Model Analysis with Train/Valid Ratio = 7:3 on
Montgomery Chest X-Ray Dataset…………………………………... 74
Table 6-3 Statistical Model Analysis with Train/Valid Ratio = 8:2 on
Montgomery Chest X-Ray Dataset………………………………….. 75
Table 6-4 Statistical Model Analysis with Train/Valid Ratio = 9:1 on
Montgomery Chest X-Ray Dataset………………………………….. 76
Table 6-5 Ratio Validation and Testing Accuracy Results on Shenzhen Hospital
Chest X-Ray Dataset…………………………………………………. 77
Table 6-6 Statistical Model Analysis with Train/Valid Ratio = 7:3 on Shenzhen
Hospital Chest X-Ray Dataset……………………………………….. 78
Table 6-7 Statistical Model Analysis with Train/Valid Ratio = 8:2 on Shenzhen
Hospital Chest X-Ray Dataset………………………………………. 79
Table 6-8 Statistical Model Analysis with Train/Valid Ratio = 9:1 on Shenzhen
Hospital Chest X-Ray Dataset……………………………………….. 80
Table 6-9 Ratio Validation and Testing Accuracy Results on NIH Chest X-Ray8
Dataset……………………………………………………………….. 81
Table 6-10 Statistical Model Analysis with Train/Valid Ratio = 7:3 on Chest X-
Ray8 Dataset…………………………………………………………. 82
Table 6-11 Statistical Model Analysis with Train/Valid Ratio = 8:2 on NIH Chest
X-Ray8 Dataset………………………………………………………. 83
Table 6-12 Statistical Model Analysis with Train/Valid Ratio = 9:1 on NIH Chest
X-Ray8 Dataset………………………………………………………. 84
Table 6-13 Ratio Validation Accuracy and Testing Results on NIH Chest X-Ray8
Dataset for Specific TB Related Disease Diagnosis…………………. 85
Table 6-14 AUC Scores with Train/Valid Ratio = 7:3 on NIH Chest X-Ray8
Dataset……………………………………………………………….. 86
Table 6-15 AUC Scores with Train/Valid Ratio = 8:2 on NIH Chest X-Ray8
Dataset……………………………………………………………….. 87
Table 6-16 AUC Scores with Train/Valid Ratio = 9:1 on NIH Chest X-Ray8
Dataset……………………………………………………………….. 88
Chapter 1
Introduction
1.1 Research Background and Motivations
Tuberculosis (TB) is one of the most common and deadliest infectious diseases worldwide. In
modern society, TB has become the leading infectious killer, followed by malaria and
HIV/AIDS.
According to the World Health Organization (WHO), in 2017, 10 million people fell ill with TB
and 1.6 million died from the disease [1]. Most of the people who developed TB
come from developing countries with poor healthcare resources and medical infrastructure. Over
half of the deaths occurred because of late detection, which caused patients to miss the best
therapeutic opportunity. Research shows that with earlier diagnosis of TB and proper treatment,
the death rate can be greatly reduced. Thus, early detection of TB is critical for improving
disease prevention, mitigating disease transmission, and minimizing the death rate.
For the detection of TB, the best-known imaging method is computed tomography (CT). However,
considering the radiation dose, cost, and availability, as well as the ability to reveal unsuspected
pathologic alterations, the earliest diagnosis of TB is confirmed via chest X-rays (CXRs) in most
cases. Unlike other lung diseases, which show a single manifestation in CXRs, TB
is a more complicated disease with multiple manifestations such as consolidation [2],
effusion [3], fibrosis [4], infiltration [5], mass [6], nodule [7] and pleural thickening [8]. These large
variations in pathology increase the difficulty of TB detection and therefore influence the accuracy
of the judgment given by doctors. Besides, it is hard to distinguish lung abnormalities
from soft tissues with similar textures without professional training and long-term experience. In
resource-poor and marginalized areas, due to the lack of healthcare funding and advanced
medical infrastructure, the number of professionally trained radiologists is very
limited and the CXRs generated are of low quality, which delays TB detection
and degrades diagnostic accuracy. Moreover, the potential fatigue brought on by the
heavy workload of reading CXRs makes it even harder for human experts to finish their
task with stable efficiency and quality.
Therefore, methods that can reduce the time delay in TB diagnosis while maintaining high quality
at a lower economic cost are needed to improve treatment and minimize the spread of
the disease at an early stage.
1.2 Computer-aided Detection of Medical Images
As the main application of medical image pattern recognition technology, computer-aided
detection (CAD) of medical images is an interdisciplinary technology that employs
principles from various fields such as medical science, computer science, mathematics and
statistics. Relying on the high-speed computing power of modern computers, CAD has become a
powerful automated tool for pattern recognition, medical information processing and analysis.
In general, the application of CAD to preliminary disease diagnosis based on medical images has
three main advantages. First, data processing via computer is more effective: the automatic
processing of quantitative data ensures the quality of the task while greatly improving efficiency.
Secondly, with the development of science and technology, the cost of computing
hardware keeps decreasing while its general performance continues to improve. Therefore,
applying CAD tools for medical purposes is cost-effective, which especially benefits people
living in resource-poor areas. Moreover, different doctors may give different diagnoses for
the same CXR because of their differing medical experience and understanding of
certain diseases. Under the influence of these subjective factors, it is hard to reach an objective and
unbiased judgement. With the assistance of such an intelligent system, medical information exchange
among human experts and doctors in different regions can be achieved, which may help generate
a well-informed decision.
To build a CAD system that achieves satisfying diagnostic results, the selection and extraction
of useful pathogenic features are critical. Over the past few decades, researchers have explored
different algorithms during the development of automated systems. For example, the scale-invariant
feature transform (SIFT) algorithm [9] has been studied and implemented to detect local geometric
features of an image, and local binary patterns (LBP) [10] have been proposed and applied to
extract texture features. However, traditional feature selection algorithms mainly depend on the
manual extraction of patterns that may contain useful information. This manual selection
process is time-consuming. Moreover, as medical image data grow and the number of disease types
keeps increasing, problems such as poor transferability across datasets and
unstable performance on newly generated data have prevented CAD systems from
generating high-quality diagnostic decisions.
Nowadays, with rapid developments in deep learning, deep models have continuously surpassed
traditional recognition algorithms and achieved superhuman performance in image-based
classification and recognition problems. The superior ability to automatically extract useful
features from the inherent characteristics of data makes deep learning the first choice for solving
medical imaging problems. So far, CAD systems embedded with deep learning algorithms have been
widely studied and applied for disease prediction and for highlighting suspicious features to help
maintain the quality of diagnosis.
1.3 Medical Image Preprocessing
Different medical imaging infrastructures provide medical images of different qualities. To build
a CAD system with good transferability and a high quality of diagnosis, image preprocessing has
become an indispensable step for improving the overall quality of the data [11]. The main purpose
of image preprocessing is to enhance features of the images globally by suppressing unwanted
distortions and making the regions of interest more obvious.
Among all image preprocessing methodologies, image enhancement, which encompasses
processes for editing images, has great advantages for the general improvement of grayscale
images. It mainly focuses on highlighting the information of interest within the image, enhancing
the relative sharpness of the image by improving visual effects, and making the image more
conducive to computer processing.
Two image enhancement methods, histogram equalization and contrast limited adaptive histogram
equalization, are applied in our experiment and compared in parallel to select the better
option for the image preprocessing step. Histograms of pixel values for each method have also
been recorded for analysis.
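To make the first of these methods concrete, global histogram equalization can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the thesis implementation; in practice, library routines such as OpenCV's equalizeHist (HE) and createCLAHE (CLAHE) are the usual choices.

```python
import numpy as np

def equalize_hist(img: np.ndarray) -> np.ndarray:
    """Global histogram equalization for an 8-bit grayscale image.

    Each gray level is remapped through the normalized cumulative
    distribution function, spreading pixel intensities over [0, 255].
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-zero CDF value
    # Build a 256-entry lookup table that flattens the histogram.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]
```

CLAHE differs by applying this remapping per tile (e.g. 8×8 regions) with a clip limit on the histogram, which avoids over-amplifying noise in near-uniform lung fields.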
1.4 Image Classification and Disease Localization Techniques
In the deep learning field, disease diagnosis is achieved through medical image classification via
convolutional neural network (CNN) models. In general, a deeper neural network architecture
helps achieve better prediction accuracy. However, with increasing depth and complexity of the
network model, accuracy becomes saturated and then degrades rapidly; this is known as the
vanishing gradient problem [12, 13]. Advanced CNN models such as VGGNet, InceptionNet and
ResNet, which achieved good performance in world image classification competitions,
successfully reduce the negative influence of this problem through their unique model structures.
Thus, improved CNN models based on these advanced architectures, with parameters previously
trained on different categories of everyday objects, become the first choice for building a CAD
system for disease detection.
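The shortcut connection that lets ResNet train very deep networks can be sketched in PyTorch. This is a toy illustration of the general idea, not the thesis's model code: the block learns a residual F(x) and outputs F(x) + x, so gradients can flow through the identity path.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block with an identity shortcut connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        # The shortcut: add the input back before the final activation.
        return self.relu(out + x)
```

Stacking such blocks (with batch normalization and occasional downsampling, omitted here for brevity) yields architectures like ResNet34, ResNet50 and ResNet101.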
Deep learning algorithms require the training and testing data to share the same feature set and
distribution. However, this assumption may not hold in many real-world applications. The
performance of the learners can be enhanced by using knowledge transfer [14, 15], which
transfers knowledge from one medical image dataset to another. In our experiment,
coarse-to-fine transfer learning is used to train CXRs in a specific dataset for TB
diagnosis on different advanced CNN models. Details will be given in Chapters 4, 5 and 6.
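A rough sketch of this transfer-learning setup in PyTorch follows. The tiny backbone below is a hypothetical stand-in for a real pre-trained network (e.g. ResNet50 loaded with ImageNet weights via torchvision); the point illustrated is freezing the pre-trained layers and training only a new classification head on CXRs.

```python
import torch
import torch.nn as nn

# Hypothetical toy backbone standing in for a pre-trained CNN;
# in practice pre-trained weights would be loaded here.
backbone = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(8, 2)  # new head: 2 classes (normal vs. abnormal CXR)

# Freeze the "pre-trained" backbone; only the new head receives updates.
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(backbone, head)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(4, 1, 64, 64)  # a batch of 4 single-channel "CXRs"
logits = model(x)              # per-class scores, shape (4, 2)
```

Fine-tuning later unfreezes some or all backbone layers at a smaller learning rate, which is the coarse-to-fine idea referred to above.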
One big problem when applying CNN models for medical purposes is the “black box” working
mechanism, which makes the results hard to interpret. Moreover, for disease diagnosis via
medical images, people pay attention not only to the predicted result but also to the localization
of suspected abnormalities. In our study, class activation mapping is implemented to highlight the
area containing the diseased region, to make the predicted result easier to understand and thus
further improve the general performance of the CAD system.
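At its core, class activation mapping weights the final convolutional feature maps by the fully connected weights of the target class and sums over channels, producing a heat map over the image. A minimal NumPy sketch, assuming the feature maps and weights have already been extracted from a trained network:

```python
import numpy as np

def class_activation_map(features: np.ndarray,
                         fc_weights: np.ndarray,
                         cls: int) -> np.ndarray:
    """Compute a CAM for class `cls`.

    features   : final conv feature maps, shape (C, H, W)
    fc_weights : weights of the final fully connected layer, shape (n_classes, C)
    Returns a (H, W) map normalized to [0, 1] for overlaying on the CXR.
    """
    # Weighted sum over channels: sum_c w[cls, c] * features[c]
    cam = np.tensordot(fc_weights[cls], features, axes=1)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

The normalized map is typically upsampled to the input resolution and rendered as a colour overlay to localize the suspected manifestation.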
1.5 Thesis Objectives and Outlines
The primary objective of this thesis is to develop an efficient CAD system for the detection and
localization of TB from CXRs. As shown in Figure 1-1, the diagnosis of TB mainly comprises
four steps: CXR image preprocessing, preliminary detection of suspected TB patients through
abnormality checking, identification of the specific TB manifestations, and localization of the
suspected disease on CXRs.
Figure 1-1: TB Diagnosis Pipeline
In this study, we mainly focus on improving the accuracy of disease detection as well as the
stability of the general performance by modifying the structures of six different advanced CNN
models, adding a fine-tuning step via the artificial bee colony algorithm, and ensembling the
trained CNN models with two methods. Moreover, class activation mapping is implemented to
visualize the detection results effectively and efficiently.
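At prediction time, the two ensemble combination methods studied later (linear averaging and voting, Chapter 6) reduce to a few lines. The sketch below assumes each trained model outputs a per-class probability array of shape (samples, classes); it illustrates the combination rules, not the thesis's exact implementation.

```python
import numpy as np

def average_ensemble(prob_list):
    """Linear-average ensemble: mean of per-model class probabilities."""
    return np.mean(prob_list, axis=0)

def vote_ensemble(prob_list):
    """Majority-vote ensemble over each model's argmax predictions."""
    votes = np.argmax(prob_list, axis=2)          # (models, samples)
    n_classes = prob_list[0].shape[1]
    return np.array([
        np.bincount(votes[:, i], minlength=n_classes).argmax()
        for i in range(votes.shape[1])
    ])
```

Averaging tends to smooth out individual models' overconfident errors, while voting is robust when one model's probabilities are poorly calibrated.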
The thesis is organized as follows:
Chapter 1 gives an overall introduction to the thesis, covering the background, motivations,
research work and objectives of TB diagnosis via CXRs. A literature review of previous research
on medical image classification and disease localization is provided in Chapter 2.
The CXR datasets used in this study and the image enhancement methods for image
preprocessing will be introduced in Chapter 3.
In Chapter 4, the basic concepts of deep learning will be presented, followed by a detailed
explanation of CNN model structures and their internal working mechanisms.
Chapter 5 will propose a modified structure based on a series of selected advanced CNN models
(VGG16, VGG19, Inception V3, ResNet34, ResNet50 and ResNet101), and will introduce the
artificial bee colony algorithm, a metaheuristic optimization method used during the
fine-tuning of the trained CNN models.
The concept of ensemble learning, which focuses on improving the stability and general
performance of TB detection, will be introduced in Chapter 6.
Research work on the visualization of disease location from CXRs will be discussed in detail in
Chapter 7.
Conclusions and future work will be presented in Chapter 8.
Chapter 2
Literature Review
The main task of computer-aided diagnosis in the medical field is to assist doctors in the
interpretation of medical images. Pattern recognition technology plays a critical role in
building a disease diagnosis system. Different extraction methods for candidate
features such as texture, homogeneity, contrast, shape and outline, as well as their combinations,
have been studied and applied by researchers to various diseases. For example, Khuzi et al.
employed the gray-level co-occurrence matrix (GLCM), a texture descriptor based on the spatial
relationships between pixel pairs, to identify masses in mammograms [16]. Local binary patterns
have been discussed and implemented in [17] to produce normality/pathology decisions based on
chest X-rays (CXRs). In [18], Yang successfully applied gray-scale invariant features for the
detection of tumors in breast ultrasound images.
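As an illustration of such hand-crafted texture descriptors, the basic 3×3 LBP operator thresholds each pixel's eight neighbours against the centre pixel and packs the results into an 8-bit code. This is a generic sketch of the operator, not necessarily the exact LBP variant used in [17].

```python
import numpy as np

def lbp_3x3(img: np.ndarray) -> np.ndarray:
    """Basic local binary pattern over a 3x3 neighbourhood.

    Each interior pixel receives an 8-bit code: bit k is set when the
    k-th neighbour is >= the centre pixel. Output shape is (H-2, W-2).
    """
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    code = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Neighbours enumerated clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code += (neighbour >= centre).astype(np.uint8) * np.uint8(1 << bit)
    return code
```

A histogram of these codes over an image region then serves as the texture feature fed to a classifier.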
However, traditional feature extraction methods are problem-specific and rely mainly on the
manual processing of medical images. With the constant discovery of new diseases and the
continuously increasing number of CXRs generated every day, the inability to represent
high-level concepts and poor efficiency limit such models' ability to
generalize as the data continue to expand. These problems have been addressed by the advent
of CNNs. In the past decade, researchers have conducted many in-depth studies on the application
of CNN models to disease diagnosis from various types of medical images. In [19], a deep CNN
model was applied to identify distinctions between Alzheimer's and healthy
brains from magnetic resonance imaging (MRI) as well as functional MRI data for both clinical
and research purposes. The paper reported that this architecture provides a significant improvement
by achieving a high and reproducible accuracy rate of over 95%. Benjamin Q. Huynh presented
an improved CNN-based deep learning model in [20] and applied it to digital mammographic
tumor detection; the concept of transfer learning was also proposed and used in that paper.
[21] presented a 3D CNN model for the diagnosis of attention deficit hyperactivity disorder
(ADHD) through the recognition and analysis of local spatial patterns in brain MRIs. As for
lung disease detection, a densely structured deep CNN model was developed and employed
in [22] for the diagnosis of interstitial lung diseases from CT images. The proposed model can
distinguish six different disease manifestations, together with healthy cases, with an accuracy of
over 80%. [23] compared the performance of a CNN against a traditional SVM classifier on
the recognition of five interstitial lung diseases based on CT image data and regions
of interest provided by radiologists. Results showed that the CNN achieved significantly improved
accuracy and greater efficiency compared to the SVM classifier. In [24], Jaiswal et al. implemented
a Mask R-CNN model that combines both global and regional features for the identification
of pneumonia. Their approach achieved a diagnostic accuracy of over 90%, and a visible
localization of the disease is given using the regional information extracted from the bounding
boxes.
Apart from lung diseases with single manifestations, CNNs have also shown good ability in the
diagnosis of TB. In [25], two deep CNN models, AlexNet and GoogLeNet, were applied to
classify the chest radiographs in the Shenzhen Hospital Chest X-Ray dataset and the
Montgomery County Chest X-Ray dataset as pulmonary TB or healthy cases. In that study, the
datasets were split into training, validation and testing sets in proportions of 68.0%, 17.1% and
14.9% respectively. Receiver operating characteristic (ROC) curves and areas under the curve
(AUC) were used to analyze the overall performance statistically. In their experiments, the best
classifier achieved an impressive AUC of 0.99. Later, Pasa et al. in [26] presented the automated
diagnosis as well as the localization of TB on the same two datasets using a deep CNN with
shortcut connections. The best AUC achieved in their experiment was 0.925, not as good as the
previous work, but the localization results they provide are quite impressive. Considering that the
two previously used datasets are small and not representative enough, Hwang et al. [27]
expanded their research from these two datasets to a larger dataset containing over 10,000 images.
In their experiments on diagnosing TB, they achieved accuracies of 83.1%, 83.4% and 83.0%
on the three datasets respectively. However, none of these three works tested the models'
performance on diagnosis among multiple TB-related diseases, a more difficult and more
practical task that needs to be solved. In 2017, [28] proposed a 121-layer dense CNN architecture
and tested the model by training it on the largest publicly available chest radiography dataset at
the time, the NIH ChestX-ray14 dataset, to detect over ten different lung diseases. The
performance achieved by the CNN model was compared to that of radiologists, and the result
measured by the F1 metric was superior to the human experts: the proposed
model achieved an F1 score of 0.435, higher than the average of 0.387
given by the human experts. Yet, without the classification accuracy for each
disease, the robustness of their result is unknown.
Most of the work mentioned above focuses on introducing general applications of CNN to
disease diagnosis via direct implementation, as well as on improving detection accuracy by
employing new methods at certain stages of the classification process or by making changes to the
CNN model structures. Shin, in [29], discussed using off-the-shelf CNN features pre-trained on a
natural image dataset and then training with medical images, fine-tuning the model to complete
lung disease diagnosis tasks. The discussion of major techniques and of the process of applying
CNN to medical image recognition, as well as the demonstration of the potential advantages of
transfer learning, inspired us. In our study, we explore three important but previously
understudied aspects to improve the stability and overall performance of TB diagnosis: the inner
structure of the CNN model, the learning and fine-tuning process of CNN models pre-trained on a
natural image dataset using the provided medical image data, and the organization of the outputs
given by the trained CNN models. To evaluate the performance of classifiers, [30] recommends a
comprehensive assessment via accuracy, precision, recall, F-measure and AUC. For the diagnosis
of multiple diseases, the confusion matrix is recommended in [31] to summarize intuitive
classification results.
In image classification tasks, the first problem that needs to be considered is image processing, a
basic but critical data preprocessing step to improve the overall quality of the data. [32] introduces
various histogram-based image enhancement methods for the processing of medical images,
including histogram equalization (HE) and contrast limited adaptive histogram equalization
(CLAHE). CLAHE was applied in [33] for the detection of pulmonary tuberculosis via CNN
and achieved a superior result compared to the state of the art.
In [34], metaheuristic algorithms are introduced as modern optimization techniques. Three
metaheuristic approaches, simulated annealing, differential evolution and harmony search, were
implemented on CNN models to classify hand-written digits and pictures of daily objects.
Compared to the CNN models trained with the original optimizers, the proposed method shows an
accuracy improvement of up to 7.14%. [35] presents an introduction to the artificial bee colony
(ABC) algorithm, a metaheuristic method inspired by the foraging behavior of honeybees, and its
advantage for edge detection in CNN-based image classification. Moreover, research on applying
metaheuristic algorithms to mass detection from mammograms was done in [36] and
achieved improved classification results. From these studies, metaheuristic algorithms have
become a possible solution to improve the performance of CNN models in the medical field.
When we move our attention to further improving the overall performance and stability of
CNN models by organizing the generated outputs, Islam's paper [37] caught our attention. This paper
shows that the ensemble of multiple CNN models has an inherent advantage in constructing a non-
linear decision-making system and leads to a promising improvement in visual recognition.
Moreover, they also present lung disease localization using heat maps obtained from
occlusion sensitivity. [38] proposed an object detection method using class activation mapping
which effectively makes use of the trained parameters of the CNN model and gives the location
of the detected object on the given image, making the results of a CNN model easier to
understand.
In our study, we propose a computer-aided detection model for TB diagnosis with various
improvements based on three aspects corresponding to the model structure and the
learning process of CNN models. We cover layer modifications, model fine-tuning via the artificial
bee colony (ABC) algorithm, and overall performance improvement via a CNN model ensemble
method. An exploration of how class activation mapping works in visualizing the location of TB
manifestations for single and ensembled CNN models is provided for the disease localization task.
Chapter 3
Chest X-Ray Datasets and Image Enhancement Methods
3.1 Dataset Selection
3.1.1 Montgomery County Chest X-Ray Dataset
The Montgomery County Chest X-Ray dataset [39] was created by the National Library of Medicine
together with the Department of Health and Human Services, Montgomery County, Maryland,
USA, from patients who joined the tuberculosis (TB) screening program. This dataset contains
138 frontal posteroanterior CXR images in Portable Network Graphics (PNG) format, in which 80
images belong to normal cases and the other 58 are diagnosed as having TB manifestations. Images
in this dataset are 12-bit grey level images with a size of either 4020 × 4892 or 4892 × 4020
pixels. Some sample raw images from this dataset are given in Figure 3-1.
Figure 3-1: Sample Raw Images in Montgomery County CXR Dataset
The image file names are coded as ‘MCUCXR_****_X.png’, where ‘****’ represents a unique
ID number for each patient, and X can be either 0 or 1, representing the normal or abnormal
case respectively. The clinical readings that include the patient’s age, sex and abnormality information
are saved as text files with the same file names as the images. Some examples of clinical
readings are given below.
Figure 3-2: Sample Clinical Readings for CXR Images
Before the experiments, images with a large black background were cropped, and all
images were resized to 512 × 512 pixels.
3.1.2 Shenzhen Hospital Chest X-Ray Dataset
The Shenzhen Hospital Chest X-Ray dataset [39, 40, 41] was created by the National Library of
Medicine, Maryland, USA in collaboration with Shenzhen No.3 People’s Hospital and Guangdong
Medical College in China. This dataset contains 662 frontal posteroanterior CXR images of various
sizes, among which 326 are diagnosed as normal cases and 336 are cases with TB
manifestations. Since it was created by the same institution as the Montgomery County Chest X-
Ray dataset, the image format, file names and clinical readings follow the same rules.
Figure 3-3 presents some sample raw images in Shenzhen Hospital Chest X-Ray dataset.
Figure 3-3: Sample Raw Images in Shenzhen Hospital CXR Dataset
Unlike the Montgomery County Chest X-Ray dataset, no image in this dataset contains a large black
background. Thus, they were resized to 512 × 512 pixels without cropping before the
experiments.
3.1.3 NIH Chest X-Ray8 Dataset
The NIH ChestX-Ray8 dataset [42] is so far one of the largest public chest x-ray databases for
thorax disease detection studies. This dataset was extracted from the clinical PACS database
at the National Institutes of Health Clinical Center and contains 112,120 frontal-view chest x-ray
images of 30,805 unique patients with 14 thoracic pathologies (atelectasis, consolidation,
infiltration, pneumothorax, edema, emphysema, fibrosis, effusion, pneumonia, pleural thickening,
cardiomegaly, nodule, mass and hernia). All images in this dataset have already been preprocessed
to the same size of 1024 × 1024 pixels for convenience. Figure 3-4 presents some sample raw
images from this dataset, including normal as well as TB-related cases.
Figure 3-4: Sample Raw Images in NIH ChestX-Ray8 Dataset
(Panels: No Finding, Consolidation, Effusion, Fibrosis, Infiltration, Mass, Nodule, Pleural Thickening.)
Clinical readings containing the patient’s ID, follow-up number, age, sex, view position and
abnormality information have been organized by image name and saved in one Comma Separated
Values (CSV) file called ‘Data_Entry_2017’.
Although the NIH ChestX-Ray8 dataset has been widely used among researchers in the deep
learning area for medical purposes because of its large amount of data, detailed annotations and
the wide range of thorax diseases it covers, there are still some problems.
The first and biggest problem is that the quality of the chest x-ray images varies a lot, which greatly
increases the workload of data cleaning. Images with side views, images that do not contain much
useful information in the lung area, rotated images and images with bad pixel quality all need to
be removed at the beginning. Otherwise, these ‘bad data’ will interfere with the learning process of the
deep learning models and therefore influence the overall performance on disease diagnosis.
Sample images containing the previously mentioned problems are given in Figure 3-5.
Figure 3-5: Sample images with bad quality in NIH ChestX-Ray8 dataset
Another problem is that, according to [43], since the original radiology reports were not anticipated
to be publicly shared, the disease information and labels for the images were text-mined via natural
language processing techniques with an accuracy of over 90%. The mismatched labels and images will
also bring difficulties for researchers, especially those without any medical background.
Moreover, problems such as the greatly imbalanced number of images for each thorax disease and
the different characteristics of the two view positions (posteroanterior and anteroposterior) all
need to be considered.
According to the experimental needs, only PA images with no-finding labels and the TB-related
manifestations (consolidation, effusion, fibrosis, infiltration, mass, nodule, pleural thickening) are
selected. Among them, images with bad quality as stated above were removed, and the rest
were resized to 512 × 512 pixels before the experiment.
3.2 Image Enhancement Methods
Image enhancement plays an essential role in the image processing field. In the processes of image
formation, transmission and transformation, the image quality might be reduced and features
blurred due to various external factors, which makes image recognition and analysis
considerably more difficult. Hence, attenuating the unnecessary features while highlighting the necessary ones
based on specific needs, so as to improve the visibility of images, has become the main research content of
image enhancement.
In medical image-based disease diagnosis, doctors make judgements based on specific
features displayed in the image. Generally, the human eye is sensitive to high-frequency signals
that contain most of the detail information. However, in medical images, high-frequency signals
are often embedded in a large amount of low-frequency background signal, which reduces
their visibility. Therefore, to better facilitate disease diagnosis, it is possible to enhance the
contrast by appropriately increasing the high-frequency portion of the image via image
enhancement methods.
3.2.1 Histogram Equalization (HE)
The main idea of histogram equalization (HE) is to change the pixel histogram of the original
image from a concentrated grayscale range to a uniform distribution in the whole range [44]. This
method includes the following steps:
(1) Count the total number of pixels, $n_i$, in each grayscale level of the input image, where $i =
0, 1, \ldots, 255$ indexes the possible pixel values of a grayscale image.
(2) Calculate the cumulative distribution function $P(i) = \sum_{k=0}^{i} n_k / n$, $i = 0, 1, \ldots, 255$, where $n$
is the total number of pixels. This function is guaranteed to be monotonically increasing and preserves
the dynamic range of the grayscale values during the image transformation.
(3) Obtain the equalized pixel value for each grayscale level $j$ as $j' = P(j) \times 255$, rounded to its
closest integer.
(4) Finally, replace each pixel with its corresponding equalized value to obtain the equalized
image.
By performing the nonlinear stretch and pixel value redistribution described above, the
number of pixels in each grayscale range becomes approximately the same. This more even
distribution of pixel values over the histogram helps to increase the background contrast and
therefore improves the visual effect of the image.
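The four steps above can be sketched in a few lines of NumPy (a minimal single-channel illustration on a hypothetical toy image, not the exact implementation used in this thesis):

```python
import numpy as np

def histogram_equalization(img):
    """Equalize an 8-bit grayscale image following steps (1)-(4) above."""
    # (1) count the number of pixels in each of the 256 grayscale levels
    counts = np.bincount(img.ravel(), minlength=256)
    # (2) cumulative distribution function of the pixel values
    cdf = np.cumsum(counts) / img.size
    # (3) map each grayscale level to its equalized value and round
    lut = np.round(cdf * 255).astype(np.uint8)
    # (4) replace every pixel with its equalized value
    return lut[img]

# a low-contrast toy image concentrated in [100, 120] spreads towards [0, 255]
img = np.linspace(100, 120, 64, dtype=np.uint8).reshape(8, 8)
eq = histogram_equalization(img)
print(eq.min(), eq.max())
```

The lookup table `lut` realizes the monotonic mapping of step (3), so the relative ordering of pixel values is preserved while the histogram is stretched.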
3.2.2 Contrast Limited Adaptive Histogram Equalization (CLAHE)
As a combination of a contrast-constrained method and adaptive histogram equalization,
contrast limited adaptive histogram equalization (CLAHE) [45] overcomes a problem that exists in the
HE method: HE cannot optimize the local image contrast.
The main idea of CLAHE is to perform HE on the input image through a sliding window and
combine the histograms inside and outside the window; the height of the resulting histogram is then
controlled by the clip limit value to prevent the background noise from being excessively
amplified. However, since processing the image in sub-regions may lead to unevenly
distributed pixel values, interpolation is required across the sub-regions as the very last step.
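The contrast-limiting idea can be sketched for a single tile as follows (a simplified NumPy illustration that clips the histogram and redistributes the excess before equalizing; a full CLAHE additionally processes the image in tiles and interpolates between them, so this is not a complete CLAHE implementation):

```python
import numpy as np

def clipped_equalization(img, clip_limit=1.25):
    """Histogram equalization with a clipped histogram (single tile only).

    `clip_limit` is expressed as a multiple of the mean bin height,
    mirroring the clip-limit parameter used in CLAHE.
    """
    counts = np.bincount(img.ravel(), minlength=256).astype(float)
    limit = clip_limit * counts.mean()
    # clip bins at the limit and spread the clipped excess evenly,
    # which bounds how much any single grayscale level can be amplified
    excess = np.clip(counts - limit, 0, None).sum()
    counts = np.minimum(counts, limit)
    counts += excess / 256
    cdf = np.cumsum(counts) / counts.sum()
    lut = np.round(cdf * 255).astype(np.uint8)
    return lut[img]

out = clipped_equalization(
    np.linspace(50, 200, 256, dtype=np.uint8).reshape(16, 16))
print(out.shape, out.dtype)
```

Clipping the histogram flattens its peaks, so large uniform background regions (the main source of amplified noise in HE) receive a gentler mapping.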
Figure 3-6 presents a sample CXR and its enhanced images via HE and CLAHE; the corresponding
pixel value distributions are also provided as histograms for statistical comparison.
Figure 3-6: Sample CXR and its enhanced results together with the corresponding histogram
As shown in the figure above, the CXR image enhanced by HE has a histogram with a more
equalized distribution of pixels over the total grayscale range compared to the one from the raw CXR.
The enhanced image presents a prominent view of the overall texture and blood vessels. However,
the background noise in the CXR has also been significantly amplified in this case. Meanwhile, the
CXR enhanced by CLAHE has a histogram that not only has a more uniform distribution but also
maintains the same concentration trend as in the raw CXR. The local details in the
corresponding enhanced CXR become clearer while the background noise is suppressed, so
that the total quality of the CXR image is improved.
Therefore, all CXRs are enhanced by CLAHE with a clip limit of 1.25 before the
experiment.
Chapter 4
Deep Learning Models for Image Classification
4.1 Deep Learning
As a new research field in machine learning, the concept of deep learning was first proposed by
Geoffrey Hinton et al. in [46] in 2006 to learn data features and find better data representations
through a multi-level structure and layer-wise training. The main idea is to recognize all kinds of
data, such as images, text and sound, by simulating the way the human brain learns things
through a multi-layer network structure. Unlike traditional machine learning methods, deep
learning can integrate feature extraction and categorical regression into one model and thus greatly
reduce the work of manual feature design.
Deep learning models can be classified into 3 categories based on their structures and applications:
generative deep architectures, discriminative deep architectures and hybrid deep architectures [47].
Generative deep architectures such as the Deep Boltzmann Machine (DBM) and the Deep Belief Network
(DBN) are used to describe high-level correlations among data; the category of an observed
sample is obtained through the joint probability distribution, to better calculate both prior and
posterior probabilities. Discriminative deep architectures are generally used in classification
problems to describe the posterior probability of the data; one typical example is the Convolutional
Neural Network (CNN). Hybrid deep architectures such as Recurrent Neural Networks (RNN)
combine features from both generative and discriminative structures. While solving
classification problems, they make full use of the output from the generative architecture to simplify
as well as optimize the whole model.
The superior ability to extract global features and context information from the inherent
characteristics of data makes deep learning the first choice for solving many complicated
problems. Scholars have carried out remarkable research in this field and proposed a large
number of efficient models that can be directly used and improved according to specific needs.
4.2 Convolutional Neural Networks (CNN)
The Convolutional Neural Network (CNN) is a discriminative classifier developed from the multi-layer
perceptron. In other words, given labelled training examples, the algorithm outputs the
predicted probabilities of each existing category for new data. It is designed to recognize specific
patterns directly from image pixels with minimal pre-processing. Due to its ability to perform
shift-invariant classification based on its hierarchical structure, the CNN is also known as the Shift-
Invariant Artificial Neural Network (SIANN) [48].
The earliest study related to CNN dates back to 1959, by two neuroscientists, Hubel and Wiesel. During their
experiments on cats’ visual cortex, they found that neurons used for local sensitivity and direction
selection could effectively reduce the complexity of the feedback neural network, and thus
proposed the concept of the receptive field. Later, in 1980, inspired by their idea, the Japanese scholar
Kunihiko Fukushima proposed the neocognitron model [49], which decomposes a complex visual
pattern into many sub-patterns (also known as features) and processes them through a series of hierarchically
connected feature planes. This model can be regarded as the first implementation of CNN. It
attempts to imitate the visual system so that objects with displacement or slight deformation can
still be recognized. In 1998, a multi-layer CNN structure constructed by Yann LeCun et al. [50],
LeNet5, achieved great success in the task of recognizing handwritten digits. Since 2012, CNN
has received great attention from researchers and has been widely applied in the computer
vision field.
With the continuous improvement of the computing power of computer chips, the generalization
of GPU clusters, as well as the appearance of studies on various optimization theories and
fine-tuning technologies, new CNN models have been continuously proposed with
improved structures and better performance. Figure 4-1 illustrates the development of CNN
models.
Figure 4-1: CNN structural evolution map
4.2.1 Basic CNN Structure
A complete CNN architecture is mainly composed of convolutional layers, pooling layers, and
fully-connected layers, as shown in Figure 4-2.
(Map: Hubel & Wiesel → Neocognitron → LeNet as early attempts; AlexNet, with Dropout and ReLU, as the historical breakthrough; then two lines of development, deepening the network (VGG16, VGG19, MSRANet) and improving the convolution module (Network in Network, GoogLeNet Inception V1, V2, V3, V4); and finally the integration of the two lines in ResNet and Inception-ResNet, with the ability to train deeper networks and accelerate convergence.)
Figure 4-2: CNN structure based on LeNet-5
(1) Convolutional Layer
The convolutional layer is the core building block of a CNN. Its main idea is to extract patterns
found within the local regions of the input image that are common throughout the dataset [51].
Figure 4-3: Process of convolution
As illustrated in Figure 4-3, the convolution operation is processed by sliding the kernel across the
input from left to right and top to bottom, generating the feature maps by performing the following
calculation:
$$(x * W)_{m,n} = \sum_{p=1}^{h} \sum_{q=1}^{w} x_{m+p-1,\,n+q-1}\, W_{p,q}$$

$$h_i = \sigma(x * W_i + b_i)$$

where $W_i$ is the shared weight matrix of size $w \times h$ in layer $i$, and $b_i$ is the shared bias. ‘$*$’
denotes the convolution operation, that is, summing the elementwise products between $W_i$ and the
corresponding pixel values from the output of layer $i-1$. $\sigma$ represents the activation function,
which mainly adds non-linear factors to the model to better solve more complex
problems. The most commonly used activation functions are Sigmoid, Tanh and the Rectified Linear Unit
(ReLU).
In essence, the outputs obtained from the convolutional layer are combinations of features extracted
from the receptive field with their relative positions preserved. These outputs may be further
processed by another layer with higher-level weight vectors to detect larger patterns in the
original image. During the convolution process, the shared weight vector gives a strong
response to short snippets of data with specific patterns.
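The convolution defined above can be sketched as follows (a minimal single-channel, stride-1 NumPy illustration; like most CNN frameworks, it computes the cross-correlation form of the equation, without flipping the kernel):

```python
import numpy as np

def conv2d(x, W):
    """Slide kernel W over input x (stride 1, no padding), computing
    out[m, n] = sum_{p,q} x[m+p, n+q] * W[p, q] as in the equation above."""
    h, w = W.shape
    H, Wd = x.shape
    out = np.zeros((H - h + 1, Wd - w + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            # elementwise product of the kernel and the receptive field
            out[m, n] = np.sum(x[m:m + h, n:n + w] * W)
    return out

x = np.arange(9, dtype=float).reshape(3, 3)   # 3x3 input
W = np.array([[1.0, 0.0], [0.0, 1.0]])        # 2x2 kernel
print(conv2d(x, W))                           # 2x2 feature map
```

A 3 × 3 input convolved with a 2 × 2 kernel yields a 2 × 2 feature map, matching the sliding-window example of Figure 4-3.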
(2) Pooling Layer
Another important concept in CNN is the pooling layer, which is normally placed after the
convolutional layer and provides a method of non-linear downsampling. It divides the output of
the convolutional layer into disjoint regions and provides a single summary for each region to
obtain the characteristics of the convolution.
Classic pooling methods include max pooling, mean pooling and stochastic pooling. Figure 4-4
illustrates their working principles.
Figure 4-4: Classic pooling working principles
Max pooling takes the maximum value in the local receptive region, while mean pooling averages
all those values. Stochastic pooling [52] assigns each sample point in the local receptive region
a probability value and then selects a value randomly based on these probabilities. Additional
pooling methods such as adaptive pooling, mixed pooling, spectral pooling, spatial pyramid
pooling, etc. have also been proposed based on specific needs [53].
By extracting the desired features from the local area, the pooling operation increases the tolerance to
distortion and displacement, improving fault tolerance. Moreover, the use of pooling greatly
reduces the spatial size of the data and thus improves the computational efficiency.
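Max pooling over the 4 × 4 example of Figure 4-4 can be sketched as follows (a minimal NumPy illustration):

```python
import numpy as np

def max_pool(x, k=2, stride=2):
    """Take the maximum over each k x k region of x with the given stride."""
    H, W = x.shape
    out = np.empty(((H - k) // stride + 1, (W - k) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + k, c:c + k].max()
    return out

# the 4x4 input from Figure 4-4, pooled with a 2x2 kernel and stride 2
x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool(x))   # maxima of the four regions: [[6, 8], [3, 4]]
```

Replacing `.max()` with `.mean()` yields mean pooling (for this input, [[3.25, 5.25], [2.0, 2.0]], as in the figure).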
(3) Fully Connected Layer
Before getting the classification result, there are usually one or more fully connected layers placed
at the very end of a CNN model. As with layer structures in a neural network, the fully connected layer
is composed of a certain number of neurons with no connections within a layer, while neurons between
adjacent layers are fully connected (see Figure 4-5).
Figure 4-5: Fully connected layer neuron schematic diagram
This fully connected layer structure forms a shallow multi-layer perceptron which aims at
integrating the previously extracted local feature information with categorical discrimination to
classify the input data.
4.2.2 CNN Working Mechanism
CNN uses existing samples and their corresponding labels to train the model, so that parameters
such as weights and biases can be adjusted through backpropagation of the loss during the training
process to improve the classification accuracy.
The training process is as follows:
(1) Initialize all parameters of the model to small random numbers.
(2) Randomly pick $n$ samples from the training data, $((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$, and feed
them into the model, where $x_i$, $i = 1, 2, \ldots, n$ represents the sample data and $y_i \in \{1, 2, \ldots, k\}$,
$i = 1, 2, \ldots, n$ is the one of the $k$ categories corresponding to the sample, also known as the
expected label of the $i$-th sample.
(3) Propagate the input data forward through the model layer by layer and obtain the predicted classification
results given by the model. In CNN, the most commonly used classifier for generating the result
is the SoftMax classifier. The calculation is defined as follows:

$$h_\theta(x_i) = \begin{bmatrix} P(y_i = 1 \mid x_i, \theta) \\ P(y_i = 2 \mid x_i, \theta) \\ \vdots \\ P(y_i = k \mid x_i, \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x_i}} \begin{bmatrix} e^{\theta_1^T x_i} \\ e^{\theta_2^T x_i} \\ \vdots \\ e^{\theta_k^T x_i} \end{bmatrix}$$

where $\theta$ denotes the parameters involved in the CNN model. This equation scales the resulting
$k$-dimensional vector to numbers between 0 and 1 with a total sum equal to 1. Each element of this
vector is therefore the probability that the input belongs to the corresponding class. The element
with the highest probability is selected, and the class corresponding to this probability is assigned
to the input image as the final decision of the model.
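The SoftMax calculation above can be sketched as follows (a minimal NumPy illustration on hypothetical raw scores for k = 3 classes):

```python
import numpy as np

def softmax(z):
    """Scale a k-dimensional score vector to probabilities summing to 1."""
    e = np.exp(z - z.max())   # subtracting the max improves numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])       # hypothetical raw outputs, k = 3
probs = softmax(scores)
predicted_class = int(np.argmax(probs))  # class with the highest probability
print(probs, predicted_class)
```

Subtracting the maximum before exponentiating leaves the result unchanged (the factor cancels in the ratio) while avoiding overflow for large scores.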
(4) Compare the predicted class label, $\hat{y}_i$, with the expected label, $y_i$, and calculate the cross-
entropy cost function using the following equation:

$$J(W, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where $W$ indicates the weights and $b$ represents the biases.
For all training samples, if the predicted values are close to the expected values, the cross-entropy
will be close to 0.
(5) Compute the gradients of the cost function with respect to $W$ and $b$ using the following formulas, so that
the weights and biases that contributed most to the loss are obtained:

$$\frac{\partial J}{\partial W_i} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sigma'(z)\, x_i}{\sigma(z)\bigl(1 - \sigma(z)\bigr)} \bigl(\sigma(z) - y\bigr)$$

$$\frac{\partial J}{\partial b_i} = \frac{1}{n} \sum_{i=1}^{n} \bigl(\sigma(z) - y\bigr)$$
where $\sigma(z)$ is the activation function and $\sigma(z) - y$ indicates the error between the predicted and
expected values. Therefore, as the error gets larger, the gradient keeps increasing and
the parameters are adjusted at a faster speed.
Once the gradients have been computed, the weights and biases are updated as follows so that the total
cost decreases:

$$W_i = W_i - \eta \frac{\partial J}{\partial W_i}, \qquad b_i = b_i - \eta \frac{\partial J}{\partial b_i}$$
$\eta$ in the above equations is the learning rate, a hyperparameter that is normally set manually.
A high learning rate means taking larger steps when updating the parameters, so the model
takes less time to converge to an optimal value. However, if the step is too
large, the updated values will jump too far each time, and the resulting optimum will not be
accurate enough. On the contrary, if the learning rate is too low, the model will take a long
time to converge. Therefore, the learning rate should be chosen neither too high nor
too low.
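Steps (4) and (5), together with the update rule, can be sketched for a single sigmoid output as follows (a NumPy illustration on hypothetical toy data; note that for the sigmoid activation the gradient formulas above simplify, since $\sigma'(z) = \sigma(z)(1 - \sigma(z))$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical toy binary problem: 4 samples, 2 features
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 1.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
W, b, eta = np.zeros(2), 0.0, 0.5          # parameters and learning rate

for _ in range(200):
    p = sigmoid(X @ W + b)                 # predicted probabilities
    # step (4): cross-entropy cost J(W, b)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # step (5): gradients (sigmoid case, so the sigma' factor cancels)
    grad_W = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    W -= eta * grad_W                      # update step with learning rate eta
    b -= eta * grad_b

print(loss)
```

Starting from the untrained loss of $\log 2 \approx 0.693$, repeated updates drive the cross-entropy down on this separable toy problem.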
4.2.3 Explicit Training of CNN
During the training of a CNN, the input data volume and the parameter updating method are
the main factors that influence the whole process. The most commonly used training methods are
batch training, stochastic gradient descent and mini-batch training.
In batch training, all data are fed into the model to calculate the total gradient increment, which is
then used to update the parameters. This method reduces the number of updates of the
model parameters and thus decreases the calculation cost of the model as well as shortens the
training time. Since each update involves the gradient from all input data, the parameters
move faster towards the direction in which the cost function drops. Moreover, it helps to avoid
over-fitting caused by training on a small number of samples. However, averaging over the entire
set of samples to get the gradient decreases the impact of changes provided by individual parameters,
and thus more training epochs are required. This problem is catastrophic for large datasets.
The stochastic gradient descent method feeds randomly shuffled data into the model one sample at a
time during training, generating a gradient and updating the parameters for each sample individually.
Compared to batch training, this algorithm greatly reduces the number of training iterations.
However, since each parameter update relies on a single data sample, the model
tends to optimize for individual samples rather than the general data and is thus
easily affected by over-fitting.
Mini-batch training combines the advantages of both batch training and stochastic
gradient descent. This method divides the randomly shuffled data into small batches and
calculates the gradient for each batch to carry out the parameter update. Since the datasets used for
deep learning tend to be relatively large, it has become the most commonly used training method.
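The mini-batch scheme can be sketched as follows (a minimal NumPy illustration; the data and batch size are hypothetical):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield randomly shuffled (X, y) batches, as in mini-batch training."""
    order = rng.permutation(len(X))          # randomly shuffle the data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]                 # one batch per gradient update

rng = np.random.default_rng(0)
X, y = np.arange(20).reshape(10, 2), np.arange(10)
for xb, yb in minibatches(X, y, batch_size=4, rng=rng):
    print(xb.shape, yb.shape)                # batches of 4, 4, and 2 samples
```

Each yielded batch would drive one gradient computation and parameter update; one full pass over all batches constitutes a training epoch.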
4.3 Significance of Applying CNN in Medical Image Recognition
In recent years, with the improvement of radiological medical equipment and the increasing
number of daily diagnostics, thousands of medical images are produced in hospitals every day,
which greatly increases the workload of film-reading doctors.
Traditional computer-aided detection (CAD) systems use machine learning methods such as
support vector machines (SVM) and K nearest neighbors (KNN) to help radiologists improve
diagnostic accuracy. However, most of these methods need manually extracted disease features.
With the varied and ever-changing features of lesions, previously extracted features may not
apply to new patient data. Therefore, traditional machine learning methods are not suitable as a
long-term effective solution.
With its ability to extract complex pathological features automatically from the data, given
the intrinsically required large dataset, the application of CNN in various diagnostic modalities
turns out to be an efficient solution.
4.4 Advanced CNN Models Used in The Experiment
4.4.1 VGGNet
VGGNet is a deep CNN model proposed by researchers from the Visual Geometry Group at the
University of Oxford and Google DeepMind, which aims at exploring the relationship between the
depth of a CNN and its performance.
Compared to classic CNN structures, the most prominent feature of VGGNet is the increase of
model capacity and complexity by repeatedly stacking small 3 × 3 convolution kernels
and 2 × 2 maximum pooling kernels. Moreover, [54] demonstrated that superimposing
multi-level convolution kernels reduces the number of parameters in the model and thus
the amount of calculation. For example, for a layer that has both input and output with
C channels, the number of parameters required using a 7 × 7 convolution kernel is
7 × 7 × C × C = 49C². However, by stacking three layers of 3 × 3 convolution kernels, the total number
of parameters needed is reduced to 3 × 3 × 3 × C × C = 27C².
With a deeper and more complex structure that better extracts visual data representations in a
hierarchical way, VGGNet significantly reduced the number of iterations required for convergence
as well as the error rate.
The VGGNet models used in our experiment are VGG16 and VGG19 pre-trained on the ImageNet
dataset; the model structures and parameters are provided by Torch 0.3.1.post2.
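The parameter comparison above can be checked with a few lines of arithmetic (C = 64 is a hypothetical channel count; biases are ignored):

```python
def conv_params(k, channels, layers=1):
    """Parameters of `layers` stacked k x k convolutions with `channels`
    input and output channels (biases ignored)."""
    return layers * k * k * channels * channels

C = 64
print(conv_params(7, C))      # one 7x7 layer:     49 * C^2
print(conv_params(3, C, 3))   # three 3x3 layers:  27 * C^2
```

The stacked 3 × 3 layers cover the same 7 × 7 receptive field with roughly 45% fewer parameters, while the interleaved non-linearities add representational power.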
4.4.2 GoogLeNet Inception Model
The GoogLeNet Inception model was first proposed by Christian Szegedy et al. in [55] and later
improved in [56]. The highlight of this model is the introduction of inception modules, which use a
dense structure to approximate a sparse one, improving efficiency by extracting more
features under the same amount of computation.
Before that, CNN models tended to achieve better training results simply by increasing the depth of
the network. However, problems such as overfitting on small datasets and escalating
computational complexity arise as the number of layers keeps increasing. The inception
modules (see Figure 4-6) address these problems by using multiple kernels that capture
receptive fields of various sizes, together with a 1×1 bottleneck layer that shrinks the number
of channels.
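The computational saving from the 1×1 bottleneck can be illustrated with a rough multiply-accumulate count (a sketch; the channel sizes below are illustrative assumptions, not necessarily GoogLeNet's exact configuration):

```python
def conv_macs(kernel, c_in, c_out, h, w):
    """Approximate multiply-accumulate operations of a conv layer on an h x w feature map."""
    return kernel * kernel * c_in * c_out * h * w

H = W = 28  # feature-map size (illustrative)

# direct 5x5 convolution: 192 -> 32 channels
direct = conv_macs(5, 192, 32, H, W)

# bottleneck: 1x1 first reduces 192 -> 16 channels, then 5x5 expands 16 -> 32
bottleneck = conv_macs(1, 192, 16, H, W) + conv_macs(5, 16, 32, H, W)

print(direct, bottleneck)  # the bottleneck path costs roughly an order of magnitude less
```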
Figure 4-6: Inception model with dimension reduction
The GoogLeNet Inception model used in our experiment is Inception V3 pre-trained on the
ImageNet dataset; model structure and parameters are provided by PyTorch 0.3.1.post2.
4.4.3 ResNet
As one of the most popular CNN models used across computer vision, ResNet was proposed in
[57] in 2015. It adds residual blocks (see Figure 4-7) to the original CNN structure,
which keeps the architecture simple yet deep while increasing
accuracy.
Figure 4-7: Shortcut connection of the residual block
The residual block shown above contains two mappings: an identity mapping and a residual mapping.
By establishing a direct connection between the input and output through these two mappings, each
later layer only needs to learn new features on top of the input, so even if
the residual shrinks to 0 as the model goes deeper, the identity mapping remains and the
network stays in an optimal state without losing the gradient. This effectively alleviates the
vanishing-gradient problem that exists in deep CNN models.
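The identity shortcut can be sketched in NumPy (a simplified fully connected residual block standing in for the convolutional one; weights here are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x), where F(x) is two weight layers with a ReLU between."""
    f = W2 @ relu(W1 @ x)  # residual mapping F(x)
    return relu(f + x)     # shortcut adds the identity mapping

x = np.array([1.0, 2.0, 3.0])
# if the residual mapping collapses to zero, the block passes x through unchanged
W_zero = np.zeros((3, 3))
print(residual_block(x, W_zero, W_zero))
```

This is exactly the property described above: when the residual goes to 0, the block degenerates to the identity rather than destroying the signal.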
The ResNet models used in our experiment are ResNet34, ResNet50 and ResNet101 pre-trained
on the ImageNet dataset; model structures and parameters are provided by PyTorch
0.3.1.post2.
Chapter 5
TB Detection via Improved CNN Models and Artificial Bee Colony
Fine-Tuning
5.1 Transfer Learning
Transfer learning is a process that focuses on storing the knowledge learned in solving one problem
and applying it to a correlated task [58]. This method aims at leveraging what has already been
learned to build accurate models for specific tasks more efficiently [59].
In the computer vision field, models with deep and complicated structures are expensive to train
because of their requirement for large datasets and costly hardware such as GPUs. Moreover, it
can take several weeks or longer to train a model from scratch. Thus, using a pre-trained
model, with its developed internal parameters and well-trained feature extractors, helps
increase the overall performance of the model when solving similar problems with relatively
smaller datasets.
Considering these advantages of transfer learning, the CNN models that we implemented
during the experiment have all been pre-trained on the ImageNet dataset [60], a very large dataset
containing millions of images in over one thousand categories. The models with these
pre-trained parameters are then modified and trained on CXR datasets for TB diagnosis.
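In PyTorch this typically amounts to loading a pre-trained model, freezing its parameters, and replacing the final classification layer for the new task. A minimal sketch (with a small stand-in network instead of a real ImageNet backbone, so no weights need to be downloaded; layer sizes are illustrative):

```python
import torch.nn as nn

# stand-in for a pre-trained backbone; in practice this would be something like
# torchvision.models.resnet34(pretrained=True)
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 1000),  # original 1000-way ImageNet classifier head
)

# freeze every pre-trained parameter
for p in model.parameters():
    p.requires_grad = False

# replace the head with a fresh 2-way classifier (normal vs. abnormal with TB);
# newly constructed layers are trainable by default
model[-1] = nn.Linear(16, 2)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the new head's weight and bias remain trainable
```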
5.2 Modifications of Advanced CNN Models
5.2.1 Modifications on CNN Architectures
To further boost the performance of the pre-trained models and better utilize the developed internal
parameters for TB diagnosis, some slight modifications are made on the last few layers of the
original advanced CNN models.
Figure 5-1 presents the diagram of the modifications that are made on the original CNN
architecture.
Figure 5-1: Modifications on CNN architecture
First, the final pooling layer is changed from the default max or average
pooling into a parallel connection of adaptive max and average pooling, a pooling
method provided by PyTorch that automatically controls the output size based
on the input parameters. Collecting both maximized and averaged feature maps preserves
more of the high-level information learned from the task dataset and thus provides more useful
and comprehensive details for the final prediction. After adaptive pooling, two fully connected
layers are added before the final output, forming a deeper classifier that better captures and
organizes the high-level information and improves overall performance. Moreover, batch
normalization [61] is applied in the fully connected layers to improve efficiency and
reduce the internal covariate shift of the activation values in the feature maps, so that the
distribution of the activations remains stable during training. Dropout is also added so that a
certain fraction of neurons in each layer, along with all their incoming and
outgoing connections, is randomly dropped during training, which helps avoid the overfitting
that this deeper, more complicated structure might otherwise cause.
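Taken together, the replacement head could look like the following PyTorch sketch (layer sizes, dropout rates, and the 512-channel input are illustrative assumptions, not the thesis's exact configuration):

```python
import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    """Parallel adaptive max and average pooling, concatenated channel-wise."""
    def __init__(self):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        return torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)

n_feat = 512  # channels leaving the CNN's main part (illustrative)
head = nn.Sequential(
    AdaptiveConcatPool2d(),      # doubles the channels: 512 -> 1024
    nn.Flatten(),
    nn.BatchNorm1d(2 * n_feat),
    nn.Dropout(0.25),
    nn.Linear(2 * n_feat, 256),  # fully connected layer I
    nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Dropout(0.5),
    nn.Linear(256, 2),           # fully connected layer II -> normal / TB
)

x = torch.randn(4, n_feat, 7, 7)  # a dummy batch of feature maps
print(head(x).shape)
```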
5.2.2 Model Division with Different Learning Rates
Model division with different learning rates divides the layers of a CNN model into
several groups (three in our experiment) and assigns each group a different learning rate
for training. The general idea is illustrated in Figure 5-2.
Figure 5-2: CNN division with different learning rates
The first few layers of a CNN model mainly extract generic features (edges, shapes,
textures, etc.) that identify the basic information present in every image, and thus need very
little training during the transfer learning process. Layers in the middle part learn
more complex features and details specific to the dataset on which the model is trained.
Knowledge learned at this stage has a direct impact on the result of the specific task
related to the training dataset, so these layers are trained with a slightly higher learning rate. The
last few layers combine all the features extracted by the previous layers to recognize the target
objects (animals, vehicles, tumors, etc.) and generate the final result. Since these layers correlate
most strongly with the target training dataset and will trade the previous feature maps for
representations more aligned with the target task, this is where the bulk of the training should
happen. This part of the CNN model is therefore given the highest learning rate of the three.
In general, during transfer learning, model division with different learning rates helps a CNN
model adapt to the target problem: since it has already learned general features, it can
perform well on the new problem both effectively and efficiently. How much smaller the
learning rates for the first and middle parts of the model should be depends on the correlation
between the data the model was pre-trained on and the target data.
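In PyTorch, this division maps naturally onto optimizer parameter groups. A sketch, assuming the model has already been split into three layer groups (the stand-in layers and the 1e-3/25 and 1e-3/5 ratios mirror the schedule used later in the experiment):

```python
import torch
import torch.nn as nn

# three stand-in layer groups: early, middle, and head layers of a divided CNN
group1 = nn.Conv2d(3, 8, 3)
group2 = nn.Conv2d(8, 16, 3)
group3 = nn.Linear(16, 2)

base_lr = 1e-3
optimizer = torch.optim.SGD(
    [
        {"params": group1.parameters(), "lr": base_lr / 25},  # generic features: smallest lr
        {"params": group2.parameters(), "lr": base_lr / 5},   # mid-level features
        {"params": group3.parameters(), "lr": base_lr},       # task-specific head: largest lr
    ],
    lr=base_lr,  # default, overridden by every group above
)

print([g["lr"] for g in optimizer.param_groups])
```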
5.3 Fine-Tuning the CNN Model via Artificial Bee Colony Algorithm
5.3.1 Artificial Bee Colony
Artificial Bee Colony (ABC) is a metaheuristic algorithm proposed by Karaboga in [62]
in 2005. Inspired by the foraging behaviour of honeybees, it abstracts that behaviour into a
mathematical model for solving multidimensional optimization problems [63]. The algorithm
represents solutions in the multidimensional search space as food sources (nectar) and maintains a
population of three types of bees (scout, employed, and onlooker) that search for the best food source
[64].
In the early stage of food collecting, scout bees go out to find food sources, either exploring
with prior knowledge or searching randomly. A scout bee turns into an employed bee
after finishing its search. Employed bees are mainly in charge of locating the food source
and collecting nectar back to the hive. After that, depending on the situation, each employed bee
chooses among continuing to collect nectar, dancing to recruit more peers, or abandoning
the current food source and changing its role to scout or onlooker. Onlooker bees
decide whether to participate in nectar collection based on the dances performed by
the employed bees.
Assume that the problem domain has a dimension of D. The position of a food source 𝑖, which is
the parameter vector that needs to be optimized, is represented as:
𝑋𝑖 = [𝑥𝑖1, 𝑥𝑖2, … , 𝑥𝑖𝐷], 𝑥𝑖𝑑 ∊ (𝐿𝑑 , 𝑈𝑑)
where 𝐿𝑑 and 𝑈𝑑 represent the lower and upper limits of the search space respectively, and 𝑑 =
1, 2, 3, … , 𝐷 indicates the dimension index.
The location of the food source will be initialized according to the following equation:
𝑥𝑖𝑑 = 𝐿𝑑 + 𝑟𝑎𝑛𝑑(0,1)(𝑈𝑑 − 𝐿𝑑)
During the searching step, nearby food locations, 𝑉𝑖 = [𝑣𝑖1, 𝑣𝑖2, … , 𝑣𝑖𝐷], are generated around
the existing food source through the employed bees' exploration via:
𝑣𝑖𝑑 = 𝑥𝑖𝑑 + 𝜑(𝑥𝑖𝑑 − 𝑥𝑗𝑑), 𝑗 ≠ 𝑖
where 𝜑 is a uniformly distributed random number in the interval [0, 1].
A fitness value is then obtained by a non-linear transformation of the target
function 𝑓𝑖(𝑋𝑖), as follows, to decide whether to replace the original food source 𝑋𝑖 with the
newly explored location 𝑉𝑖 through greedy selection:
𝑓𝑖𝑡𝑖(𝑋𝑖) = 1 / (1 + 𝑓𝑖(𝑋𝑖)),   if 𝑓𝑖(𝑋𝑖) ≥ 0
𝑓𝑖𝑡𝑖(𝑋𝑖) = 1 + 𝑎𝑏𝑠(𝑓𝑖(𝑋𝑖)),   if 𝑓𝑖(𝑋𝑖) < 0
After that, the employed bees fly back to the hive and share information about their food sources.
Onlooker bees then decide whether to follow an employed bee and collect food based on the
calculated probability:
𝑝𝑖 = 𝑓𝑖𝑡𝑖 / (𝑓𝑖𝑡1 + 𝑓𝑖𝑡2 + ⋯ + 𝑓𝑖𝑡𝑁𝑃)
where 𝑁𝑃 indicates the total number of discovered food sources.
The selection of an employed bee is executed via roulette-wheel selection: a random number between 0
and 1 is generated, and if it is less than 𝑝𝑖, the onlooker bee follows the employed bee,
repeating its job of generating a new food source around 𝑋𝑖 and then proceeding with the greedy
selection.
If the number of rounds in which a food source has been mined without improvement reaches the
threshold after 𝑡 iterations, the food source is abandoned. The corresponding employed bee then
turns into a scout bee and continues exploring new food sources to replace the old one.
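The loop described above can be condensed into a short sketch, here minimizing a simple sphere function standing in for a real objective (colony size, abandonment limit, and iteration count are arbitrary; the perturbation factor follows the common symmetric [-1, 1] convention):

```python
import random

def abc_minimize(f, dim, bounds, n_food=10, limit=20, iters=200, seed=0):
    """Artificial Bee Colony sketch minimizing f over a box-bounded search space."""
    rng = random.Random(seed)
    lo, hi = bounds
    init = lambda: [lo + rng.random() * (hi - lo) for _ in range(dim)]
    foods = [init() for _ in range(n_food)]  # one food source per employed bee
    trials = [0] * n_food                    # rounds without improvement

    def fitness(x):
        v = f(x)
        return 1.0 / (1.0 + v) if v >= 0 else 1.0 + abs(v)

    def neighbour(i):
        """Perturb one dimension of food i toward/away from a random peer j."""
        j = rng.choice([k for k in range(n_food) if k != i])
        d = rng.randrange(dim)
        v = foods[i][:]
        v[d] = min(hi, max(lo, v[d] + rng.uniform(-1, 1) * (v[d] - foods[j][d])))
        return v

    def greedy(i, v):
        if fitness(v) > fitness(foods[i]):
            foods[i], trials[i] = v, 0
        else:
            trials[i] += 1

    for _ in range(iters):
        for i in range(n_food):                      # employed bee phase
            greedy(i, neighbour(i))
        fits = [fitness(x) for x in foods]
        total = sum(fits)
        for i in range(n_food):                      # onlooker bee phase (roulette)
            if rng.random() < fits[i] / total:
                greedy(i, neighbour(i))
        for i in range(n_food):                      # scout bee phase
            if trials[i] >= limit:
                foods[i], trials[i] = init(), 0
    return min(foods, key=f)

best = abc_minimize(lambda x: sum(v * v for v in x), dim=3, bounds=(-5.0, 5.0))
print(best)  # should be near the origin after convergence
```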
Figure 5-3 presents a flowchart of how the ABC algorithm works to solve optimization problems.
Figure 5-3: Flowchart of artificial bee colony algorithm
5.3.2 CNN Fine-Tuning via Artificial Bee Colony
Considering the ABC algorithm's advantage in finding a globally optimal solution, it has been
used in our experiment to fine-tune the fully connected layers of the trained CNN models on the
CXR datasets to improve their general performance.
The whole fine-tuning process can be regarded as searching for parameters that
further minimize the total loss of the CNN model. Starting with randomly generated solutions,
ABC keeps looking for better solutions by searching the regions near the current best solution
and abandoning poor solutions.
In the beginning, a solution vector containing a specific number of candidate solutions is created.
To make full use of the previous training results, the first element of the solution vector is
initialized with the weights and biases obtained from the trained CNN model. The remaining elements
are generated near the obtained weights and biases within the given space by multiplying the first
solution by a random number between 0 and 1:
𝑠𝑜𝑙_𝑣𝑒𝑐 = (𝑤(𝑡)1, 𝑤(𝑡)2, … , 𝑤(𝑡)𝑛)
𝑤(0)1 = (𝑛𝑛.𝑊, 𝑛𝑛. 𝑏)
𝑤(0)𝑖 = 𝑟𝑎𝑛𝑑(0, 1)𝑤(0)1, 𝑖 = 2,3, … , 𝑛
where 𝑡 represents the total number of iterations needed during the whole fine-tuning process.
Generating multiple solutions not only takes advantage of the parameters from the
trained model but also prevents the optimization from falling into local optima.
The search for nearby solutions then starts from the initialized vectors:
𝑔𝑒𝑛_𝑣𝑒𝑐 = (𝑣(𝑡)1, 𝑣(𝑡)2, … , 𝑣(𝑡)𝑛)
𝑣(𝑘)𝑖 = 𝑤(𝑘 − 1)1 + 𝛷𝑖(𝑤(𝑘 − 1)𝑖 − 𝑤(𝑘 − 1)𝑗), 𝑖 ≠ 𝑗
where 𝑘 represents the 𝑘-th iteration of the optimization process.
Once a new solution is found near an initialized one, a fitness value measuring the quality
of solutions is computed to compare the old and new solutions, according to the
following equation:
𝑓𝑖𝑡(𝑤(𝑘)𝑖) = 1 / (1 + 𝐸(𝑤(𝑘)𝑖))
𝐸(𝑤(𝑘)𝑖) in the above equation is the loss function of the CNN model, the target function
to be optimized, and it always yields a non-negative value. The loss function used in
our experiment is the cross-entropy loss:
𝐸(𝑤(𝑘)𝑖) = −(1/𝑛) ∑𝑖=1..𝑛 [𝑦𝑖 ln(𝑜(𝑘)𝑖) + (1 − 𝑦𝑖) ln(1 − 𝑜(𝑘)𝑖)]
where 𝑦𝑖 represents the expected output of the 𝑖-th sample within the training batch and 𝑜(𝑘)𝑖 is
the actual output of this sample from the 𝑘-th iteration.
The selection of a better solution proceeds based on the probability calculated from the
fitness values:
𝑝(𝑘)𝑖 = 𝑓𝑖𝑡(𝑤(𝑘)𝑖) / ∑𝑖 𝑓𝑖𝑡(𝑤(𝑘)𝑖)
According to the above equations, the smaller the loss of a generated solution, the larger its
fitness value, and the greater its probability of being selected as the final solution.
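The fitness and selection-probability computations can be checked with a few example loss values (hypothetical numbers, purely illustrative):

```python
# hypothetical cross-entropy losses E(w(k)_i) of four candidate solutions
losses = [0.9, 0.4, 0.1, 2.0]

# fitness: fit(w) = 1 / (1 + E(w)); cross-entropy is non-negative, so this branch suffices
fits = [1.0 / (1.0 + e) for e in losses]

# selection probability: p_i = fit_i / sum of all fitness values
total = sum(fits)
probs = [f / total for f in fits]

print([round(p, 3) for p in probs])
```

As expected, the candidate with the smallest loss (0.1) ends up with the largest selection probability.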
Figure 5-4 presents a flowchart of how the ABC algorithm works on fine-tuning a trained CNN
model.
Figure 5-4: Fine-tune the trained CNN model via artificial bee colony algorithm
5.4 Experiment Settings
5.4.1 Experiment Descriptions
In our experiment, binary classification of CXR images for TB diagnosis, which identifies lung
abnormalities, is performed by six different CNN models (VGG16, VGG19, Inception V3,
ResNet34, ResNet50 and ResNet101) on three open-source CXR datasets (the Montgomery County
Chest X-Ray dataset, the Shenzhen Hospital Chest X-Ray dataset and the NIH Chest X-Ray8 dataset)
respectively.
For each CNN model and each dataset, classification involves three stages:
training the original CNN architecture; training the modified CNN architecture with
differential learning rates; and fine-tuning the trained modified CNN model via ABC.
In the first stage, all parameters of the CNN model with the original architecture are
frozen except for the last layer, and that last layer is trained on the target CXR dataset for 3
epochs with a learning rate of 1e-3. Since the CNN models used in our experiment were all
pre-trained on the ImageNet dataset to classify everyday objects into 1000 categories,
the features learned in earlier layers must, according to transfer learning, transition
from general to specific by the last layer, which directly influences the final output;
training the last layer on the new dataset for a few epochs at the start therefore reduces the time
the model needs to converge on the new task. Then, with the last layer precomputed, all
parameters of the CNN model are unfrozen and the entire model is trained on the target
dataset for 12 epochs with a learning rate of 1e-4. During the second stage, the last few
layers of the original CNN model are modified according to 5.2.1. These modified layers are first
trained on the target CXR dataset for 3 epochs with a learning rate of 1e-3. After the first
step of training, the modified CNN model is divided into 3 parts, all parameters are unfrozen,
and the entire model is trained on the target dataset for 12 epochs with a different learning rate
(1e-3/25, 1e-3/5 and 1e-3) assigned to each part.
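This staged freeze/unfreeze schedule can be sketched as follows (a toy two-layer model and random tensors stand in for the real CNN and CXR data, and the epoch counts are shortened; only illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(             # stand-in for a pre-trained CNN
    nn.Linear(32, 16), nn.ReLU(),  # earlier (pre-trained) layers
    nn.Linear(16, 2),              # final / modified head
)
x, y = torch.randn(8, 32), torch.randint(0, 2, (8,))
loss_fn = nn.CrossEntropyLoss()

# stage 1a: freeze everything but the head, train the head at lr = 1e-3
for p in model[:2].parameters():
    p.requires_grad = False
opt = torch.optim.SGD(model[2].parameters(), lr=1e-3)
for _ in range(3):  # "3 epochs"
    opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()

# stage 1b/2: unfreeze all parameters and train the whole model,
# here with discriminative learning rates per layer group
for p in model.parameters():
    p.requires_grad = True
opt = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-3 / 25},  # early layers
        {"params": model[2].parameters(), "lr": 1e-3},       # head
    ],
    lr=1e-3,
)
for _ in range(2):  # shortened stand-in for 12 epochs
    opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
print("training schedule complete")
```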
In the above two stages, throughout the training process with its fixed number of epochs,
the model that achieves the highest classification accuracy on the validation set is saved for
further evaluation. In the last stage, the fully connected layers of the trained CNN models with
modified structures are fine-tuned via the ABC algorithm to further improve the models'
overall performance.
Moreover, multi-class classification is performed following the same procedure for the
further diagnosis of specific TB manifestations (consolidation, effusion, fibrosis, infiltration, mass,
nodule and pleural thickening) from the CXR images in the NIH Chest X-Ray8 Dataset.
5.4.2 Ratio Comparison
The ratio comparison method is used in our experiment to compare the performance of different
CNN models on each dataset.
The training process of a CNN model constantly adjusts parameters according to the
training set and evaluates performance on the validation set, both to estimate how the model
might perform on unseen data and to decide whether to continue tuning the parameters. A testing
set independent of both is therefore necessary to measure the performance of a trained
model in an unbiased way. For each classification task, a certain number of CXR images
from the dataset is set aside for testing the final performance of the trained CNN models, and the
rest is split into train/valid sets with ratios of 90%-10%, 80%-20% and 70%-30% respectively.
It is worth noting that the images reserved for testing in each dataset remain the same
regardless of the train/valid distribution, allowing parallel comparison of the trained CNN
models on the same dataset.
5.4.3 Parameter Settings
Our experiment is implemented in Python 3.5 with two deep learning libraries, PyTorch and
FastAI. The whole process runs on the Windows 10 operating system with the following hardware
deployment:
Table 5-1: Hardware Deployments
CPU: Intel Xeon E5-2623 V4, 2.60 GHz base frequency, 8 cores
GPU: NVidia Quadro P4000, 8 GB GDDR5 memory, 1792 CUDA cores
RAM: 30 GB
SSD: 250 GB
The detailed separation of CXRs with different train/valid ratios within each dataset for the basic
abnormality detection is given in Table 5-2.
Table 5-2: Image Separations for Abnormality Diagnosis

CXR Dataset | Class | Train/Valid = 9:1 | Train/Valid = 8:2 | Train/Valid = 7:3 | Testing Set
Montgomery County CXR Dataset | Normal | 65 / 10 | 60 / 15 | 52 / 23 | 5
Montgomery County CXR Dataset | Abnormal with TB | 50 / 5 | 45 / 10 | 38 / 17 | 3
Shenzhen Hospital CXR Dataset | Normal | 270 / 30 | 240 / 60 | 210 / 90 | 6
Shenzhen Hospital CXR Dataset | Abnormal with TB | 285 / 35 | 250 / 70 | 225 / 95 | 16
Chest X-Ray8 Dataset | Normal | 29250 / 3250 | 26000 / 6500 | 22750 / 9750 | 1831
Chest X-Ray8 Dataset | Abnormal with TB | 18760 / 2090 | 16680 / 4170 | 14590 / 6260 | 464

(Each ratio cell lists the training / validation counts; the testing set is shared across all three train/valid ratios.)
The original distribution of CXRs in the NIH Chest X-Ray8 Dataset for the diagnosis of specific
TB manifestations is given in Table 5-3.
Table 5-3: Original CXR Distribution with Specific TB Manifestations in Chest X-Ray8

Manifestation | Consolidation | Effusion | Fibrosis | Infiltration | Mass | Nodule | Pleural Thickening
Images | 324 | 2035 | 641 | 5133 | 1313 | 1888 | 851
Since the distribution of CXRs across the TB manifestation classes is strongly imbalanced,
models trained on this dataset would show a strong bias in their predictions. Therefore,
data augmentation has been implemented to increase the number of images in the under-represented
classes and thereby create an evenly distributed dataset that eliminates this interference.
In our experiment, common augmentation methods such as horizontal flipping, rotation, contrast
adjustment and translation are used to generate new images. After data augmentation, the
distribution of CXRs in the NIH Chest X-Ray8 Dataset for TB manifestation diagnosis is as presented
in Table 5-4. It is worth noting that images in the testing set have not been augmented, to ensure
the quality of the results when testing the models' transfer-learning performance.
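The augmentation operations mentioned can be sketched in NumPy (simplified versions; rotation is omitted here for brevity, and the probabilities and ranges are illustrative assumptions, not the thesis's actual pipeline):

```python
import numpy as np

def augment(img, rng):
    """Randomly apply horizontal flip, contrast adjustment, and translation."""
    out = img.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                # horizontal flip
    if rng.random() < 0.5:
        factor = rng.uniform(0.8, 1.2)      # contrast adjustment around the mean
        out = np.clip(out.mean() + factor * (out - out.mean()), 0, 255)
    if rng.random() < 0.5:
        shift = int(rng.integers(-4, 5))    # small horizontal position translation
        out = np.roll(out, shift, axis=1)
    return out

rng = np.random.default_rng(0)
cxr = rng.integers(0, 256, size=(64, 64)).astype(float)  # dummy grayscale CXR
augmented = [augment(cxr, rng) for _ in range(5)]
print(len(augmented), augmented[0].shape)
```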
Table 5-4: Augmented CXR Distribution in Chest X-Ray8 for TB Manifestations Diagnosis

TB Manifestation | Train/Valid = 9:1 | Train/Valid = 8:2 | Train/Valid = 7:3 | Testing Set
Consolidation | 4460 / 500 | 3970 / 990 | 3470 / 1490 | 14
Effusion | 4500 / 500 | 4000 / 1000 | 3500 / 1500 | 85
Fibrosis | 4460 / 500 | 3970 / 990 | 3470 / 1490 | 21
Infiltration | 4500 / 500 | 4000 / 1000 | 3500 / 1500 | 133
Mass | 4500 / 500 | 4000 / 1000 | 3500 / 1500 | 63
Nodule | 4500 / 500 | 4000 / 1000 | 3500 / 1500 | 88
Pleural Thickening | 4500 / 500 | 3980 / 1000 | 3480 / 1500 | 21

(Each ratio cell lists the training / validation counts; the testing set is shared across all three train/valid ratios.)
5.5 Results and Discussion
5.5.1 CNN Modification and Division with Different Learning Rates
We modified the original CNN models by changing the structure of the last
few layers, separating each model into 3 parts, and assigning each part a different learning rate
during training to improve the models' performance on the provided CXR datasets.
The validation accuracies achieved at each training epoch by the six CNN models introduced in
Chapter 4 and their modified versions, on the three public CXR datasets with
different train/valid ratios, are given below for TB diagnosis. The accuracy averaged over all six
CNN models is also calculated at each stage to compare and analyze the overall
trend on the same dataset across train/valid ratios.
Table 5-5, Table 5-6 and Table 5-7 present the validation accuracy on the Montgomery County CXR
Dataset, the smallest of the three datasets used in our experiment. Among the raw
models, VGG19 and ResNet50 show the best and most stable performance, achieving 90%,
88% and 93.33% validation accuracy with train/valid ratios of 7:3, 8:2 and 9:1
respectively. In their modified versions, VGG19 and ResNet50 maintain this stability and
further improve accuracy, reaching 97.5% (VGG19) and 95% (ResNet50) at 7:3, 96% at 8:2, and
100% at 9:1 for both models. The highest accuracies achieved by
the other raw and modified CNN models are less than or equal to those of VGG19 and
ResNet50, and their performance varies more with the train/valid ratio. Throughout the
training process, the modified CNN models tend to perform better, with an obvious
epoch-by-epoch increase in accuracy compared to the raw models.
Figure 5-5 illustrates the overall TB-diagnosis performance of both the raw and modified CNN
models on the Montgomery County CXR Dataset at train/valid ratios of 7:3, 8:2 and 9:1,
using the accuracies averaged over all six models. The line chart shows a clear
accuracy improvement by the modified CNN models over the raw models in all train/valid
ratio cases. Moreover, both raw and modified models achieve their
best accuracy at a 9:1 ratio. For the 7:3 and 8:2 ratios, the nearly overlapping trend lines
indicate similar performance in both the raw and modified cases, though models trained at
8:2 perform slightly better than those at 7:3 in general.
Table 5-5: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=7:3 on Montgomery County
CXR Dataset for Abnormality Detection
CNN
Model
Train/Valid = 7:3
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 65.00 70.00 67.50 72.50 75.00 80.00 82.50 77.50 82.50 80.00 82.50 77.50
VGG19 67.50 65.00 72.50 70.00 75.00 82.50 80.00 82.50 85.00 90.00 87.50 85.00
Incep V3 65.00 67.50 72.50 72.50 77.50 75.00 80.00 85.00 82.50 85.00 80.00 82.50
ResNet34 72.50 70.00 75.00 80.00 80.00 85.00 77.50 85.00 87.50 82.50 87.50 85.00
ResNet50 70.00 75.00 80.00 77.50 85.00 82.50 82.50 80.00 87.50 85.00 90.00 82.50
ResNet101 70.00 75.00 77.50 72.50 80.00 82.50 80.00 87.50 85.00 80.00 87.50 85.00
Average 68.30 70.42 74.17 74.17 78.75 81.25 80.42 82.92 85.00 83.75 85.80 82.92
Modified
VGG16 70.00 75.00 82.50 87.50 82.50 85.00 82.50 90.00 85.00 87.50 92.50 85.00
VGG19 77.50 85.00 87.50 82.50 77.50 82.50 90.00 92.50 97.50 92.50 95.00 95.00
Incep V3 75.00 72.50 80.00 75.00 80.00 82.50 82.50 85.00 87.50 85.00 90.00 87.50
ResNet34 75.00 82.50 87.50 92.50 90.00 95.00 92.50 87.50 90.00 82.50 85.00 87.50
ResNet50 72.50 72.50 77.50 80.00 87.50 90.00 87.50 85.00 90.00 95.00 90.00 92.50
ResNet101 75.00 80.00 77.50 80.00 82.50 87.50 90.00 82.50 85.00 92.50 95.00 90.00
Average 74.17 77.92 82.08 82.92 83.33 87.08 87.50 87.08 89.17 89.17 91.25 89.58
Table 5-6: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=8:2 on Montgomery County
CXR Dataset for Abnormality Detection
CNN
Model
Train/Valid = 8:2
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 68.00 76.00 84.00 80.00 80.00 76.00 80.00 76.00 84.00 84.00 80.00 80.00
VGG19 72.00 68.00 76.00 80.00 84.00 80.00 84.00 80.00 88.00 84.00 80.00 84.00
Incep V3 68.00 72.00 80.00 76.00 72.00 80.00 84.00 80.00 80.00 80.00 84.00 84.00
ResNet34 76.00 80.00 72.00 76.00 80.00 84.00 76.00 88.00 84.00 80.00 88.00 84.00
ResNet50 72.00 76.00 80.00 76.00 84.00 84.00 80.00 80.00 84.00 88.00 84.00 88.00
ResNet101 76.00 72.00 80.00 84.00 80.00 88.00 80.00 84.00 88.00 80.00 88.00 84.00
Average 72.00 74.00 78.67 78.67 80.00 82.00 80.67 81.33 84.67 82.67 84.00 84.00
Modified
VGG16 76.00 80.00 76.00 84.00 88.00 84.00 80.00 88.00 84.00 88.00 92.00 88.00
VGG19 76.00 88.00 92.00 88.00 88.00 96.00 92.00 88.00 88.00 92.00 96.00 96.00
Incep V3 72.00 76.00 76.00 80.00 80.00 84.00 88.00 92.00 92.00 84.00 84.00 88.00
ResNet34 80.00 88.00 88.00 88.00 84.00 80.00 84.00 88.00 92.00 84.00 88.00 92.00
ResNet50 84.00 76.00 80.00 84.00 84.00 88.00 88.00 76.00 92.00 92.00 96.00 92.00
ResNet101 80.00 84.00 76.00 88.00 92.00 88.00 84.00 88.00 84.00 92.00 96.00 88.00
Average 78.00 82.00 81.33 85.33 86.00 86.67 86.00 86.67 88.67 88.67 92.00 90.67
Table 5-7: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=9:1 on Montgomery County
CXR Dataset for Abnormality Detection
CNN
Model
Train/Valid = 9:1
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 60.00 73.33 73.33 80.00 73.33 80.00 80.00 73.33 86.67 80.00 86.67 80.00
VGG19 66.67 73.33 73.33 80.00 86.67 93.33 80.00 86.67 86.67 93.33 80.00 86.67
Incep V3 60.00 66.67 66.67 73.33 80.00 80.00 86.67 86.67 73.33 80.00 86.67 80.00
ResNet34 73.33 66.67 73.33 80.00 86.67 86.67 73.33 80.00 80.00 86.67 80.00 86.67
ResNet50 66.67 80.00 73.33 86.67 80.00 86.67 73.33 80.00 93.33 86.67 93.33 86.67
ResNet101 73.33 66.67 80.00 86.67 73.33 80.00 86.67 93.33 86.67 93.33 80.00 86.67
Average 66.67 71.10 73.30 81.11 80.00 84.45 80.00 83.33 84.45 86.67 84.45 84.45
Modified
VGG16 86.67 86.67 80.00 93.33 93.33 93.33 86.67 86.67 80.00 86.67 93.33 86.67
VGG19 93.33 86.67 93.33 80.00 93.33 86.67 80.00 93.33 93.33 100 86.67 100
Incep V3 73.33 66.67 80.00 73.33 86.67 80.00 93.33 86.67 86.67 80.00 93.33 86.67
ResNet34 80.00 86.67 93.33 93.33 80.00 86.67 86.67 86.67 93.33 93.33 93.33 93.33
ResNet50 86.67 80 86.67 86.67 93.33 93.33 100 93.33 86.67 93.33 93.33 100
ResNet101 86.67 93.33 86.67 80 93.33 100 86.67 93.33 100 86.67 86.67 93.33
Average 84.45 83.34 86.67 84.44 90.00 90.00 88.89 90.00 90.00 90.00 91.11 93.33
Figure 5-5: Averaged Accuracy Comparison on Montgomery County CXR Dataset for Abnormality Detection
Table 5-8, Table 5-9 and Table 5-10 give the validation accuracy for TB diagnosis by the different
CNN models on the Shenzhen Hospital CXR Dataset. In training both the raw and modified
CNN models, the residual CNN series, ResNet34, ResNet50 and ResNet101, delivered stable,
outstanding performance, achieving the top three validation accuracies at train/valid
ratios of 7:3, 8:2 and 9:1 respectively. Moreover, the modified CNN models again tend to reach
generally higher accuracy than the raw models.
The line chart in Figure 5-6 shows that on the Shenzhen Hospital CXR Dataset, the modified
models generally outperform the raw models in TB diagnosis across all three train/valid ratio
cases. Furthermore, for both raw and modified models, the overall validation accuracy
at a 9:1 ratio is better than at 8:2, which in turn is better than at 7:3
in general.
Table 5-8: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=7:3 on Shenzhen Hospital CXR
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 7:3
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 61.62 70.27 78.38 82.70 83.78 81.62 85.95 87.03 84.86 87.03 85.41 85.41
VGG19 70.81 76.76 80.54 84.32 82.16 83.78 80.54 86.49 85.41 86.49 87.03 85.41
Incep V3 74.05 71.35 81.08 80.00 83.24 84.32 84.86 87.03 86.49 87.03 84.32 85.95
ResNet34 77.30 76.22 80 82.16 85.95 87.57 85.41 88.11 87.03 87.57 86.49 84.32
ResNet50 74.59 78.38 81.62 83.78 86.49 81.08 84.32 83.78 87.03 85.95 87.03 86.49
ResNet101 74.59 80 82.70 83.24 85.41 80 86.49 84.32 88.11 87.57 86.49 87.03
Average 72.16 75.50 80.72 82.70 84.51 83.00 84.60 86.13 86.49 86.94 86.13 85.77
Modified
VGG16 77.84 80.00 82.70 89.73 87.57 88.11 83.24 85.41 87.03 84.32 88.11 87.03
VGG19 79.46 82.70 85.41 82.16 88.11 89.73 87.03 90.27 91.35 90.81 91.89 91.89
Incep V3 77.84 82.70 80.00 83.24 88.65 87.57 89.73 91.35 90.27 91.35 89.73 89.73
ResNet34 80.00 76.76 86.49 83.78 87.03 90.81 91.89 90.81 91.35 88.65 90.27 90.81
ResNet50 81.62 85.41 87.57 88.11 87.03 77.84 90.81 90.27 91.35 89.73 89.19 90.27
ResNet101 80.00 82.16 83.24 83.78 86.49 84.86 88.65 91.89 90.81 91.35 89.73 90.81
Average 79.46 81.62 84.24 85.13 87.48 86.49 88.56 90.00 90.36 89.37 89.82 90.09
Table 5-9: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=8:2 on Shenzhen Hospital CXR
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 8:2
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 75.38 80.00 79.23 81.54 83.08 86.15 85.38 88.46 86.92 89.23 87.69 86.92
VGG19 78.46 80.77 80.00 83.85 85.38 87.69 89.23 88.46 86.92 89.23 87.69 87.69
Incep V3 75.38 82.31 83.08 80.00 84.62 86.92 88.46 90.00 88.46 90.77 90.00 87.69
ResNet34 80.77 82.31 81.54 84.62 86.15 88.46 90.00 89.23 91.54 88.46 91.54 90.77
ResNet50 80.00 81.54 83.85 86.92 83.85 87.69 86.92 90.00 90.77 86.92 90.00 90.77
ResNet101 79.23 81.54 82.31 83.85 87.69 84.62 88.46 90.77 89.23 91.54 90.77 91.54
Average 78.20 81.41 81.67 83.46 85.13 86.92 88.08 89.49 88.97 89.36 89.62 89.23
Modified
VGG16 81.54 82.31 83.08 84.62 85.38 90.00 89.23 91.54 88.46 92.31 90.77 90.00
VGG19 82.31 83.08 85.38 87.69 84.62 88.46 91.54 89.23 92.31 87.69 90.77 91.54
Incep V3 83.85 86.15 88.46 87.69 85.38 89.23 90.77 92.31 92.31 90.00 91.54 89.23
ResNet34 81.54 86.15 84.62 87.69 90.77 93.85 92.31 94.62 95.38 89.23 91.54 93.08
ResNet50 80.77 85.38 86.92 88.46 88.46 91.54 90.00 93.85 92.31 94.62 89.23 90.77
ResNet101 84.62 88.46 83.85 89.23 91.54 93.08 93.85 96.15 94.62 95.38 91.54 92.31
Average 82.44 85.26 85.39 87.56 87.69 91.03 91.28 92.95 92.57 91.54 90.90 91.16
Table 5-10: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=9:1 on Shenzhen Hospital CXR
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 9:1
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 76.92 78.46 83.08 86.15 83.08 87.69 87.69 86.15 84.62 89.23 90.77 87.69
VGG19 79.23 81.54 82.31 86.15 87.69 84.62 89.23 90.77 87.69 86.15 89.23 89.23
Incep V3 78.46 81.54 84.62 83.08 86.15 87.69 89.23 90.77 89.23 90.77 87.69 89.23
ResNet34 81.54 83.08 84.62 82.31 87.69 86.15 89.23 90.77 90.77 92.31 89.23 90.77
ResNet50 80.77 83.08 86.15 84.62 84.62 87.69 89.23 90.77 89.23 92.31 90.77 92.31
ResNet101 81.54 80.00 84.62 86.15 87.69 89.23 88.46 92.31 90.77 93.85 92.31 90.77
Average 79.74 81.28 84.23 84.74 86.15 87.18 88.85 90.26 88.72 90.77 90.00 90.00
Modified
VGG16 82.31 83.85 84.62 88.46 86.15 87.69 89.23 89.23 90.77 90.77 92.31 90.77
VGG19 83.08 86.15 84.62 89.23 88.46 87.69 90.77 93.85 92.31 89.23 91.54 92.31
Incep V3 83.08 85.38 87.69 86.15 87.69 88.46 89.23 90.77 91.54 93.08 91.54 92.31
ResNet34 83.08 87.69 89.23 90.00 88.46 91.54 93.85 95.38 92.31 92.31 95.38 93.85
ResNet50 84.62 86.15 87.69 89.23 91.54 93.85 88.46 92.31 95.38 93.85 90.77 92.31
ResNet101 83.08 86.15 87.69 90.77 89.23 92.31 93.85 90.77 95.38 96.92 92.31 95.38
Average 83.21 85.90 86.92 88.97 88.59 90.26 90.90 92.05 92.95 92.69 92.31 92.82
Figure 5-6: Averaged Accuracy Comparison on Shenzhen Hospital CXR Dataset for Abnormality Detection
Table 5-11, Table 5-12 and Table 5-13 give the validation accuracy for TB diagnosis via different
CNN models on the NIH Chest X-Ray8 Dataset with different train/valid ratios. During the training
of both the raw and modified CNN models, the validation accuracy increases with less oscillation
compared to the models trained on the previous two smaller datasets. In general, the residual CNN
series delivers relatively stable and outstanding accuracy compared to the other models in both
the raw and modified cases.
Figure 5-7 illustrates a parallel comparison of the averaged validation accuracy on the NIH Chest
X-Ray8 Dataset with different train/valid ratios in both the raw and modified cases. Each mildly
upward line in the figure indicates a smooth increase of the validation accuracy during training.
As with the previous two datasets, the modified models show a significant improvement in accuracy.
Moreover, the overall accuracy on the validation set is generally best with a train/valid ratio of
9:1, followed by 8:2 and then 7:3.
Table 5-11: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=7:3 on NIH Chest X-Ray8
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 7:3
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 72.50 74.77 75.33 76.38 76.98 77.26 77.61 78.93 79.17 78.20 79.18 80.31
VGG19 76.16 76.94 77.84 78.54 78.36 79.06 79.91 80.46 80.77 79.95 80.62 81.17
Incep V3 77.11 78.06 78.44 79.35 79.63 79.58 79.87 79.78 80.74 81.34 81.64 81.36
ResNet34 77.46 77.95 79.23 79.76 80.15 80.64 80.62 81.79 82.24 82.54 82.40 82.48
ResNet50 75.06 76.25 77.18 77.48 78.41 79.50 79.94 80.21 80.58 80.50 79.52 80.92
ResNet101 72.64 73.77 74.70 77.45 77.58 79.76 80.29 80.73 81.53 81.88 82.20 81.17
Average 75.16 76.29 77.12 78.16 78.52 79.30 79.71 80.32 80.84 80.74 80.93 81.24
Modified
VGG16 78.93 79.36 80.26 82.24 82.48 83.24 84.45 85.33 86.57 86.61 87.29 87.07
VGG19 85.33 86.75 87.14 85.95 84.88 86.99 86.57 87.29 87.17 87.68 86.33 87.01
Incep V3 82.48 83.53 84.43 85.72 86.12 86.85 86.71 87.46 87.20 87.98 87.98 87.17
ResNet34 83.75 86.85 87.24 88.00 86.84 87.41 87.25 87.98 87.66 88.19 87.95 88.06
ResNet50 84.64 84.92 85.73 86.57 86.99 86.50 87.91 87.87 88.03 87.76 87.95 87.17
ResNet101 87.09 85.61 87.58 86.34 84.68 87.88 88.44 86.91 86.18 87.65 87.54 88.23
Average 83.70 84.50 85.40 85.80 85.33 86.48 86.89 87.14 87.14 87.65 87.51 87.45
Table 5-12: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=8:2 on NIH Chest X-Ray8
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 8:2
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 78.80 79.98 80.05 80.86 80.62 81.01 81.47 81.79 82.49 82.91 83.09 83.06
VGG19 77.39 78.34 79.28 79.63 79.92 80.97 80.14 80.70 80.86 81.96 82.54 82.66
Incep V3 78.75 79.75 80.20 81.15 81.32 81.19 82.36 83.69 84.19 84.09 84.69 85.38
ResNet34 81.20 80.13 80.98 81.40 80.54 82.68 82.86 83.69 84.19 84.09 84.69 85.38
ResNet50 77.64 78.80 79.63 80.34 79.37 80.97 81.70 82.06 82.35 83.35 82.76 81.89
ResNet101 74.97 75.94 77.20 77.68 78.93 79.48 81.49 81.53 81.74 83.40 83.53 83.81
Average 78.13 78.82 79.56 80.18 80.12 81.05 81.67 82.24 82.64 83.30 83.55 83.70
Modified
VGG16 80.84 81.60 82.00 82.32 84.21 84.93 87.01 88.49 89.79 90.22 89.53 89.07
VGG19 81.38 81.90 82.38 83.79 85.07 85.87 87.17 88.08 89.48 90.38 89.73 90.28
Incep V3 82.16 83.74 84.25 84.87 86.57 87.53 88.54 88.40 89.24 90.14 91.09 90.62
ResNet34 82.49 85.67 87.28 88.05 89.35 88.85 89.60 89.75 90.63 91.31 90.82 91.26
ResNet50 82.16 83.04 83.60 84.78 87.82 88.37 88.83 89.86 90.39 90.23 90.53 89.71
ResNet101 83.86 84.70 86.86 87.72 88.81 89.87 89.35 90.15 89.74 90.31 90.87 90.15
Average 82.15 83.44 84.40 85.26 86.97 87.57 88.42 89.12 89.88 90.43 90.43 90.18
Table 5-13: Valid Accuracy for Each Epoch During Training with Train/Valid Ratio=9:1 on NIH Chest X-Ray8
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 9:1
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 81.93 83.58 83.88 85.43 84.98 85.15 86.85 87.39 86.76 87.88 88.16 87.34
VGG19 81.17 82.79 83.50 84.57 84.79 84.98 85.99 85.94 86.37 87.36 87.34 86.61
Incep V3 81.01 82.85 82.79 83.87 84.10 85.19 86.18 86.31 86.20 88.24 88.54 88.07
ResNet34 82.25 83.22 84.63 84.79 85.19 86.05 86.74 86.24 86.10 86.29 87.32 88.07
ResNet50 79.87 81.87 82.68 85.02 86.03 87.32 87.58 88.50 88.63 88.80 89.78 89.08
ResNet101 77.47 79.44 81.24 82.58 82.90 85.58 80.82 85.13 88.35 88.99 88.71 88.48
Average 80.62 82.29 83.12 84.38 84.67 85.71 85.69 86.59 87.07 87.93 88.31 87.94
Modified
VGG16 84.81 88.56 88.07 90.51 90.75 91.52 91.39 90.39 92.02 93.54 93.69 92.15
VGG19 88.40 89.53 88.31 90.34 91.05 89.98 88.16 90.97 91.01 92.17 93.22 91.22
Incep V3 89.66 88.07 90.52 91.33 90.88 91.93 92.47 92.15 93.60 92.32 91.18 93.26
ResNet34 89.55 88.84 90.73 89.83 91.27 90.69 91.76 92.79 93.86 93.65 92.59 92.37
ResNet50 90.19 91.27 90.77 92.30 92.94 93.46 92.92 93.91 94.38 94.23 94.42 93.22
ResNet101 87.58 89.10 93.16 92.75 91.69 92.45 93.45 94.06 93.26 92.06 90.69 93.09
Average 88.37 89.23 90.26 91.18 91.43 91.67 91.69 92.38 93.02 93.00 92.63 92.55
Figure 5-7: Averaged Accuracy Comparison on NIH Chest X-Ray8 Dataset for Abnormality Detection
Table 5-14, Table 5-15 and Table 5-16 show the validation accuracy for the detection of specific
TB manifestations among seven TB-related diseases on the NIH Chest X-Ray8 Dataset with different
train/valid ratios. Among all CNN models, ResNet34 delivers outstanding performance in the
multi-class classification task in both the raw and modified conditions.
Figure 5-8 illustrates a parallel comparison of the averaged validation accuracies achieved by the
raw and modified models for the detection of TB-related diseases. The plot shows that both the
accuracy and its growth rate during training are clearly higher for the modified models than for
the raw models. In addition, accuracies achieved on the validation set are generally highest with
a train/valid ratio of 9:1, followed by 8:2 and then 7:3.
In general, the overall accuracies achieved by the CNN models with modified structures are
significantly higher than those achieved by the models in their original structures.
Table 5-14: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=7:3 on NIH Chest X-Ray8
Dataset for TB Related Disease Detection
CNN
Model
Train/Valid = 7:3
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 41.68 43.01 44.71 46.29 47.17 48.01 48.92 48.33 49.27 49.95 50.88 51.29
VGG19 34.51 36.95 37.79 38.80 41.26 42.10 44.05 45.75 47.55 48.33 48.68 50.15
Incep V3 41.30 41.83 43.04 44.72 45.27 46.72 47.63 48.44 48.77 49.02 50.06 51.23
ResNet34 39.88 40.74 42.30 44.12 45.13 46.07 48.06 48.92 49.40 50.88 51.06 52.22
ResNet50 40.37 42.54 44.66 45.78 46.53 47.24 48.08 49.73 49.88 51.36 52.14 52.42
ResNet101 37.02 40.31 41.85 43.10 46.15 47.60 48.23 48.38 49.27 50.06 51.07 51.75
Average 39.13 40.90 42.39 43.80 45.34 46.29 47.50 48.26 49.02 49.93 50.65 51.51
Modified
VGG16 52.78 55.31 61.66 69.73 71.06 76.98 78.39 79.78 81.10 80.52 81.85 81.51
VGG19 54.53 57.40 64.66 70.85 72.20 76.69 78.10 80.52 81.13 81.97 81.68 81.85
Incep V3 52.22 54.53 57.24 61.66 66.56 72.66 76.69 78.68 80.30 81.14 81.45 80.52
ResNet34 51.06 55.31 59.32 64.66 66.17 71.74 73.38 75.24 77.70 79.78 81.89 82.42
ResNet50 51.75 54.46 57.24 61.37 65.36 67.02 71.45 73.19 75.24 79.37 81.10 80.30
ResNet101 53.37 59.32 61.06 66.83 69.27 73.33 74.38 76.66 78.36 79.75 82.71 82.47
Average 52.62 56.06 60.20 65.85 68.44 73.07 75.40 77.34 78.97 80.42 81.78 81.51
Table 5-15: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=8:2 on NIH Chest X-Ray8
Dataset for TB Related Disease Detection
CNN
Model
Train/Valid = 8:2
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 43.95 44.21 45.79 46.26 47.58 47.81 50.04 52.06 52.41 52.88 52.77 54.46
VGG19 41.76 43.41 45.34 46.70 48.40 49.67 50.14 50.83 51.83 52.36 52.69 53.58
Incep V3 43.94 45.35 46.07 47.41 48.48 49.18 50.36 51.69 52.45 52.69 53.17 53.60
ResNet34 45.73 46.03 47.32 48.45 49.58 50.14 51.09 52.41 52.77 52.91 54.24 54.89
ResNet50 42.62 44.14 44.96 45.70 47.48 48.52 49.61 50.70 51.70 51.93 53.41 54.46
ResNet101 39.01 42.61 43.55 42.75 46.07 48.55 48.83 49.33 52.52 52.11 53.58 55.86
Average 42.84 44.29 45.51 46.21 47.93 48.98 50.01 51.17 52.28 52.48 53.31 54.48
Modified
VGG16 62.14 70.75 73.94 76.32 79.43 80.56 81.48 82.13 84.17 84.99 85.96 85.19
VGG19 59.90 67.15 73.15 75.34 79.10 81.78 82.26 82.68 82.49 83.84 84.11 86.12
Incep V3 55.79 63.88 68.91 70.89 76.02 78.04 80.70 81.13 82.09 82.65 84.97 83.94
ResNet34 57.99 66.58 69.43 76.96 78.87 81.63 83.71 84.67 85.13 85.10 85.44 85.07
ResNet50 60.70 65.11 69.79 72.22 76.81 78.40 81.55 81.47 81.95 81.79 82.41 84.01
ResNet101 56.68 61.95 67.59 71.89 74.87 78.61 80.20 81.94 82.69 83.50 83.87 83.08
Average 58.87 65.90 70.47 73.94 77.52 79.84 81.65 82.34 83.09 83.65 84.46 84.57
Table 5-16: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=9:1 on NIH Chest X-Ray8
Dataset for TB Related Disease Detection
CNN
Model
Train/Valid = 9:1
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 45.11 45.94 47.34 48.29 49.51 50.54 51.69 53.80 54.80 56.29 56.29 56.80
VGG19 46.69 50.57 51.23 51.49 53.09 52.23 53.66 54.94 53.86 55.57 56.14 56.83
Incep V3 46.43 47.11 48.46 49.80 51.11 51.46 52.43 52.69 55.40 55.94 57.17 58.54
ResNet34 46.11 46.40 48.43 50.06 50.34 51.09 52.80 53.60 55.37 56.06 58.20 58.17
ResNet50 43.54 45.66 46.57 48.37 48.71 50.29 51.69 52.40 53.09 55.03 55.66 57.71
ResNet101 43.26 44.63 46.20 47.49 48.43 49.74 50.71 51.86 53.17 55.09 56.91 58.97
Average 45.19 46.72 48.04 49.25 50.20 50.89 52.16 53.22 54.28 55.66 56.73 57.84
Modified
VGG16 65.52 71.43 72.31 76.46 80.29 80.20 81.82 83.22 83.63 84.02 85.17 86.40
VGG19 67.06 70.74 77.77 77.94 80.26 83.31 83.94 84.97 84.94 86.03 86.46 85.31
Incep V3 67.00 73.71 75.63 78.60 81.40 82.94 84.57 84.51 86.03 87.40 86.63 86.66
ResNet34 64.31 69.00 70.46 76.74 80.77 81.43 84.40 84.74 85.17 85.71 86.25 86.43
ResNet50 60.14 66.49 69.57 75.45 76.14 78.14 82.49 83.49 84.43 81.69 85.74 84.49
ResNet101 61.06 66.49 69.57 75.06 75.46 78.14 76.14 82.49 83.49 84.43 84.94 81.69
Average 64.18 69.64 72.55 76.71 79.05 80.69 82.23 83.90 84.62 84.88 85.87 85.16
Figure 5-8: Averaged Accuracy Comparison on NIH Chest X-Ray8 Dataset for TB Related Disease Detection
5.5.2 Fine-Tune the Modified CNN Models Via ABC
After the model modifications, we further improved the classification accuracy by fine-tuning the
fully connected layers of the modified CNN models, already trained on the target CXR dataset, via
the ABC algorithm.
For train/valid ratios of 9:1, 8:2 and 7:3, the accuracies on both the validation and testing sets
achieved by the raw, modified and fine-tuned models are presented for comparison. Statistics that
measure detailed model performance, such as recall, precision and Area Under the ROC Curve (AUC),
are given in Chapter 6.
Accuracy, the ratio of correctly labeled images to the total number of images, is calculated via
the following equation:

$$\text{accuracy} = \frac{\sum_{i=1}^{n} TC_i}{N}$$

where $n$ represents the total number of classes, $TC_i$ indicates the number of correctly
classified instances within class $i$, and $N$ is the total number of instances in the validation
set.
Table 5-17 presents the validation and testing accuracy for different CNN models on the Montgomery
County CXR Dataset. For each train/valid ratio, the best validation accuracy increases from the
raw models to the modified models, and again to the models fine-tuned via ABC. This overall
improvement across the stages also holds for the testing set.
However, there are some inconsistencies between the validation and testing accuracy achieved by
the same model. For example, among the raw models with a train/valid ratio of 8:2, ResNet34
achieves 88% accuracy on the validation set but only 62.5% on the testing set, and VGG19, which
has the best validation accuracy of 93.33%, reaches only 75% testing accuracy. The same problem
also exists for the modified and fine-tuned CNN models. The reasons for this inconsistent
performance vary. First, since the validation set is involved in the training and fine-tuning
process to evaluate the model after each epoch of parameter updating, the model may have
overfitted, yielding high accuracy on the validation set but poor accuracy on the testing set.
Moreover, as this is the smallest CXR dataset used in our experiment, there are only 8 images in
the testing set; such a small number of test cases is not representative and may lead to biased
results. This is also the main cause of the fluctuations in the testing accuracy.
Table 5-17: Validation and Testing Accuracy Results on Montgomery County Chest X-Ray Dataset for
Abnormality Detection
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 82.50% 75.00% 84.00% 100% 86.67% 87.50%
VGG19 90.00% 75.00% 88.00% 87.50% 93.33% 75.00%
Inception V3 85.00% 87.50% 84.00% 87.50% 86.67% 87.50%
ResNet34 87.50% 87.50% 88.00% 62.50% 86.67% 100%
ResNet50 90.00% 87.50% 88.00% 75.00% 93.33% 100%
ResNet101 87.50% 75.00% 88.00% 75.00% 93.33% 87.50%
Modified
Model
VGG16 92.50% 100% 92.00% 87.50% 93.33% 100%
VGG19 97.50% 100% 96.00% 87.50% 100% 87.50%
Inception V3 90.00% 100% 92.00% 100% 93.33% 62.50%
ResNet34 95.00% 62.50% 92.00% 100% 93.33% 75.00%
ResNet50 95.00% 87.50% 96.00% 50.00% 100% 75.00%
ResNet101 95.00% 87.50% 96.00% 87.50% 100% 87.50%
Modified
Model
Fine-tuned
by ABC
VGG16 95.00% 87.50% 92.00% 87.50% 100% 75.00%
VGG19 100 % 100% 100% 75.00% 100% 87.50%
Inception V3 95.00% 62.50% 92.00% 100% 93.33% 87.50%
ResNet34 97.50% 75.00% 96.00% 75.00% 100% 100%
ResNet50 97.50% 87.50% 100% 87.50% 100% 87.50%
ResNet101 100% 100% 100% 100% 100% 87.50%
Table 5-18 gives the validation and testing accuracy achieved by different CNN models on the
Shenzhen Hospital CXR Dataset. For train/valid ratios of 7:3, 8:2 and 9:1, the accuracy on both
the validation and testing sets increases across the three stages of our experiment.
The inconsistency between validation and testing performance also appears on this dataset. Since
the testing set contains only 22 images, which is still relatively small and not representative
enough, some fluctuation and instability between the validation and testing performance is to be
expected.
Table 5-18: Validation and Testing Accuracy Results on Shenzhen Hospital Chest X-Ray Dataset for
Abnormality Detection
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 87.03% 81.82% 89.23% 77.27% 90.77% 68.18%
VGG19 87.03% 77.27% 89.23% 50.00% 90.77% 63.64%
Inception V3 87.03% 81.82% 90.77% 72.73% 90.77% 72.73%
ResNet34 88.11% 77.27% 91.54% 72.73% 92.31% 86.36%
ResNet50 87.03% 68.18% 90.77% 77.27% 92.31% 86.36%
ResNet101 88.11% 72.73% 91.54% 81.82% 93.85% 77.27%
Modified
Model
VGG16 89.73% 86.36% 92.31% 81.82% 92.31% 86.36%
VGG19 91.89% 86.36% 92.31% 90.91% 93.85% 81.82%
Inception V3 91.35% 86.36% 92.31% 86.36% 93.85% 77.27%
ResNet34 91.89% 86.36% 95.38% 72.73% 95.38% 77.27%
ResNet50 91.35% 86.36% 94.62% 68.18% 95.38% 90.91%
ResNet101 91.89% 77.27% 96.15% 86.36% 96.92% 77.27%
Modified
Model
Fine-tuned
by ABC
VGG16 92.43% 86.36% 92.31% 81.82% 93.85% 72.73%
VGG19 93.51% 90.91% 93.08% 81.82% 95.38% 86.36%
Inception V3 91.89% 77.27% 93.08% 86.36% 93.85% 77.27%
ResNet34 92.43% 72.73% 96.15% 81.82% 96.92% 81.82%
ResNet50 94.05% 86.36% 95.39% 90.90% 96.92% 86.36%
ResNet101 92.97% 81.82% 96.92% 90.90% 98.46% 95.45%
Table 5-19 presents the validation and testing accuracy results on the NIH Chest X-Ray8 Dataset
for lung abnormality detection. As with the previous two datasets, the accuracy on both the
validation and testing sets keeps increasing from the raw models to the modified models, and again
to the fine-tuned models, for all train/valid ratios. Moreover, the inconsistency between
validation and testing performance no longer appears, since the testing set has been increased to
over 2,000 images. This larger testing set greatly reduces the fluctuations in the models'
performance on test cases and thus provides a more unbiased and accurate measure of a model for
practical applications. Furthermore, as this is the largest dataset used in our experiment, the
training data is also large and comprehensive. Therefore, even the original CNN models, despite
relatively low accuracy on the validation set, still perform well on the testing set.
Table 5-19: Validation and Testing Accuracy Results on NIH Chest X-Ray8 Dataset for Abnormality
Detection
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 80.31% 90.11% 83.09% 93.03% 88.16% 92.33%
VGG19 81.17% 90.76% 82.66% 92.37% 87.36% 91.72%
Inception V3 81.64% 91.94% 84.06% 91.50% 88.54% 93.86%
ResNet34 82.54% 92.46% 85.38% 94.77% 88.07% 93.68%
ResNet50 80.92% 92.37% 83.35% 91.24% 89.78% 92.98%
ResNet101 82.20% 92.64% 83.81% 89.32% 89.48% 94.60%
Modified
Model
VGG16 87.29% 96.69% 90.22% 97.69% 93.69% 97.91%
VGG19 87.68% 97.69% 90.38% 97.30% 93.22% 98.00%
Inception V3 87.99% 96.51% 91.09% 98.82% 93.60% 97.65%
ResNet34 88.19% 97.21% 91.31% 97.21% 93.86% 97.82%
ResNet50 88.03% 97.86% 90.53% 97.86% 94.42% 98.17%
ResNet101 88.44% 97.99% 90.87% 97.78% 94.06% 98.61%
Modified
Model
Fine-tuned
by ABC
VGG16 87.97% 97.60% 91.15% 98.56% 93.82% 98.00%
VGG19 88.06% 96.73% 90.86% 98.39% 94.16% 98.04%
Inception V3 88.74% 98.21% 91.49% 97.91% 94.48% 97.65%
ResNet34 88.42% 97.91% 91.40% 97.08% 94.81% 98.26%
ResNet50 88.84% 98.26% 90.96% 98.61% 94.61% 98.87%
ResNet101 88.69% 97.39% 90.97% 98.78% 94.12% 96.82%
Table 5-20 presents the accuracy of different CNN models on both the validation and testing sets
of the NIH Chest X-Ray8 Dataset for the diagnosis of specific lung diseases. Compared to lung
abnormality detection, TB-related disease detection is a more complicated task, requiring the
model to recognize the patterns of multiple diseases and make predictions. Therefore, the overall
performance on this task is not as good as on abnormality detection. Following the same trend
observed in the lung abnormality detection experiments, the accuracy on both the validation and
testing sets keeps increasing from the raw models to the modified models, and again to the
fine-tuned models, for all train/valid ratios. Moreover, compared to the fine-tuning process, the
modification of the CNN structure contributes more to improving the classification accuracy.
Table 5-20: Validation and Testing Accuracy Results on NIH Chest X-Ray8 Dataset for TB Related Disease
Detection
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 51.29% 54.05% 54.46% 53.88% 56.80% 58.82%
VGG19 50.15% 54.00% 53.58% 52.71% 56.83% 57.41%
Inception V3 51.23% 48.24% 53.60% 52.71% 58.54% 57.65%
ResNet34 52.22% 53.41% 54.89% 56.47% 58.20% 59.53%
ResNet50 52.42% 48.00% 54.46% 48.00% 57.71% 59.06%
ResNet101 51.75% 53.18% 55.86% 54.35% 58.97% 61.18%
Modified
Model
VGG16 81.85% 81.65% 85.96% 88.94% 86.40% 88.71%
VGG19 81.97% 77.65% 86.12% 83.29% 86.46% 86.35%
Inception V3 81.45% 80.47% 84.97% 85.41% 87.40% 90.59%
ResNet34 82.42% 84.24% 85.44% 86.35% 86.43% 88.70%
ResNet50 81.10% 80.24% 84.01% 84.24% 85.74% 88.24%
ResNet101 82.70% 80.71% 83.87% 83.29% 84.94% 82.59%
Modified
Model
Fine-tuned
by ABC
VGG16 82.63% 83.06% 86.26% 87.29% 87.40% 87.53%
VGG19 82.47% 80.94% 86.68% 84.94% 86.94% 87.53%
Inception V3 82.81% 85.18% 85.62% 87.53% 88.86% 88.24%
ResNet34 83.62% 84.00% 86.07% 86.59% 86.54% 85.65%
ResNet50 82.66% 85.41% 84.11% 83.76% 86.43% 86.59%
ResNet101 83.94% 84.00% 84.54% 78.59% 85.00% 84.71%
5.5.3 Discussion and Conclusion
In this chapter, the performance of the CNN models for the diagnosis of TB and its specific
manifestations has been compared on the three public CXR datasets. Results have been given and
discussed in Sections 5.5.1 and 5.5.2.
From the result analysis, the modified CNN models generally show a significant improvement over
the whole training process and thereby achieve higher accuracy on the validation sets of all three
datasets, across the different train/valid ratios, compared to the raw models. In addition, the
modified CNN models fine-tuned via the ABC algorithm improve performance slightly further and
generally achieve the highest validation accuracy across all models.
However, inconsistent performance between the validation and testing sets of the same model
appears in the experiments on the two smaller datasets, the Montgomery County Chest X-Ray Dataset
and the Shenzhen Hospital Chest X-Ray Dataset: the testing sets contain so few images that the
testing data is not representative enough, which causes fluctuations in the testing accuracy. This
inconsistency does not appear on the NIH Chest X-Ray8 Dataset, since its large number of testing
images eliminates the fluctuations and provides a more unbiased and accurate measure of a CNN
model on previously unseen cases. Moreover, the performance of each participating CNN model varies
across datasets and train/valid ratios. Both the amount of data and the model itself introduce
unstable and unpredictable factors that influence a model's final performance, which makes it
harder to identify a single best model for TB diagnosis.
Therefore, the concept of an ensemble CNN structure is proposed in Chapter 6 to help generate a
stable model with consistent and better performance regardless of these external conditions.
Chapter 6
Increasing Accuracy of TB Detection via Ensemble Model
6.1 Ensemble Learning
Ensemble learning [65] is a machine learning method that integrates multiple learners and
generates the final output from the results of the integrated learners to achieve better
performance. Learners used in an ensemble need to maintain sufficient diversity so that they
capture different features of the same target. Therefore, two factors need to be considered during
ensemble learning: the selection and training of each learner, and how to combine the results of
the different learners into the final output.
So far, various ensemble learning methods, such as bagging, boosting and stacking, have been
proposed and are widely used for specific tasks.
The bagging method [66] starts with the training data. It randomly generates multiple subsets of
the training set and trains one classifier on each generated subset. All of these weakly trained
classifiers are then integrated by a combination algorithm to form the ensemble architecture. This
method uses different training subsets to ensure the diversity of the classifiers.
The boosting method [67] mainly focuses on reducing the bias among base models. Initially, every
sample in the training set is assigned the same weight. During repeated training and validation,
the weights of misclassified samples are increased while the others remain the same. Multiple
classifiers generated through these different weight-updating processes maintain diversity and are
combined to produce the ensemble architecture.
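One round of such a weight update can be sketched as follows (an AdaBoost-style illustration under assumed conventions, not the exact scheme of [67]; after up-weighting, the weights are renormalized):

```python
def boosting_reweight(weights, misclassified):
    """One boosting-style weight update (AdaBoost-flavoured sketch).

    weights: current per-sample weights; misclassified: parallel booleans.
    Misclassified samples are up-weighted by a factor derived from the
    weighted error rate, then all weights are renormalized to sum to 1.
    """
    total = sum(weights)
    error = sum(w for w, m in zip(weights, misclassified) if m) / total
    error = min(max(error, 1e-10), 1 - 1e-10)   # guard against error of 0 or 1
    factor = (1 - error) / error                # > 1 when better than chance
    updated = [w * factor if m else w for w, m in zip(weights, misclassified)]
    norm = sum(updated)
    return [w / norm for w in updated]
```

For example, with four equally weighted samples of which one is misclassified, the misclassified sample's weight rises from 0.25 to 0.5 after renormalization, so the next learner focuses on it.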
The stacking method [68] trains different classifiers on the target dataset, stacking one on top
of the other, so that each newly trained classifier corrects the errors of the previous ones.
Theoretically, this method subsumes the ensemble methods mentioned above.
In general, all ensemble methods consist of two steps: generating a set of simple models from the
original data, and combining them into one aggregated model. Figure 6-1 presents the basic
structure of an ensemble model.
Figure 6-1: Ensemble Model Structure
As illustrated, the manipulated training data is used to train a series of classifiers (tier-1
classifiers); the outputs of these classifiers are then fed into a tier-2 classifier (also known
as a meta-classifier) and reorganized to produce a more unbiased output.
The idea behind this structure is to learn from the data in a more unbiased way, based on the
knowledge learned by the different classifiers. For example, if one classifier learns a wrong
feature pattern from the dataset, it may misclassify new data with similar features; the tier-2
classifier of the ensemble model, however, may still learn correctly by organizing the knowledge
of all classifiers in an unbiased way, compensating for the weaknesses of individual classifiers
and producing the right classification result.
In general, the ability to trade off bias and variance among the base models, and to reduce the
risk of overfitting, makes an ensemble model superior to any single model [69]. Ensembles have
been applied successfully not only in supervised learning but also in many unsupervised learning
settings, such as probability density estimation.
6.2 Ensemble Combination Methods Used for TB Detection
Our experiment implements a neural network ensemble [70], an ensemble learning paradigm that
jointly uses the six CNN models with different structures to improve the overall performance of TB
detection. Two algebraic combiners, a linear average based ensemble and a voting based ensemble,
are implemented as the tier-2 classifier to generate a final classification result with higher
accuracy and more stable performance.
6.2.1 Linear Average Based Ensemble
The idea of a linear average based ensemble is simple and intuitive: calculate the linear average
of the outputs of all component classifiers. This method can efficiently mitigate the overfitting
present in any single model, so the ensemble model generalizes better during classification.
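For class-probability outputs, this combiner reduces to an element-wise mean (a minimal sketch assuming each component model emits a probability vector over the same ordered set of classes):

```python
def average_ensemble(prob_outputs):
    """Element-wise mean of the class-probability vectors produced by the
    component classifiers; the argmax of the result is the ensemble label.

    prob_outputs: one probability vector per model, all over the same
    ordered set of classes.
    """
    n_models = len(prob_outputs)
    n_classes = len(prob_outputs[0])
    return [sum(p[c] for p in prob_outputs) / n_models
            for c in range(n_classes)]
```

For instance, three models outputting [0.9, 0.1], [0.6, 0.4] and [0.3, 0.7] average to [0.6, 0.4], so the ensemble predicts the first class even though one model disagreed.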
An example is shown in Figure 6-2. The green line represents the decision boundary of a single
classifier trained on a binary classification task (blue dots vs. red dots). It shows an obvious
trend of overfitting: the overly complex model has picked up noise and random fluctuations in the
training data. Such an over-complicated model performs poorly on new data, limiting its ability to
generalize. After averaging the results of different classifiers trained on the same data, a
smooth black curve emerges as the final boundary separating the dots. This simpler black curve
cancels the negative impact of the noise and increases the classification accuracy on new data.
Figure 6-2: Overfitted model and linear averaged model
6.2.2 Voting Based Ensemble
A voting based ensemble generates the final result from the output agreed on by the majority of
the component classifiers. When multiple classes tie for the majority, the output is instead
obtained by averaging the probabilities calculated by the individual models. This method requires
diverse classifiers so that the errors of any single model are not amplified by the ensemble.
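The voting rule described above, including the tie-breaking fallback to averaged probabilities, can be sketched as follows (illustrative; assumes probability-vector outputs as in the previous combiner):

```python
from collections import Counter

def voting_ensemble(prob_outputs):
    """Majority vote over the component classifiers' predicted labels;
    when classes tie for the majority, fall back to averaging the
    probability vectors and taking the argmax.
    """
    preds = [max(range(len(p)), key=p.__getitem__) for p in prob_outputs]
    ranked = Counter(preds).most_common()
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]                    # clear majority winner
    # tie between majority groups: average the probabilities instead
    n = len(prob_outputs)
    avg = [sum(p[c] for p in prob_outputs) / n
           for c in range(len(prob_outputs[0]))]
    return max(range(len(avg)), key=avg.__getitem__)
```

With two models predicting different classes ([0.9, 0.1] vs. [0.2, 0.8]), the vote is tied, so the averaged probabilities [0.55, 0.45] decide in favor of the first class.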
6.3 Experiment Descriptions
In this chapter, the performance of the two ensemble models and their component CNN models on both
abnormality detection and TB-related disease detection is compared and analyzed in the context of
classification. The quality of the performance is evaluated statistically using the following
measures: accuracy, specificity, recall, precision, F1-score and AUC. The experimental settings
for the train/valid separation and the parameters remain the same as in Chapter 5.
6.4 Evaluation Metrics
Evaluation metrics provide an objective, statistical assessment of deep learning models. For
classification tasks, the quality of detection is measured by accuracy, recall, specificity,
precision, F1-score, AUC and the confusion matrix. Among these, recall, specificity, precision and
F1-score are mainly used to assess binary classification; the rest can be used for both binary and
multi-class classification.
Accuracy, the most commonly used measure, is the ratio of the number of correct predictions to the
total number of input samples. In abnormality detection, the number of pathological samples
correctly classified is called true positives (TP), and the number of correctly classified normal
samples is called true negatives (TN). The number of pathological samples incorrectly classified
as normal is called false negatives (FN), and the number of incorrectly classified normal samples
is called false positives (FP).
Recall, also called the true positive rate (TPR), measures how many people with TB are correctly
identified as having the TB-related manifestation. Specificity, also called the true negative rate
(TNR), evaluates how many healthy people are correctly identified as not having the TB-related
manifestation. Precision measures how many people with TB have been correctly identified among all
samples identified as having TB. The F1-score, the harmonic mean of precision and recall, measures
how precise and how robust the classifier is. Recall, specificity, precision and F1-score are
calculated as follows:
\[ \text{recall} = TPR = \frac{TP}{TP + FN} \]
\[ \text{specificity} = TNR = \frac{TN}{TN + FP} \]
\[ \text{precision} = \frac{TP}{TP + FP} \]
\[ \text{F1-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
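The four measures above follow directly from the confusion-matrix counts. A minimal sketch in plain Python (the counts in the example are illustrative, not taken from the experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute recall, specificity, precision and F1-score
    from binary confusion-matrix counts."""
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, specificity, precision, f1

# Example: 16 TB cases found, 1 missed; 22 normals kept, 2 flagged
recall, specificity, precision, f1 = classification_metrics(tp=16, tn=22, fp=2, fn=1)
```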
The statistical interpretation of the AUC is the probability that a randomly chosen case of a given
class is ranked above cases of the other classes by the classifier. This value is independent of the
classification threshold because it considers only the rank of each prediction.
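This rank-based interpretation can be computed directly by counting, over all positive–negative pairs of scores, how often the positive case outranks the negative one. A small illustrative sketch (the scores are made up for demonstration):

```python
def auc_from_ranks(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive case
    outranks a randomly chosen negative case; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# 8 of the 9 positive-negative pairs are ranked correctly -> AUC = 8/9
auc = auc_from_ranks([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Because only the ordering of the scores matters, any monotone rescaling of the classifier's outputs leaves this value unchanged, which is exactly the threshold independence noted above.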
An ideal classifier attains high values on all of the evaluation metrics introduced above.
6.5 Results and Discussion
6.5.1 Lung Abnormality Detection
Table 6-1, Table 6-2, Table 6-3 and Table 6-4 present the validation and testing accuracy as well
as the evaluation metrics introduced in Section 6.4 for the different CNN models with train/valid
ratios of 7:3, 8:2 and 9:1 respectively on the Montgomery County Chest X-Ray Dataset.
Table 6-1: Ratio Validation and Testing Accuracy Results on Montgomery Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 82.50% 75.00% 84.00% 100% 86.67% 87.50%
VGG19 90.00% 75.00% 88.00% 87.50% 93.33% 75.00%
Inception V3 85.00% 87.50% 84.00% 87.50% 86.67% 87.50%
ResNet34 87.50% 87.50% 88.00% 62.50% 86.67% 100%
ResNet50 90.00% 87.50% 88.00% 75.00% 93.33% 100%
ResNet101 87.50% 75.00% 88.00% 75.00% 93.33% 87.50%
Ensemble-Linear 92.50% 87.50% 92.00% 100% 93.33% 100%
Ensemble-Voting 92.50% 87.50% 92.00% 100% 93.33% 100%
Modified
Model
VGG16 92.50% 100% 92.00% 87.50% 93.33% 100%
VGG19 97.50% 100% 96.00% 87.50% 100% 87.50%
Inception V3 90.00% 100% 92.00% 100% 93.33% 62.50%
ResNet34 95.00% 62.50% 92.00% 100% 93.33% 75.00%
ResNet50 95.00% 87.50% 96.00% 50.00% 100% 75.00%
ResNet101 95.00% 87.50% 96.00% 87.50% 100% 87.50%
Ensemble-Linear 97.50% 100% 100% 100% 100% 100%
Ensemble-Voting 97.50% 100% 100% 100% 100% 100%
Modified
Model
Fine-tuned
by ABC
VGG16 95.00% 87.50% 92.00% 87.50% 100% 75.00%
VGG19 100% 100% 100% 75.00% 100% 87.50%
Inception V3 95.00% 62.50% 92.00% 100% 93.33% 87.50%
ResNet34 97.50% 75.00% 96.00% 75.00% 100% 100%
ResNet50 97.50% 87.50% 100% 87.50% 100% 87.50%
ResNet101 100% 100% 100% 100% 100% 87.50%
Ensemble-Linear 100% 100% 100% 100% 100% 100%
Ensemble-Voting 100% 100% 100% 100% 100% 100%
Table 6-2: Statistical Model Analysis with Train/Valid Ratio = 7:3 on Montgomery Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 7:3
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.913 0.706 0.857 0.774 0.926
VGG19 0.870 0.941 0.842 0.889 0.916
InceptionV3 0.783 0.941 0.762 0.842 0.882
ResNet34 0.913 0.824 0.875 0.848 0.941
ResNet50 0.957 0.824 0.933 0.875 0.982
ResNet101 0.783 1.000 0.773 0.872 0.923
Ensemble-Linear 0.913 0.941 0.889 0.914 0.972
Ensemble-Voting 0.913 0.941 0.889 0.914 0.941
Modified
Model
VGG16 0.957 0.882 0.938 0.909 0.980
VGG19 1.000 0.941 1.000 0.970 1.000
Inception V3 0.913 0.882 0.882 0.882 0.967
ResNet34 1.000 0.882 1.000 0.938 0.972
ResNet50 0.913 0.941 0.889 0.914 0.987
ResNet101 0.913 1.000 0.895 0.944 0.969
Ensemble-Linear 0.957 1.000 0.944 0.971 1.000
Ensemble-Voting 0.957 1.000 0.944 0.971 1.000
Modified
Model Fine-
tuned by
ABC
VGG16 0.957 0.941 0.941 0.941 0.992
VGG19 1.000 1.000 1.000 1.000 1.000
Inception V3 0.957 0.941 0.941 0.941 0.990
ResNet34 1.000 0.941 1.000 0.970 1.000
ResNet50 1.000 0.941 1.000 0.970 0.985
ResNet101 1.000 1.000 1.000 1.000 1.000
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
Table 6-3: Statistical Model Analysis with Train/Valid Ratio = 8:2 on Montgomery Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 8:2
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 1.000 0.600 1.000 0.750 0.940
VGG19 0.933 0.800 0.889 0.842 0.927
InceptionV3 0.733 1.000 0.714 0.833 0.900
ResNet34 0.867 0.900 0.818 0.857 0.920
ResNet50 0.800 1.000 0.769 0.870 0.900
ResNet101 0.867 0.900 0.818 0.857 0.900
Ensemble-Linear 0.867 1.000 0.833 0.909 0.973
Ensemble-Voting 0.867 1.000 0.833 0.909 0.967
Modified
Model
VGG16 0.867 1.000 0.833 0.909 0.987
VGG19 0.933 1.000 0.909 0.952 1.000
Inception V3 0.867 1.000 0.833 0.909 0.953
ResNet34 0.867 1.000 0.833 0.909 0.940
ResNet50 1.000 0.900 1.000 0.947 0.973
ResNet101 1.000 0.900 1.000 0.947 0.973
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
Modified
Model Fine-
tuned by
ABC
VGG16 0.867 1.000 0.833 0.909 0.987
VGG19 1.000 1.000 1.000 1.000 1.000
Inception V3 0.867 1.000 0.833 0.909 0.953
ResNet34 1.000 0.900 1.000 0.947 0.947
ResNet50 1.000 1.000 1.000 1.000 1.000
ResNet101 1.000 1.000 1.000 1.000 1.000
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
Table 6-4: Statistical Model Analysis with Train/Valid Ratio = 9:1 on Montgomery Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 9:1
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.900 0.800 0.800 0.800 0.860
VGG19 0.900 1.000 0.833 0.909 1.000
InceptionV3 0.800 1.000 0.714 0.833 0.980
ResNet34 0.900 0.800 0.800 0.800 0.940
ResNet50 0.900 1.000 0.833 0.909 0.980
ResNet101 1.000 0.800 1.000 0.889 0.880
Ensemble-Linear 1.000 0.800 1.000 0.889 0.940
Ensemble-Voting 1.000 0.800 1.000 0.889 0.980
Modified
Model
VGG16 0.900 1.000 0.833 0.909 0.980
VGG19 1.000 1.000 1.000 1.000 1.000
Inception V3 0.900 1.000 0.833 0.909 1.000
ResNet34 0.900 1.000 0.833 0.909 1.000
ResNet50 1.000 1.000 1.000 1.000 1.000
ResNet101 1.000 1.000 1.000 1.000 1.000
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
Modified
Model Fine-
tuned by
ABC
VGG16 1.000 1.000 1.000 1.000 1.000
VGG19 1.000 1.000 1.000 1.000 1.000
Inception V3 0.900 1.000 0.833 0.909 1.000
ResNet34 1.000 1.000 1.000 1.000 1.000
ResNet50 1.000 1.000 1.000 1.000 1.000
ResNet101 1.000 1.000 1.000 1.000 1.000
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
From the above tables, it is clear that the performance of the single CNN models is inconsistent
between the validation set and the testing set. Some models achieve a higher detection accuracy on
the validation set than others while performing poorly on the testing set. For example, during the
fine-tuning process with a train/valid ratio of 7:3, Inception V3 and ResNet34 achieve accuracies
of 95% and 97.5% on the validation set respectively, but only 62.5% and 75% on the testing set. The
two ensemble models greatly improve stability, providing consistent performance on both sets with
higher accuracy.
For both the linear-average and voting-based ensemble models, the classification accuracy reaches
100% in almost all validation and testing cases under all train/valid ratios during the model
modification and fine-tuning steps.
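As a sketch of the two combination rules (not the thesis code; the probability vectors are illustrative), the linear-average ensemble averages the models' softmax outputs before taking the arg-max, while the voting ensemble takes the majority of the models' individual arg-max decisions:

```python
import numpy as np

def ensemble_linear(probs):
    """Linear-average ensemble: average the class-probability
    vectors of all models, then take the arg-max class."""
    return int(np.argmax(probs.mean(axis=0)))

def ensemble_voting(probs):
    """Voting ensemble: each model votes for its arg-max class;
    the class with the most votes wins."""
    votes = np.argmax(probs, axis=1)
    return int(np.bincount(votes).argmax())

# Three models' softmax outputs over two classes (normal, abnormal).
probs = np.array([[0.10, 0.90],
                  [0.55, 0.45],
                  [0.60, 0.40]])
linear_pred = ensemble_linear(probs)   # averaging is swayed by the confident model
voting_pred = ensemble_voting(probs)   # majority vote follows the other two
```

The example is deliberately chosen so the two rules disagree: one highly confident model dominates the average but contributes only a single vote. This is consistent with the two ensembles performing almost, but not exactly, identically.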
Table 6-5, Table 6-6, Table 6-7 and Table 6-8 present the validation and testing accuracy as well
as the other statistical measures for the different CNN models on the Shenzhen Hospital Chest X-Ray
Dataset under different train/valid ratios.
Table 6-5: Ratio Validation and Testing Accuracy Results on Shenzhen Hospital Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 87.03% 81.82% 89.23% 77.27% 90.77% 68.18%
VGG19 87.03% 77.27% 89.23% 50.00% 90.77% 63.64%
Inception V3 87.03% 81.82% 90.77% 72.73% 90.77% 72.73%
ResNet34 88.11% 77.27% 91.54% 72.73% 92.31% 86.36%
ResNet50 87.03% 68.18% 90.77% 77.27% 92.31% 86.36%
ResNet101 88.11% 72.73% 91.54% 81.82% 93.85% 77.27%
Ensemble-Linear 89.19% 81.82% 93.08% 81.82% 93.85% 86.36%
Ensemble-Voting 89.19% 81.82% 93.08% 81.82% 93.85% 86.36%
Modified
Model
VGG16 89.73% 86.36% 92.31% 81.82% 92.31% 86.36%
VGG19 91.89% 86.36% 92.31% 90.91% 93.85% 81.82%
Inception V3 91.35% 86.36% 92.31% 86.36% 93.85% 77.27%
ResNet34 91.89% 86.36% 95.38% 72.73% 95.38% 77.27%
ResNet50 91.35% 86.36% 94.62% 68.18% 95.38% 90.91%
ResNet101 91.89% 77.27% 96.15% 86.36% 96.92% 77.27%
Ensemble-Linear 92.43% 90.91% 96.92% 90.91% 96.92% 90.91%
Ensemble-Voting 93.51% 90.91% 96.92% 90.91% 96.92% 90.91%
Modified
Model
Fine-tuned
by ABC
VGG16 92.43% 86.36% 92.31% 81.82% 93.85% 72.73%
VGG19 93.51% 90.91% 93.08% 81.82% 95.38% 86.36%
Inception V3 91.89% 77.27% 93.08% 86.36% 93.85% 77.27%
ResNet34 92.43% 72.73% 96.15% 81.82% 96.92% 81.82%
ResNet50 94.05% 86.36% 95.39% 90.90% 96.92% 86.36%
ResNet101 92.97% 81.82% 96.92% 90.90% 98.46% 95.45%
Ensemble-Linear 94.59% 90.91% 97.69% 95.45% 98.46% 95.45%
Ensemble-Voting 94.59% 90.91% 97.69% 95.45% 98.46% 95.45%
Table 6-6: Statistical Model Analysis with Train/Valid Ratio = 7:3 on Shenzhen Hospital Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 7:3
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.911 0.832 0.908 0.868 0.914
VGG19 0.933 0.811 0.928 0.865 0.932
InceptionV3 0.889 0.853 0.890 0.871 0.915
ResNet34 0.878 0.884 0.884 0.884 0.936
ResNet50 0.911 0.832 0.908 0.868 0.922
ResNet101 0.900 0.863 0.901 0.882 0.923
Ensemble-Linear 0.911 0.874 0.912 0.892 0.940
Ensemble-Voting 0.911 0.874 0.912 0.892 0.936
Modified
Model
VGG16 0.956 0.842 0.952 0.894 0.944
VGG19 0.900 0.937 0.908 0.922 0.976
Inception V3 0.933 0.895 0.934 0.914 0.961
ResNet34 0.889 0.947 0.900 0.923 0.969
ResNet50 0.900 0.926 0.907 0.917 0.974
ResNet101 0.967 0.874 0.965 0.917 0.965
Ensemble-Linear 0.922 0.926 0.926 0.926 0.985
Ensemble-Voting 0.944 0.926 0.946 0.936 0.978
Modified
Model Fine-
tuned by
ABC
VGG16 0.911 0.937 0.918 0.927 0.975
VGG19 0.933 0.937 0.937 0.937 0.973
Inception V3 0.900 0.937 0.908 0.922 0.963
ResNet34 0.944 0.905 0.945 0.925 0.974
ResNet50 0.944 0.937 0.947 0.942 0.964
ResNet101 0.911 0.947 0.918 0.933 0.979
Ensemble-Linear 0.956 0.937 0.957 0.947 0.986
Ensemble-Voting 0.956 0.937 0.957 0.947 0.986
Table 6-7: Statistical Model Analysis with Train/Valid Ratio = 8:2 on Shenzhen Hospital Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 8:2
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.900 0.886 0.912 0.899 0.965
VGG19 0.967 0.829 0.967 0.892 0.938
InceptionV3 0.917 0.900 0.926 0.913 0.951
ResNet34 0.900 0.929 0.915 0.922 0.942
ResNet50 0.900 0.914 0.914 0.914 0.965
ResNet101 0.900 0.929 0.915 0.922 0.960
Ensemble-Linear 0.933 0.929 0.942 0.935 0.975
Ensemble-Voting 0.933 0.929 0.942 0.935 0.974
Modified
Model
VGG16 0.933 0.914 0.941 0.928 0.975
VGG19 0.950 0.900 0.955 0.926 0.978
Inception V3 0.933 0.914 0.941 0.928 0.971
ResNet34 0.983 0.929 0.985 0.956 0.983
ResNet50 0.933 0.957 0.985 0.956 0.980
ResNet101 0.983 0.943 0.985 0.964 0.988
Ensemble-Linear 0.983 0.957 0.985 0.971 0.990
Ensemble-Voting 0.983 0.957 0.985 0.971 0.990
Modified
Model Fine-
tuned by
ABC
VGG16 0.933 0.914 0.941 0.928 0.964
VGG19 0.933 0.929 0.942 0.935 0.978
Inception V3 0.917 0.943 0.930 0.936 0.973
ResNet34 0.967 0.957 0.971 0.964 0.985
ResNet50 0.933 0.971 0.944 0.958 0.986
ResNet101 0.950 0.986 0.958 0.972 0.988
Ensemble-Linear 0.967 0.986 0.972 0.979 0.991
Ensemble-Voting 0.967 0.986 0.972 0.979 0.990
Table 6-8: Statistical Model Analysis with Train/Valid Ratio = 9:1 on Shenzhen Hospital Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 9:1
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.933 0.886 0.939 0.912 0.949
VGG19 0.967 0.857 0.968 0.909 0.945
InceptionV3 0.867 0.943 0.892 0.917 0.970
ResNet34 0.933 0.914 0.941 0.928 0.956
ResNet50 0.933 0.914 0.941 0.928 0.956
ResNet101 0.933 0.943 0.943 0.943 0.969
Ensemble-Linear 0.933 0.943 0.943 0.943 0.974
Ensemble-Voting 0.933 0.943 0.943 0.943 0.960
Modified
Model
VGG16 0.967 0.886 0.969 0.925 0.969
VGG19 0.933 0.943 0.943 0.943 0.982
Inception V3 0.900 0.971 0.919 0.944 0.980
ResNet34 0.967 0.943 0.971 0.957 0.980
ResNet50 0.933 0.971 0.944 0.958 0.979
ResNet101 0.967 0.971 0.971 0.971 0.982
Ensemble-Linear 0.967 0.971 0.971 0.971 0.984
Ensemble-Voting 0.967 0.971 0.971 0.971 0.984
Modified
Model Fine-
tuned by
ABC
VGG16 0.967 0.914 0.970 0.941 0.976
VGG19 0.967 0.943 0.971 0.957 0.976
Inception V3 0.900 0.971 0.919 0.944 0.980
ResNet34 0.933 1.000 0.946 0.972 0.990
ResNet50 1.000 0.943 1.000 0.971 0.991
ResNet101 1.000 0.971 1.000 0.986 0.999
Ensemble-Linear 1.000 0.971 1.000 0.986 0.994
Ensemble-Voting 1.000 0.971 1.000 0.986 0.990
The results in the above four tables show that the instability problem persists during the
classification of CXR images. The two ensemble models address it by providing stable performance
with the highest accuracy and evaluation metric values on the target dataset across all three
improvement steps and all train/valid ratios. The difference in performance between the
linear-average and voting-based ensemble models is very small, though in general the linear-average
ensemble performs slightly better than the voting-based one.
Table 6-9, Table 6-10, Table 6-11 and Table 6-12 present the validation and testing accuracy as
well as the other statistical measures for the different CNN models on the NIH Chest X-Ray8 Dataset
under different train/valid ratios.
Table 6-9: Ratio Validation and Testing Accuracy Results on NIH Chest X-Ray8 Dataset
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 80.31% 90.11% 83.09% 93.03% 88.16% 92.33%
VGG19 81.17% 90.76% 82.66% 92.37% 87.36% 91.72%
Inception V3 81.64% 91.94% 84.06% 91.50% 88.54% 93.86%
ResNet34 82.54% 92.46% 85.38% 94.77% 88.07% 93.68%
ResNet50 80.92% 92.37% 83.35% 91.24% 89.78% 92.98%
ResNet101 82.20% 92.64% 83.81% 89.32% 89.48% 94.60%
Ensemble-Linear 83.28% 94.25% 85.76% 94.82% 90.60% 95.73%
Ensemble-Voting 83.09% 94.38% 85.74% 94.90% 90.60% 95.89%
Modified
Model
VGG16 87.29% 96.69% 90.22% 97.69% 93.69% 97.91%
VGG19 87.68% 97.69% 90.38% 97.30% 93.22% 98.00%
Inception V3 87.99% 96.51% 91.09% 98.82% 93.60% 97.65%
ResNet34 88.19% 97.21% 91.31% 97.21% 93.86% 97.82%
ResNet50 88.03% 97.86% 90.53% 97.86% 94.42% 98.17%
ResNet101 88.44% 97.99% 90.87% 97.78% 94.06% 98.61%
Ensemble-Linear 89.19% 98.87% 91.87% 99.08% 95.07% 99.35%
Ensemble-Voting 89.11% 98.78% 91.79% 98.95% 94.96% 99.30%
Modified
Model
Fine-tuned
by ABC
VGG16 87.97% 97.60% 91.15% 98.56% 93.82% 98.00%
VGG19 88.06% 96.73% 90.86% 98.39% 94.16% 98.04%
Inception V3 88.74% 98.21% 91.49% 97.91% 94.48% 97.65%
ResNet34 88.42% 97.91% 91.40% 97.08% 94.81% 98.26%
ResNet50 88.84% 98.26% 90.96% 98.61% 94.61% 98.87%
ResNet101 88.69% 97.39% 90.97% 98.78% 94.12% 96.82%
Ensemble-Linear 89.56% 98.78% 92.07% 99.13% 95.49% 99.43%
Ensemble-Voting 89.44% 98.87% 91.93% 99.22% 95.43% 99.52%
Table 6-10: Statistical Model Analysis with Train/Valid Ratio = 7:3 on Chest X-Ray8 Dataset
CNN
Model
Train/Valid Ratio = 7:3
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.911 0.635 0.821 0.716 0.848
VGG19 0.892 0.687 0.803 0.741 0.855
InceptionV3 0.906 0.677 0.822 0.743 0.862
ResNet34 0.929 0.665 0.857 0.749 0.871
ResNet50 0.929 0.623 0.849 0.719 0.855
ResNet101 0.932 0.651 0.859 0.741 0.869
Ensemble-Linear 0.940 0.666 0.877 0.757 0.882
Ensemble-Voting 0.940 0.662 0.875 0.754 0.878
Modified
Model
VGG16 0.953 0.749 0.910 0.822 0.912
VGG19 0.970 0.732 0.939 0.823 0.914
Inception V3 0.939 0.787 0.893 0.837 0.923
ResNet34 0.956 0.766 0.918 0.835 0.923
ResNet50 0.955 0.765 0.915 0.833 0.923
ResNet101 0.958 0.770 0.922 0.839 0.924
Ensemble-Linear 0.969 0.771 0.942 0.848 0.931
Ensemble-Voting 0.969 0.769 0.942 0.847 0.930
Modified
Model Fine-
tuned by
ABC
VGG16 0.950 0.770 0.908 0.834 0.920
VGG19 0.939 0.789 0.893 0.838 0.924
Inception V3 0.964 0.769 0.931 0.842 0.924
ResNet34 0.962 0.763 0.928 0.873 0.927
ResNet50 0.961 0.776 0.927 0.845 0.928
ResNet101 0.948 0.792 0.907 0.846 0.929
Ensemble-Linear 0.967 0.785 0.938 0.855 0.934
Ensemble-Voting 0.966 0.783 0.937 0.853 0.933
Table 6-11: Statistical Model Analysis with Train/Valid Ratio = 8:2 on NIH Chest X-Ray8 Dataset
CNN
Model
Train/Valid Ratio = 8:2
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.935 0.669 0.868 0.756 0.884
VGG19 0.941 0.649 0.875 0.745 0.875
InceptionV3 0.909 0.734 0.838 0.783 0.892
ResNet34 0.952 0.700 0.904 0.789 0.899
ResNet50 0.898 0.733 0.822 0.775 0.884
ResNet101 0.894 0.752 0.819 0.784 0.891
Ensemble-Linear 0.947 0.718 0.897 0.798 0.909
Ensemble-Voting 0.946 0.718 0.896 0.797 0.905
Modified
Model
VGG16 0.962 0.809 0.932 0.866 0.942
VGG19 0.945 0.839 0.908 0.872 0.944
Inception V3 0.969 0.820 0.944 0.878 0.951
ResNet34 0.944 0.865 0.908 0.886 0.954
ResNet50 0.958 0.824 0.926 0.872 0.945
ResNet101 0.970 0.816 0.945 0.876 0.948
Ensemble-Linear 0.966 0.845 0.941 0.890 0.955
Ensemble-Voting 0.969 0.837 0.946 0.888 0.954
Modified
Model Fine-
tuned by
ABC
VGG16 0.971 0.819 0.948 0.879 0.951
VGG19 0.950 0.844 0.916 0.878 0.950
Inception V3 0.955 0.852 0.924 0.887 0.952
ResNet34 0.948 0.860 0.914 0.887 0.955
ResNet50 0.963 0.827 0.934 0.877 0.948
ResNet101 0.970 0.816 0.945 0.876 0.948
Ensemble-Linear 0.970 0.844 0.947 0.893 0.958
Ensemble-Voting 0.969 0.841 0.946 0.891 0.957
Table 6-12: Statistical Model Analysis with Train/Valid Ratio = 9:1 on NIH Chest X-Ray8 Dataset
CNN
Model
Train/Valid Ratio = 9:1
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.935 0.799 0.887 0.841 0.929
VGG19 0.925 0.793 0.872 0.831 0.922
InceptionV3 0.949 0.787 0.908 0.843 0.932
ResNet34 0.936 0.794 0.889 0.839 0.931
ResNet50 0.935 0.840 0.893 0.865 0.945
ResNet101 0.962 0.789 0.931 0.854 0.936
Ensemble-Linear 0.960 0.822 0.929 0.873 0.947
Ensemble-Voting 0.962 0.818 0.933 0.872 0.946
Modified
Model
VGG16 0.975 0.877 0.958 0.916 0.965
VGG19 0.985 0.851 0.973 0.908 0.965
Inception V3 0.965 0.890 0.943 0.916 0.966
ResNet34 0.974 0.883 0.956 0.918 0.965
ResNet50 0.982 0.886 0.969 0.926 0.970
ResNet101 0.973 0.891 0.954 0.922 0.970
Ensemble-Linear 0.987 0.894 0.979 0.934 0.972
Ensemble-Voting 0.987 0.892 0.977 0.933 0.972
Modified
Model Fine-
tuned by
ABC
VGG16 0.976 0.879 0.960 0.918 0.965
VGG19 0.977 0.887 0.961 0.922 0.968
Inception V3 0.970 0.906 0.950 0.928 0.974
ResNet34 0.978 0.901 0.964 0.931 0.972
ResNet50 0.978 0.896 0.964 0.929 0.970
ResNet101 0.961 0.911 0.937 0.924 0.969
Ensemble-Linear 0.985 0.909 0.974 0.940 0.976
Ensemble-Voting 0.984 0.908 0.974 0.940 0.975
The statistical analysis in the above four tables shows that at each improvement step, the
differences in performance between the CNN models are smaller than on the first two datasets. This
is because the dataset used here is the largest publicly available CXR dataset to date; the much
larger number of images improves the quality of the training process and therefore has a positive
influence on model stability. The ensemble models further improve both stability and the overall
performance of lung abnormality detection. However, since the complexity of the diagnoses increases
with the amount of data, the prediction accuracy is lower than on the first two datasets. As
before, the linear-average ensemble model performs slightly better than the voting-based ensemble
model across the train/valid ratios.
6.5.2 TB Related Disease Diagnosis
Table 6-13 shows the validation and testing accuracy of the different CNN models on the NIH Chest
X-Ray8 Dataset under different train/valid ratios for the diagnosis of specific TB manifestations
among seven TB-related diseases.
Table 6-13: Ratio Validation Accuracy and Testing Results on NIH Chest X-Ray8 Dataset for Specific TB Related
Disease Diagnosis
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 51.29% 54.05% 54.46% 53.88% 56.80% 58.82%
VGG19 50.15% 54.00% 53.58% 52.71% 56.83% 57.41%
Inception V3 51.23% 48.24% 53.60% 52.71% 58.54% 57.65%
ResNet34 52.22% 53.41% 54.89% 56.47% 58.20% 59.53%
ResNet50 52.42% 48.00% 54.46% 48.00% 57.71% 59.06%
ResNet101 51.75% 53.18% 55.86% 54.35% 58.97% 61.18%
Ensemble-Linear 55.70% 54.59% 58.73% 57.65% 63.57% 62.35%
Ensemble-Voting 55.50% 55.76% 58.65% 56.94% 62.66% 63.29%
Modified
Model
VGG16 81.85% 81.65% 85.96% 88.94% 86.40% 88.71%
VGG19 81.97% 77.65% 86.12% 83.29% 86.46% 86.35%
Inception V3 81.45% 80.47% 84.97% 85.41% 87.40% 90.59%
ResNet34 82.42% 84.24% 85.44% 86.35% 86.43% 88.70%
ResNet50 81.10% 80.24% 84.01% 84.24% 85.74% 88.24%
ResNet101 82.70% 80.71% 83.87% 83.29% 84.94% 82.59%
Ensemble-Linear 88.74% 89.41% 90.19% 91.06% 91.11% 93.88%
Ensemble-Voting 88.53% 89.65% 90.11% 91.29% 90.94% 93.18%
Modified
Model
Fine-tuned
by ABC
VGG16 82.63% 83.06% 86.26% 87.29% 87.40% 87.53%
VGG19 82.47% 80.94% 86.68% 84.94% 86.94% 87.53%
Inception V3 82.81% 85.18% 85.62% 87.53% 88.86% 88.24%
ResNet34 83.62% 84.00% 86.07% 86.59% 86.54% 85.65%
ResNet50 82.66% 85.41% 84.11% 83.76% 86.43% 86.59%
ResNet101 83.94% 84.00% 84.54% 78.59% 85.00% 84.71%
Ensemble-Linear 89.27% 90.12% 90.54% 92.47% 91.29% 95.06%
Ensemble-Voting 89.01% 90.59% 90.44% 91.76% 91.14% 94.82%
Table 6-14, Table 6-15 and Table 6-16 present the AUC scores for the detection of each disease
across all CNN models with train/valid ratios of 7:3, 8:2 and 9:1 respectively.
Table 6-14: AUC Scores with Train/Valid Ratio = 7:3 on NIH Chest X-Ray8 Dataset
AUC Score
CNN Models
Train/Valid Ratio = 7:3
Consolidation Effusion Fibrosis Infiltration Mass Nodule Pleural
Thickening
Raw
Model
VGG16 0.866 0.899 0.839 0.900 0.832 0.801 0.810
VGG19 0.851 0.892 0.865 0.829 0.892 0.824 0.834
Inception V3 0.874 0.903 0.834 0.897 0.790 0.816 0.810
ResNet34 0.877 0.905 0.841 0.901 0.834 0.813 0.824
ResNet50 0.887 0.905 0.837 0.896 0.823 0.801 0.823
ResNet101 0.883 0.899 0.838 0.891 0.805 0.811 0.811
Ensemble-Linear 0.902 0.921 0.863 0.915 0.856 0.833 0.838
Ensemble-Voting 0.899 0.920 0.857 0.909 0.848 0.830 0.836
Modified
Model
VGG16 0.958 0.981 0.965 0.975 0.979 0.975 0.961
VGG19 0.968 0.976 0.961 0.967 0.982 0.982 0.964
Inception V3 0.952 0.979 0.963 0.969 0.976 0.974 0.965
ResNet34 0.977 0.981 0.963 0.971 0.982 0.970 0.965
ResNet50 0.959 0.980 0.958 0.967 0.975 0.973 0.960
ResNet101 0.969 0.981 0.965 0.972 0.982 0.974 0.970
Ensemble-Linear 0.978 0.990 0.979 0.985 0.992 0.991 0.981
Ensemble-Voting 0.975 0.989 0.975 0.983 0.989 0.988 0.979
Modified
Model
Fine-
tuned by
ABC
VGG16 0.959 0.979 0.963 0.978 0.984 0.981 0.965
VGG19 0.962 0.980 0.964 0.977 0.983 0.982 0.965
Inception V3 0.950 0.982 0.968 0.973 0.982 0.979 0.962
ResNet34 0.967 0.980 0.971 0.975 0.986 0.979 0.966
ResNet50 0.964 0.981 0.968 0.975 0.980 0.978 0.969
ResNet101 0.954 0.982 0.967 0.976 0.985 0.981 0.972
Ensemble-Linear 0.975 0.990 0.979 0.988 0.993 0.993 0.982
Ensemble-Voting 0.971 0.989 0.977 0.987 0.991 0.990 0.980
Table 6-15: AUC Scores with Train/Valid Ratio = 8:2 on NIH Chest X-Ray8 Dataset
AUC Score
CNN Models
Train/Valid Ratio = 8:2
Consolidation Effusion Fibrosis Infiltration Mass Nodule Pleural
Thickening
Raw
Model
VGG16 0.912 0.916 0.852 0.897 0.853 0.804 0.844
VGG19 0.886 0.913 0.846 0.904 0.859 0.803 0.838
Inception V3 0.904 0.907 0.843 0.913 0.827 0.812 0.837
ResNet34 0.901 0.914 0.864 0.900 0.856 0.813 0.848
ResNet50 0.913 0.905 0.851 0.898 0.839 0.815 0.856
ResNet101 0.914 0.923 0.862 0.907 0.853 0.824 0.855
Ensemble-Linear 0.932 0.933 0.876 0.920 0.885 0.839 0.873
Ensemble-Voting 0.929 0.931 0.874 0.917 0.883 0.835 0.872
Modified
Model
VGG16 0.962 0.986 0.969 0.980 0.987 0.987 0.975
VGG19 0.970 0.985 0.976 0.980 0.986 0.988 0.966
Inception V3 0.967 0.983 0.968 0.979 0.977 0.985 0.972
ResNet34 0.962 0.984 0.976 0.979 0.982 0.986 0.971
ResNet50 0.963 0.977 0.967 0.979 0.979 0.984 0.972
ResNet101 0.967 0.982 0.971 0.978 0.980 0.985 0.968
Ensemble-Linear 0.978 0.991 0.983 0.990 0.993 0.995 0.982
Ensemble-Voting 0.971 0.990 0.981 0.989 0.989 0.993 0.981
Modified
Model
Fine-
tuned by
ABC
VGG16 0.957 0.987 0.978 0.981 0.987 0.989 0.973
VGG19 0.973 0.986 0.978 0.978 0.987 0.990 0.972
Inception V3 0.970 0.986 0.978 0.979 0.984 0.988 0.972
ResNet34 0.962 0.987 0.971 0.983 0.988 0.988 0.974
ResNet50 0.946 0.981 0.968 0.977 0.958 0.974 0.971
ResNet101 0.970 0.984 0.976 0.979 0.985 0.987 0.977
Ensemble-Linear 0.979 0.993 0.984 0.990 0.993 0.996 0.985
Ensemble-Voting 0.974 0.991 0.981 0.989 0.990 0.994 0.982
Table 6-16: AUC Scores with Train/Valid Ratio = 9:1 on NIH Chest X-Ray8 Dataset
AUC Score
CNN Models
Train/Valid Ratio = 9:1
Consolidation Effusion Fibrosis Infiltration Mass Nodule Pleural
Thickening
Raw
Model
VGG16 0.935 0.927 0.855 0.903 0.863 0.831 0.869
VGG19 0.921 0.907 0.865 0.918 0.869 0.824 0.868
Inception V3 0.936 0.917 0.870 0.921 0.859 0.841 0.864
ResNet34 0.935 0.920 0.881 0.918 0.862 0.840 0.875
ResNet50 0.922 0.917 0.865 0.911 0.873 0.839 0.867
ResNet101 0.932 0.921 0.882 0.916 0.880 0.832 0.891
Ensemble-Linear 0.956 0.938 0.900 0.931 0.907 0.866 0.901
Ensemble-Voting 0.952 0.935 0.899 0.929 0.899 0.860 0.898
Modified
Model
VGG16 0.968 0.987 0.971 0.987 0.984 0.989 0.972
VGG19 0.966 0.984 0.972 0.978 0.985 0.989 0.966
Inception V3 0.960 0.982 0.971 0.986 0.984 0.984 0.963
ResNet34 0.960 0.986 0.973 0.985 0.991 0.984 0.966
ResNet50 0.974 0.985 0.974 0.981 0.986 0.986 0.964
ResNet101 0.968 0.985 0.967 0.977 0.989 0.989 0.967
Ensemble-Linear 0.976 0.992 0.983 0.992 0.994 0.996 0.978
Ensemble-Voting 0.973 0.990 0.982 0.991 0.990 0.994 0.972
Modified
Model
Fine-
tuned by
ABC
VGG16 0.960 0.985 0.972 0.984 0.988 0.990 0.963
VGG19 0.945 0.984 0.970 0.987 0.987 0.988 0.974
Inception V3 0.976 0.987 0.970 0.988 0.989 0.992 0.969
ResNet34 0.964 0.987 0.970 0.983 0.985 0.989 0.969
ResNet50 0.963 0.984 0.970 0.983 0.983 0.985 0.974
ResNet101 0.948 0.982 0.966 0.975 0.982 0.983 0.967
Ensemble-Linear 0.973 0.991 0.985 0.994 0.992 0.996 0.979
Ensemble-Voting 0.969 0.990 0.982 0.993 0.990 0.993 0.974
The accuracy results in Table 6-13 indicate that the accuracy achieved by each model for diagnosing
specific TB manifestations among the seven TB-related diseases increases continuously through the
improvement steps. Within each step, the two proposed ensemble models provide stable performance on
both the validation and testing sets, with higher accuracy than any of the base CNN models. The
accuracies achieved by the ensemble models are significantly higher than those of the single CNN
models under all train/valid ratios. The highest accuracies, 91.29% on the validation set and
95.06% on the testing set, are achieved by the linear-average ensemble model with a train/valid
ratio of 9:1.
The AUC scores in Table 6-14, Table 6-15 and Table 6-16 show that among the seven TB-related
diseases, consolidation, infiltration and pleural thickening are less likely to be correctly
detected by the CNN models than the other diseases. The ensemble models still provide stable
performance and a better probability of ranking the correct disease above the others. The
linear-average ensemble model again performs slightly better than the voting-based one.
6.6 Conclusion
In this chapter, the concept and working mechanism of the ensemble model were presented and
implemented to address the instability of individual CNN models during TB detection. In the
experiments, both linear-average and voting-based ensemble models were employed at each improvement
step and compared with the individual models, for both the binary classification of abnormality
detection and the multi-class classification of specific TB diseases. Evaluation metrics were used
to measure the overall performance of each model on the given task.
The experiments show that the ensemble models not only improve detection accuracy but also provide
consistent, stable performance on both the validation and testing sets under all conditions.
Chapter 7
Disease Localization via Class Activation Mapping
The statistical results of the experiments in Chapters 5 and 6 show that the CNN architectures
perform well, with high accuracy, in both TB abnormality detection and the diagnosis of specific TB
manifestations.
However, a CNN is a “black box” model: its opaqueness makes its results hard to interpret, which
greatly limits the application of CNNs to medical image detection. When a model is deployed as a
computer-aided detection system, doctors and radiologists focus not only on the predicted result
but also on which part of the input led the model to its judgment, so that they can understand the
result from the model’s point of view and reach a more accurate conclusion.
Therefore, to ensure the reliability of the results and to better resemble human decision-making,
this chapter discusses and implements Class Activation Mapping (CAM), which reveals the features
extracted by a CNN in an interpretable form. The method is used mainly to visualize the
localization of TB manifestations; results from the different CNN models (VGG16, VGG19, Inception
V3, ResNet34, ResNet50, ResNet101 and the ensemble CNN models) are displayed on CXRs from the NIH
Chest X-Ray8 dataset.
7.1 Class Activation Mapping
The concept of class activation mapping was proposed by Zhou et al. in [71]. The method exploits
the remarkable pattern-recognition and localization ability of CNN models to expose their implicit
attention on the target image. Moreover, with simple processing of a CNN’s internal parameters, it
integrates two different functions, image classification and object localization, in the same
model. The generated attention map identifies the regions of the input image that were the model’s
main criteria during the classification process.
In a CNN model that contains a global average pooling layer, the feature maps produced by the last
convolutional layer are reduced by global average pooling to a vector. This vector is then combined
with the weights of the fully connected layer in a weighted summation to produce the output used
for classification. The weights of the last layer before the output can therefore be projected back
onto the feature maps, through the pooling layer, to identify the areas that the model treated as
containing the important information.
Figure 7-1 illustrates the working mechanism of class activation mapping: the weighted sum of the
final-layer weights and their corresponding feature maps generates the attention map of the input
image.
Figure 7-1: Working mechanism of class activation mapping
For a given input, let \(f_k(x, y)\) denote the activation of the \(k\)-th feature map of the last
convolutional layer at spatial position \((x, y)\). After global average pooling, each feature map
is reduced to:
\[ F_k = \sum_{x,y} f_k(x, y) \]
If the input image belongs to class \(c\), the score that the layer before the softmax assigns to
class \(c\) is:
\[ S_c = \sum_k w_k^c F_k = \sum_k w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y} \sum_k w_k^c f_k(x, y) \]
where \(w_k^c\) is the weight of the \(k\)-th feature map for class \(c\). Since the bias has
little influence on the generation of the attention map, it is ignored in the calculation.
Therefore, from the above calculations, the pixel values of the attention map for class c can be denoted as:

M_c(x, y) = Σ_k w_k^c f_k(x, y)
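This map can be sketched as a tensor contraction over the feature-map axis; the shapes and values below are illustrative, and the final assertion checks the identity S_c = Σ_{x,y} M_c(x, y) implied by the derivation above:

```python
import numpy as np

# Hypothetical activations f_k(x, y): K = 8 feature maps on a 7x7 grid,
# and FC weights w_k^c for one class c (all values illustrative).
f = np.random.rand(7, 7, 8)      # f[x, y, k]
w_c = np.random.rand(8)          # w_k^c

# Attention map: M_c(x, y) = sum_k w_k^c * f_k(x, y)
M_c = np.tensordot(f, w_c, axes=([2], [0]))   # shape: (7, 7)

# Consistency check with the class score S_c = sum_k w_k^c * F_k,
# where F_k = sum_{x,y} f_k(x, y): summing the CAM recovers S_c.
F = f.sum(axis=(0, 1))
S_c = float(w_c @ F)
assert np.isclose(M_c.sum(), S_c)
```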
By upsampling the attention map to the same size as the input image and overlaying one on the other, we obtain a visualization of how much each region of the input contributes to the classification result. This localization of detected objects makes the predictions of the "black box" model more interpretable and helps researchers understand the classification process.
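A minimal sketch of this zoom-and-overlay step, assuming an illustrative 7x7 map and a 224x224 grayscale image, and using nearest-neighbour upsampling in place of the smoother interpolation a real pipeline would typically use:

```python
import numpy as np

# Hypothetical 7x7 attention map and a 224x224 grayscale CXR (illustrative).
cam = np.random.rand(7, 7)
image = np.random.rand(224, 224)

# Nearest-neighbour upsampling to the input size (224 = 7 * 32); a real
# pipeline would typically use bilinear interpolation instead.
scale = image.shape[0] // cam.shape[0]
cam_up = np.kron(cam, np.ones((scale, scale)))        # shape: (224, 224)

# Normalize to [0, 1] and alpha-blend over the image to form the overlay.
cam_norm = (cam_up - cam_up.min()) / (cam_up.max() - cam_up.min())
overlay = 0.6 * image + 0.4 * cam_norm
```

Rendering `overlay` (or `cam_norm` with a heat colormap on top of the image) produces the heat maps shown in the figures below.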
7.2 Experiment Setup
In our experiment, class activation mapping is combined with multi-class classification to localize specific TB manifestations. For each TB-related manifestation (consolidation, effusion, fibrosis, infiltration, mass, nodule and pleural thickening), we run the prediction together with class activation mapping on the six trained CNN models and the two ensemble models introduced in Chapter 6, allowing a parallel comparison of their overall performance on disease prediction and localization.
7.3 Results and Analysis
Figures 7-2 to 7-8 present the prediction and localization results on various test cases of TB manifestations given by the CNN models trained on the NIH ChestX-Ray8 dataset.
Figure 7-2: Diagnosis and localization of consolidation (input image with true label Consolidation; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-3: Diagnosis and localization of effusion (input image with true label Effusion; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-4: Diagnosis and localization of fibrosis (input image with true label Fibrosis; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-5: Diagnosis and localization of infiltration (input image with true label Infiltration; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-6: Diagnosis and localization of mass (input image with true label Mass; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-7: Diagnosis and localization of nodule (input image with true label Nodule; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-8: Diagnosis and localization of pleural thickening (input image with true label Pleural Thickening; disease prediction and localization by the single CNN models and by the ensembled CNN models)
From the test cases shown above, comparing the six individual CNN models and their ensemble models on the diagnosis and localization of TB manifestations, we can observe the following:
1. Not all single models provide the correct diagnosis for an input CXR image. For example, when predicting infiltration, two CNN models, VGG16 and ResNet101, both wrongly predict fibrosis. This makes it difficult to reach a diagnostic decision by relying on an individual model, and the choice of which CNN model to use during disease detection also becomes a problem. With the employment of the ensemble models, however, the wrong predictions of individual models are balanced out, so the overall detection accuracy is improved.
2. The localization of disease manifestations by an individual CNN model can be unstable and inaccurate even when the disease diagnosis itself is correct. During the localization of effusion, the manifestation is expected at the bilateral lung tips; ResNet34, however, produces an inaccurate heat map that covers almost the whole lung field of the patient while missing the lower left lung tip. Similar problems occur with ResNet101 on the localization of fibrosis, and with InceptionV3 and ResNet34 on the nodule case shown above. Such partial coverage or omission of the diseased lung regions on a CXR image can greatly confuse users of the computer-aided detection system, and the problem of selecting among different CNN models remains. Ensemble models solve these problems by considering the results of all CNN models and properly integrating them. The disease locations provided by the ensemble models are therefore far more accurate and cover almost all the suspected abnormal regions related to the different TB manifestations.
3. Even when some single CNN models produce mis-predictions, the ensemble models maintain a stable, highly accurate performance on both the diagnosis and the localization of the different TB manifestations.
4. The diagnosis and localization results of the different TB manifestations provided by the two ensemble models, based on the linear average and on the voting mechanism respectively, are identical.
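The two ensemble rules compared here can be sketched as follows, with hypothetical per-class probabilities standing in for the six models' actual outputs:

```python
import numpy as np

# Hypothetical per-class probabilities from six models for one CXR
# (rows: models, columns: 7 TB manifestations; values illustrative).
probs = np.random.rand(6, 7)
probs /= probs.sum(axis=1, keepdims=True)

# Linear-average ensemble: mean the probabilities, then take the argmax.
avg_pred = int(probs.mean(axis=0).argmax())

# Voting ensemble: each model votes for its own argmax class; the class
# with the most votes wins (ties broken by the lowest class index here).
votes = probs.argmax(axis=1)
vote_pred = int(np.bincount(votes, minlength=7).argmax())

print(avg_pred, vote_pred)   # the two rules often, though not always, agree
```

The agreement observed in the experiments is a property of these particular models and test cases; in general the two rules can disagree when the models' confidences are spread unevenly.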
7.4 Conclusion
In this section, we compared the disease localization performance of six single CNN models and two ensemble models on CXR images of seven different TB manifestations.

During the experiment, the predicted results are given with a confidence measure and compared with the true labels. The quality of the class activation mapping during the localization of TB manifestations is evaluated by visually comparing the high-energy regions displayed in the heat map with our prior knowledge of the disease.

The results show a great improvement in the overall performance of the two ensemble models proposed in Chapter 6 compared with the six individual models, for both the diagnosis and the localization of the seven TB manifestations. Moreover, the implementation of class activation mapping provides an effective way of visualizing the location of the suspected abnormality on the CXR and therefore reduces the complexity of visually understanding the CNN.
Chapter 8
Conclusions and Future Work
8.1 Conclusions
The main objective of our study was to create a computer-aided detection system for medical purposes using CNN models. We explored different deep CNN models (VGGNet, the GoogLeNet Inception models and ResNet), which vary in module structure as well as in number of layers. We presented a unified modification to the structure of the last few layers before the output and added an extra fine-tuning step to the training process. Ensemble models based on these improved CNN models were then implemented to further improve the diagnostic accuracy and the overall stability of the computer-aided detection system.
Accuracy, specificity, recall, precision, F1-score and AUC are used to measure the abnormality detection performance of the six deep CNN models and the two ensemble models on CXR images. For the identification of specific TB manifestations, accuracy and AUC are presented and compared. Finally, class activation mapping is implemented on the CNN models to visualize the suspected disease locations on the CXRs.
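For the binary abnormality detection task, most of these metrics follow directly from the confusion-matrix counts; the labels and predictions below are illustrative only (AUC, which requires ranking the prediction scores, is omitted from this sketch):

```python
import numpy as np

# Illustrative binary labels (1 = abnormal) and predictions for 10 CXRs.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

# Confusion-matrix counts.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

accuracy    = (tp + tn) / len(y_true)
recall      = tp / (tp + fn)          # sensitivity
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
f1          = 2 * precision * recall / (precision + recall)
print(accuracy, recall, specificity, precision, f1)
```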
The experiments were run on three publicly available CXR datasets. The two small datasets are mainly used for abnormality detection, predicting whether the patient is TB positive from the input CXR. The largest dataset is used both for abnormality detection and for the diagnosis of specific diseases among the seven TB manifestations.
Our results show that as the improvement steps are superimposed, the overall performance of the CNN models keeps getting better. Among the three steps, the structure modifications produce the largest increase in prediction accuracy for the single CNN models, while the fine-tuning step yields a slight further improvement. Taking the ensemble of the individual CNN models improves the classification accuracy on CXRs further still. Moreover, while each individual model performs inconsistently across datasets and classification tasks, the ensemble models reach the highest classification accuracy and a greatly improved stability of performance. For the disease localization task as well, the ensemble models give the best results compared with the single CNN models. In conclusion, the combination of the three improvement steps greatly improves the overall performance of our proposed computer-aided detection system for TB diagnosis and localization.
The contributions of the thesis can be summarized as follows:
1) We selected three CXR datasets of various sizes from various sources to mitigate the potential problems of a single dataset, such as unrepresentative data or data limited to abnormality detection tasks only.
2) A unified standard data preprocessing procedure was adopted, involving the removal of poor-quality images, cropping of images with large black backgrounds, image enhancement using the CLAHE algorithm, and image resizing, thereby improving image quality and eliminating unnecessary errors caused by variation in image quality.
3) Unified modifications to the CNN models were made and different learning rates were applied to different layer groups inside each model, which provided improvements on all 3 datasets for both lung abnormality detection and the detection of TB-related manifestations in the NIH chest X-ray dataset.
4) To maximize the performance of the CNN models on each diagnostic task, Artificial Bee
Colony (ABC) optimization was implemented as an additional fine-tuning step.
5) Linear average based and voting based ensemble learning methods were used to combine the
results from each model into an aggregated result to prevent the overfitting problem and to
improve the stability of the models’ performance.
6) A class activation mapping algorithm using the CNN's built-in attention mechanism was implemented to localize the suspected area of the detected disease on the CXR, for better interpretation of the diagnostic results. The visualization of the suspected disease area can help clinicians confirm the disease and catch information that unsuspecting eyes might miss.
7) The proposed system achieves an accuracy of 100% and an AUC of 1.0 on the Montgomery CXR dataset for the abnormality detection task with all 3 training/validation ratios, which is the best performance compared with the similar work on these datasets reviewed in Chapter 2. For the abnormality detection task on the Shenzhen Hospital dataset, an overall accuracy of over 94% and an average AUC of over 0.99 were achieved with all 3 training/validation ratios, again the best performance among the experiments reported by other researchers. On the NIH Chest X-Ray dataset, the abnormality detection accuracy ranges from 89.56% to 95.49% for the training/validation ratios 7:3, 8:2 and 9:1, with an average AUC of over 0.93. For the detection of the 7 TB-related lung diseases on the same dataset, an overall accuracy of 90% was achieved with all 3 training/validation ratios, and an AUC of 0.97 was achieved for each TB-related lung disease. This performance is so far the best compared with similar work done either on the same dataset or on other large CXR datasets.
8.2 Future work
Some aspects were not covered, owing to lack of time and of support from experts in the medical field. During the fine-tuning process, the computational complexity and the computation time were not measured, and little was done to decrease the computing time of the models' training process. In addition, for disease localization, we did not compare class activation mapping in CNN models against other object detection methods such as single-shot detection [72] and Faster R-CNN [73], since the latter algorithms require the specific disease manifestation regions on the CXR images to be annotated by radiologists.
References
[1] World Health Organization, 2018. Global tuberculosis report 2018. Geneva: World Health
Organization.
[2] Adler D, Richards WF. Consolidation in primary pulmonary tuberculosis. Thorax. 1953
Sep;8(3):223.
[3] Vorster MJ, Allwood BW, Diacon AH, Koegelenberg CF. Tuberculous pleural effusions:
advances and controversies. Journal of thoracic disease. 2015 Jun;7(6):981.
[4] Chung MJ, Goo JM, Im JG. Pulmonary tuberculosis in patients with idiopathic pulmonary
fibrosis. European journal of radiology. 2004 Nov 1;52(2):175-9.
[5] Mishin V, Nazarova NV, Kononen AS, Miakishev TV, Sadovski AI. Infiltrative pulmonary
tuberculosis: course and efficiency of treatment. Problemy tuberkuleza i boleznei legkikh.
2006(10):7-12.
[6] Cherian MJ, Dahniya MH, Al‐Marzouk NF, Abel A, Bader S, Buerki K, Mahdi OZ.
Pulmonary tuberculosis presenting as mass lesions and simulating neoplasms in adults.
Australasian radiology. 1998 Nov;42(4):303-8.
[7] Kant S, Kushwaha R, Verma SK. Bilateral nodular pulmonary tuberculosis simulating
metastatic lung cancer. The Internet Journal of Pulmonary Medicine. 2007;8.
[8] Gil V, Soler JJ, Cordero PJ. Pleural Thickening in Patients With Pleural Tuberculosis.
Chest. 1994 Apr 1;105(4):1296.
[9] Lowe, D.G., 1999, September. Object recognition from local scale-invariant features.
In Proceedings of the International Conference on Computer Vision (ICCV) (Vol. 2, pp. 1150-1157).
[10] Ojala, T., Pietikäinen, M. and Mäenpää, T., 2002. Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. IEEE Transactions on Pattern
Analysis & Machine Intelligence, (7), pp.971-987.
[11] Basavaprasad, B. and Ravi, M., 2014. A study on the importance of image processing and
its applications. IJRET: International Journal of Research in Engineering and
Technology, 3.
[12] Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and
problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based
Systems. 1998 Apr;6(02):107-16.
[13] Pascanu R, Mikolov T, Bengio Y. On the difficulty of training recurrent neural networks.
In International Conference on Machine Learning 2013 Feb 13 (pp. 1310-1318).
[14] Huang Z, Pan Z, Lei B. Transfer learning with deep convolutional neural network for SAR
target classification with limited labeled data. Remote Sensing. 2017 Aug 31;9(9):907.
[15] Pan SJ, Yang Q. A survey on transfer learning. IEEE Transactions on knowledge and data
engineering. 2010 Oct 1;22(10):1345-59.
[16] Khuzi, A.M., Besar, R., Zaki, W.W. and Ahmad, N.N., 2009. Identification of masses in
digital mammogram using gray level co-occurrence matrices. Biomedical imaging and
intervention journal, 5(3).
[17] Carrillo-de-Gea, J.M., García-Mateos, G., Fernández-Alemán, J.L. and Hernández-
Hernández, J.L., 2016. A computer-aided detection system for digital chest
radiographs. Journal of healthcare engineering, 2016.
[18] Yang, M.C., Moon, W.K., Wang, Y.C.F., Bae, M.S., Huang, C.S., Chen, J.H. and Chang,
R.F., 2013. Robust texture analysis using multi-resolution gray-scale invariant features for
breast sonographic tumor diagnosis. IEEE Transactions on Medical Imaging, 32(12),
pp.2262-2273.
[19] Sarraf, S. and Tofighi, G., 2016. DeepAD: Alzheimer's disease classification via deep
convolutional neural networks using MRI and fMRI. BioRxiv, p.070441.
[20] Huynh, B.Q., Li, H. and Giger, M.L., 2016. Digital mammographic tumor classification
using transfer learning from deep convolutional neural networks. Journal of Medical
Imaging, 3(3), p.034501.
[21] Zou, L., Zheng, J., Miao, C., Mckeown, M.J. and Wang, Z.J., 2017. 3D CNN based
automatic diagnosis of attention deficit hyperactivity disorder using functional and
structural MRI. IEEE Access, 5, pp.23626-23636.
[22] Anthimopoulos, M., Christodoulidis, S., Ebner, L., Christe, A. and Mougiakakou, S., 2016.
Lung pattern classification for interstitial lung diseases using a deep convolutional neural
network. IEEE transactions on medical imaging, 35(5), pp.1207-1216.
[23] Kim, G.B., Jung, K.H., Lee, Y., Kim, H.J., Kim, N., Jun, S., Seo, J.B. and Lynch, D.A.,
2018. Comparison of shallow and deep learning methods on classifying the regional pattern
of diffuse lung disease. Journal of digital imaging, 31(4), pp.415-424.
[24] Jaiswal, A.K., Tiwari, P., Kumar, S., Gupta, D., Khanna, A. and Rodrigues, J.J., 2019.
Identifying Pneumonia in Chest X-Rays: A Deep Learning Approach. Measurement.
[25] Lakhani, P. and Sundaram, B., 2017. Deep learning at chest radiography: automated
classification of pulmonary tuberculosis by using convolutional neural
networks. Radiology, 284(2), pp.574-582.
[26] Pasa, F., Golkov, V., Pfeiffer, F., Cremers, D. and Pfeiffer, D., 2019. Efficient Deep
Network Architectures for Fast Chest X-Ray Tuberculosis Screening and
Visualization. Scientific reports, 9(1), p.6268.
[27] Hwang, S., Kim, H.E., Jeong, J. and Kim, H.J., 2016, March. A novel approach for
tuberculosis screening based on deep convolutional neural networks. In Medical imaging
2016: computer-aided diagnosis (Vol. 9785, p. 97852W). International Society for Optics
and Photonics.
[28] Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A.,
Langlotz, C., Shpanskaya, K. and Lungren, M.P., 2017. Chexnet: Radiologist-level
pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225.
[29] Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D. and
Summers, R.M., 2016. Deep convolutional neural networks for computer-aided detection:
CNN architectures, dataset characteristics and transfer learning. IEEE transactions on
medical imaging, 35(5), pp.1285-1298.
[30] Flach, P.A., 2003. The geometry of ROC space: understanding machine learning metrics
through ROC isometrics. In Proceedings of the 20th international conference on machine
learning (ICML-03) (pp. 194-201).
[31] Stallkamp, J., Schlipsing, M., Salmen, J. and Igel, C., 2012. Man vs. computer:
Benchmarking machine learning algorithms for traffic sign recognition. Neural
networks, 32, pp.323-332.
[32] Vidyasaraswathi HN, Hanumantharaju MC. Review of Various Histogram Based Medical
Image Enhancement Techniques. In Proceedings of the 2015 International Conference on
Advanced Research in Computer Science Engineering & Technology (ICARCSET 2015)
2015 Mar 6 (p. 48). ACM.
[33] Rajaraman, S., Candemir, S., Xue, Z., Alderson, P.O., Kohli, M., Abuya, J., Thoma, G.R.
and Antani, S., 2018, July. A novel stacked generalization of models for improved TB
detection in chest radiographs. In 2018 40th Annual International Conference of the IEEE
Engineering in Medicine and Biology Society (EMBC) (pp. 718-721). IEEE.
[34] Rere, L.M., Fanany, M.I. and Arymurthy, A.M., 2016. Metaheuristic algorithms for
convolution neural network. Computational intelligence and neuroscience, 2016.
[35] Parmaksızoğlu, S. and Alçı, M., 2011. A novel cloning template designing method by using
an artificial bee colony algorithm for edge detection of cnn based imaging
sensors. Sensors, 11(5), pp.5337-5359.
[36] Khan, S., Khan, A., Maqsood, M., Aadil, F. and Ghazanfar, M.A., 2019. Optimized gabor
feature extraction for mass classification using cuckoo search for big data e-
healthcare. Journal of Grid Computing, 17(2), pp.239-254.
[37] Islam, M.T., Aowal, M.A., Minhaz, A.T. and Ashraf, K., 2017. Abnormality detection and
localization in chest x-rays using deep convolutional neural networks. arXiv preprint
arXiv:1705.09850.
[38] Kwaśniewska, A., Rumiński, J. and Rad, P., 2017, July. Deep features class activation map
for thermal face detection and tracking. In 2017 10th International Conference on Human
System Interactions (HSI) (pp. 41-47). IEEE.
[39] Jaeger, S., Candemir, S., Antani, S., Wáng, Y.X.J., Lu, P.X. and Thoma, G., 2014. Two
public chest X-ray datasets for computer-aided screening of pulmonary
diseases. Quantitative imaging in medicine and surgery, 4(6), p.475.
[40] Jaeger, S., Karargyris, A., Candemir, S., Folio, L., Siegelman, J., Callaghan, F., Xue, Z.,
Palaniappan, K., Singh, R.K., Antani, S. and Thoma, G., 2013. Automatic tuberculosis
screening using chest radiographs. IEEE transactions on medical imaging, 33(2), pp.233-
245.
[41] Candemir, S., Jaeger, S., Palaniappan, K., Musco, J.P., Singh, R.K., Xue, Z., Karargyris,
A., Antani, S., Thoma, G. and McDonald, C.J., 2013. Lung segmentation in chest
radiographs using anatomical atlases with nonrigid registration. IEEE transactions on
medical imaging, 33(2), pp.577-590.
[42] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M. and Summers, R.M., 2017. ChestX-Ray8:
Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification
and localization of common thorax diseases. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 2097-2106).
[43] Shin, H.C., Roberts, K., Lu, L., Demner-Fushman, D., Yao, J. and Summers, R.M., 2016.
Learning to read chest x-rays: Recurrent neural cascade model for automated image
annotation. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 2497-2506).
[44] Pizer, S.M., Amburn, E.P., Austin, J.D., Cromartie, R., Geselowitz, A., Greer, T., ter Haar
Romeny, B., Zimmerman, J.B. and Zuiderveld, K., 1987. Adaptive histogram equalization
and its variations. Computer vision, graphics, and image processing, 39(3), pp.355-368.
[45] Pisano, E.D., Zong, S., Hemminger, B.M., DeLuca, M., Johnston, R.E., Muller, K.,
Braeuning, M.P. and Pizer, S.M., 1998. Contrast limited adaptive histogram equalization
image processing to improve the detection of simulated spiculations in dense
mammograms. Journal of Digital imaging, 11(4), p.193.
[46] Hinton, G.E. and Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with
neural networks. science, 313(5786), pp.504-507.
[47] Bengio, Y., 2009. Learning deep architectures for AI. Foundations and trends® in Machine
Learning, 2(1), pp.1-127.
[48] Zhang, W., Tanida, J., Itoh, K. and Ichioka, Y., 1988. Shift-invariant pattern recognition
neural network and its optical architecture. In Proceedings of annual conference of the
Japan Society of Applied Physics.
[49] Fukushima, K., 1980. Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position. Biological
Cybernetics, 36(4), pp.193-202.
[50] Simonyan, K., Vedaldi, A. and Zisserman, A., 2013. Deep inside convolutional networks:
Visualising image classification models and saliency maps. arXiv preprint
arXiv:1312.6034.
[51] Zeiler, M.D. and Fergus, R., 2014, September. Visualizing and understanding
convolutional networks. In European conference on computer vision (pp. 818-833).
Springer, Cham.
[52] Zeiler, M.D. and Fergus, R., 2013. Stochastic pooling for regularization of deep
convolutional neural networks. arXiv preprint arXiv:1301.3557.
[53] Boureau, Y.L., Le Roux, N., Bach, F., Ponce, J. and LeCun, Y., 2011, November. Ask the
locals: multi-way local pooling for image recognition. In ICCV'11-The 13th International
Conference on Computer Vision.
[54] Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556.
[55] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V. and Rabinovich, A., 2015. Going deeper with convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition (pp. 1-9).
[56] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z., 2016. Rethinking the
inception architecture for computer vision. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 2818-2826).
[57] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp.
770-778).
[58] Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D. and
Summers, R.M., 2016. Deep convolutional neural networks for computer-aided detection:
CNN architectures, dataset characteristics and transfer learning. IEEE transactions on
medical imaging, 35(5), pp.1285-1298.
[59] Pan, S.J. and Yang, Q., 2009. A survey on transfer learning. IEEE Transactions on
knowledge and data engineering, 22(10), pp.1345-1359.
[60] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K. and Fei-Fei, L., 2009, June. Imagenet: A
large-scale hierarchical image database. In 2009 IEEE conference on computer vision and
pattern recognition (pp. 248-255).
[61] Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[62] Karaboga, D., 2005. An idea based on honey bee swarm for numerical optimization (Vol.
200). Technical report-tr06, Erciyes University, engineering faculty, computer engineering
department.
[63] Karaboga, D. and Basturk, B., 2007. A powerful and efficient algorithm for numerical
function optimization: artificial bee colony (ABC) algorithm. Journal of global
optimization, 39(3), pp.459-471.
[64] Bullinaria, J.A. and AlYahya, K., 2014. Artificial bee colony training of neural networks.
In Nature Inspired Cooperative Strategies for Optimization (NICSO 2013) (pp. 191-201).
Springer, Cham.
[65] Oza, N.C. and Tumer, K., 2008. Classifier ensembles: Select real-world
applications. Information Fusion, 9(1), pp.4-20.
[66] Breiman, L., 1996. Bagging predictors. Machine learning, 24(2), pp.123-140.
[67] Freund, Y. and Schapire, R.E., 1997. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of computer and system sciences, 55(1),
pp.119-139.
[68] Wolpert, D.H., 1992. Stacked generalization. Neural networks, 5(2), pp.241-259.
[69] Kuncheva, L.I. and Whitaker, C.J., 2003. Measures of diversity in classifier ensembles and
their relationship with the ensemble accuracy. Machine learning, 51(2), pp.181-207.
[70] Zhou, Z.H., Wu, J. and Tang, W., 2002. Ensembling neural networks: many could be better
than all. Artificial intelligence, 137(1-2), pp.239-263.
[71] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A., 2016. Learning deep
features for discriminative localization. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 2921-2929).
[72] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., 2016,
October. Ssd: Single shot multibox detector. In European conference on computer
vision (pp. 21-37). Springer, Cham.
[73] Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster r-cnn: Towards real-time object
detection with region proposal networks. In Advances in neural information processing
systems (pp. 91-99).
APPENDIX
Appendix A: Public Chest X-Ray Datasets
The Montgomery County Chest X-Ray dataset and the Shenzhen Hospital Chest X-Ray dataset are
available at: https://ceb.nlm.nih.gov/repositories/tuberculosis-chest-x-ray-image-data-sets/
The NIH ChestX-Ray8 dataset and its detailed annotations are available at:
https://www.kaggle.com/nih-chest-xrays/datasets
Appendix B: Thesis Source Code
The thesis source code and result displays are available at:
https://drive.google.com/open?id=1jrMz7nHhWlxZdsz4sWhWlbyIz3_s-ybF