Tuberculosis Detection and Localization from Chest X-Ray Images
using Deep Convolutional Neural Networks
By
Ruihua Guo
A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science (MSc) in Computational Sciences
The Faculty of Graduate Studies
Laurentian University
Sudbury, Ontario, Canada
© Ruihua Guo, 2019
THESIS DEFENCE COMMITTEE/COMITÉ DE SOUTENANCE DE THÈSE
Laurentian Université/Université Laurentienne
Faculty of Graduate Studies/Faculté des études supérieures
Title of Thesis
Titre de la thèse Tuberculosis Detection and Localization from Chest X-Ray Images using Deep
Convolutional Neural Networks
Name of Candidate
Nom du candidat Guo, Ruihua
Degree
Diplôme Master of Science
Department/Program
Département/Programme Computational Sciences
Date of Defence
Date de la soutenance October 02, 2019
APPROVED/APPROUVÉ
Thesis Examiners/Examinateurs de thèse:
Dr. Kalpdrum Passi
(Supervisor/Directeur(trice) de thèse)
Dr. Ratvinder Grewal
(Committee member/Membre du comité)
Dr. Meysar Zeinali
(Committee member/Membre du comité)
Approved for the Faculty of Graduate Studies
Approuvé pour la Faculté des études supérieures
Dr. David Lesbarrères
Monsieur David Lesbarrères
Dean, Faculty of Graduate Studies/Doyen, Faculté des études supérieures
Dr. Pradeep Atrey
(External Examiner/Examinateur externe)
ACCESSIBILITY CLAUSE AND PERMISSION TO USE
I, Ruihua Guo, hereby grant to Laurentian University and/or its agents the non-exclusive license to archive and make
accessible my thesis, dissertation, or project report in whole or in part in all forms of media, now or for the duration
of my copyright ownership. I retain all other ownership rights to the copyright of the thesis, dissertation or project
report. I also reserve the right to use in future works (such as articles or books) all or part of this thesis, dissertation,
or project report. I further agree that permission for copying of this thesis in any manner, in whole or in part, for
scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their
absence, by the Head of the Department in which my thesis work was done. It is understood that any copying or
publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written
permission. It is also understood that this copy is being made available in this form by the authority of the copyright
owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted
by the copyright laws without written authority from the copyright owner.
Abstract
Tuberculosis (TB) is a highly contagious lung disease that has remained a major cause of
death worldwide over the past few decades. As the most efficient and cost-effective imaging
method for medical purposes, chest X-Rays (CXRs) have been widely used as the preliminary tool
for diagnosing TB. The automatic detection of TB and the localization of suspected areas which
contain the disease manifestations with high accuracy will greatly improve the general quality of
the diagnosis processes. This thesis discusses and introduces some methods to improve the
accuracy and stability of different deep convolutional neural network (CNN) models (VGG16,
VGG19, Inception V3, ResNet34, ResNet50 and ResNet101) that are used for TB detection. The
proposed method includes three major processes: modifications on CNN model structures, model
fine-tuning via artificial bee colony algorithm, and the implementation of the ensemble CNN
model. Comparisons of the overall performance are made for all three stages among various CNN
models on three CXR datasets (Montgomery County Chest X-Ray dataset, Shenzhen Hospital
Chest X-Ray dataset and NIH Chest X-Ray8 dataset). The tested performance includes the
detection of abnormalities in CXRs and the diagnosis of different manifestations of TB. Moreover,
class activation mapping is employed to visualize the localization of the detected manifestations
on CXRs and make the diagnosis results visually convincing. The proposed methods can assist
doctors and radiologists in making well-informed decisions during the detection of TB.
Keywords
Tuberculosis, chest X-ray, automatic detection, deep CNN model, artificial bee colony algorithm,
ensemble CNN model, class activation mapping
Acknowledgments
First of all, I would like to express my sincere thanks to my supervisor, Dr. Kalpdrum Passi, for
the time, advice, and resources he has generously shared with me. In the early stages of my master's
study, he provided me with insightful guidance on the direction of my research. Throughout my
studies, his great patience and consistent support helped me overcome many difficulties
step by step. Without him, I would not have been able to complete this work.
Secondly, I want to thank my fantastic lab mate, Stefan Klaassen. Thank you for spending
time brainstorming ideas that enriched my research and for making the lab an enjoyable
place to work throughout the years. I would also like to give my special thanks to my boyfriend,
Yukun Shi, who guided me through the problems I encountered and comforted
me with great patience when I was feeling low.
Finally, I want to express my appreciation to my parents. With their understanding, support, and
consistent encouragement, both emotional and financial, I found the courage and
determination to pursue my second master's degree in Canada. There is no me without you.
Table of Contents
Abstract……………………………………………............................. iii
Acknowledgments…………………………………………………… v
Table of Contents…………………………………………………….. vi
List of Figures………………………………………………………... ix
List of Tables…………………………………………………………. xi
1 Introduction…………………………………………………………. 1
1.1 Research Background and Motivations…………………………... 1
1.2 Computer-aided Detection of Medical Images………………........ 2
1.3 Medical Image Preprocessing…………………………………….. 4
1.4 Image Classification and Disease Localization Techniques……… 4
1.5 Thesis Objectives and Outlines…………………………………... 6
2 Literature Review…………………………………………………... 8
3 Chest X-Ray Datasets and Image Enhancement Methods………. 13
3.1 Dataset Selection………………………………………………… 13
3.1.1 Montgomery County Chest X-Ray Dataset……………. 13
3.1.2 Shenzhen Hospital Chest X-Ray Dataset………………. 14
3.1.3 NIH Chest X-Ray8 Dataset…………………………….. 15
3.2 Image Enhancement Methods……………………………………. 18
3.2.1 Histogram Equalization (HE)…………………………... 18
3.2.2 Contrast Limited Adaptive Histogram Equalization (CLAHE)… 19
4 Deep Learning Models For Image Classification…………………. 22
4.1 Deep Learning……………………………………………………. 22
4.2 Convolutional Neural Networks (CNN)………………………….. 23
4.2.1 Basic CNN Structure……………………………………. 24
4.2.2 CNN Working Mechanism……………………………… 28
4.2.3 Explicit Training of CNN……………………………….. 30
4.3 Significance of Applying CNN in Medical Image Recognition…... 31
4.4 Advanced CNN Models Used in the Experiment…………………. 32
4.4.1 VGGNet………………………………………………… 32
4.4.2 GoogLeNet Inception Model…………………………… 33
4.4.3 ResNet…………………………………………………... 34
5 TB Detection via Improved CNN Models and Artificial Bee
Colony Fine-Tuning………………………………………………… 36
5.1 Transfer Learning…………………................................................ 36
5.2 Modifications of Advanced CNN Models………………………... 37
5.2.1 Modifications on CNN Architecture……………………. 37
5.2.2 Model Division with Different Learning Rate…………... 38
5.3 Fine-Tuning the CNN Model via Artificial Bee Colony………….. 39
5.3.1 Artificial Bee Colony…………………………………… 39
5.3.2 CNN Model Fine-Tuning via Artificial Bee Colony……. 43
5.4 Experiment Settings……………………………………………… 46
5.4.1 Experiment Description………………………………… 46
5.4.2 Ratio Comparison………………………………………. 47
5.4.3 Parameter Setting………………………………………. 48
5.5 Results and Discussion…………………………………………… 50
5.5.1 CNN Modification and Division with Different Learning Rates… 50
5.5.2 Fine-Tuning the Modified CNN Models via ABC……… 61
5.5.3 Discussion and Conclusion……………………………… 66
6 Increasing Accuracy of TB Detection via Ensemble Model………. 67
6.1 Ensemble Learning……………………………………………… 67
6.2 Ensemble Combination Methods Used for TB Detection……….. 69
6.2.1 Linear Averaged Based Ensemble……………………… 69
6.2.2 Voting Based Ensemble………………………………… 70
6.3 Experiment Descriptions…………………………………………. 71
6.4 Evaluation Metrics……………………………………………….. 71
6.5 Results and Discussion………..………………………………….. 72
6.5.1 Lung Abnormality Detection……………………………. 72
6.5.2 TB Related Disease Diagnosis………………………….. 85
6.6 Conclusion……………………………………………………….. 89
7 Disease Localization via Class Activation Mapping………………. 90
7.1 Class Activation Mapping………………………………………... 91
7.2 Experiment Setup………………………………………………… 93
7.3 Results and Analysis……………………………………………… 93
7.4 Conclusion……………………………………………………….. 102
8 Conclusion and Future Work……………………………………… 103
8.1 Conclusions………………………………………………………. 103
8.2 Future Work………………………………………………………. 106
References…………………………………………………………… 107
Appendix…………………………………………………………….. 113
Appendix A: Public Chest X-Ray Dataset…………………………… 113
Appendix B: Thesis Source Code…………………………………….. 113
List of Figures
Figure 1-1 TB Diagnosis Pipeline……………………………………….. 6
Figure 3-1 Sample Raw Images in Montgomery County CXR Dataset………….. 13
Figure 3-2 Sample Clinical Readings for CXR Images…………………………. 14
Figure 3-3 Sample Raw Images in Shenzhen Hospital CXR Dataset……………. 14
Figure 3-4 Sample Raw Images in NIH ChestX-Ray8 Dataset………………….. 15
Figure 3-5 Sample images with bad quality in NIH ChestX-Ray8 dataset………. 16
Figure 3-6 Sample CXR and its enhanced results together with the
corresponding histogram…………………………………………….. 20
Figure 4-1 CNN structural evolution map………………………………………. 24
Figure 4-2 CNN structure based on LeNet-5……………………………………. 25
Figure 4-3 Process of convolution………………………………………………. 25
Figure 4-4 Classic pooling working principles………………………………….. 27
Figure 4-5 Fully connected layer neuron schematic diagram……………………. 28
Figure 4-6 Inception model with dimension reduction………………………….. 34
Figure 4-7 Shortcut connection of the residual block……………………………. 35
Figure 5-1 Modifications on CNN architecture…………………………………. 37
Figure 5-2 CNN division with different learning rates…………………………... 38
Figure 5-3 Flowchart of artificial bee colony algorithm………………………… 42
Figure 5-4 Fine-tune the trained CNN model via artificial bee colony algorithm.. 45
Figure 5-5 Averaged Accuracy Comparison on Montgomery County CXR
Dataset for Abnormality Detection………………………………….. 53
Figure 5-6 Averaged Accuracy Comparison on Shenzhen Hospital CXR Dataset
for Abnormality Detection…………………………………………... 55
Figure 5-7 Averaged Accuracy Comparison on NIH Chest X-Ray8 Dataset for
Abnormality Detection……………………………………………… 58
Figure 5-8 Averaged Accuracy Comparison on NIH Chest X-Ray8 Dataset for
TB Related Disease Detection……………………………………….. 60
Figure 6-1 Ensemble Model Structure…………………………………………... 68
Figure 6-2 Overfitted model and linear averaged model………………………… 70
Figure 7-1 Ensemble model structure…………………………………………… 92
Figure 7-2 Diagnosis and localization of consolidation…………………………. 94
Figure 7-3 Diagnosis and localization of effusion………………………………. 95
Figure 7-4 Diagnosis and localization of fibrosis……………………………….. 96
Figure 7-5 Diagnosis and localization of infiltration……………………………. 97
Figure 7-6 Diagnosis and localization of mass………………………………….. 98
Figure 7-7 Diagnosis and localization of nodule………………………………… 99
Figure 7-8 Diagnosis and localization of pleural thickening…………………….. 100
List of Tables
Table 5-1 Hardware Deployments……………………………………………... 48
Table 5-2 Image Separations for Abnormality Diagnosis……………………… 48
Table 5-3 Original CXR Distribution with Specific TB Manifestations in Chest
X-Ray8………………………………………………………………. 49
Table 5-4 Augmented CXR Distribution in Chest X-Ray8 for TB Manifestations
Diagnosis……………………………………………. 51
Table 5-5 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=7:3 on Montgomery County CXR Dataset for Abnormality
Detection…………………………………………………………….. 51
Table 5-6 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=8:2 on Montgomery County CXR Dataset for Abnormality
Detection…………………………………………………………… 52
Table 5-7 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=9:1 on Montgomery County CXR Dataset for Abnormality
Detection…………………………………………………………….. 52
Table 5-8 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=7:3 on Shenzhen Hospital CXR Dataset for Abnormality
Detection……………………………………………………………... 54
Table 5-9 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=8:2 on Shenzhen Hospital CXR Dataset for Abnormality
Detection…………………………………………………………….. 54
Table 5-10 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=9:1 on Shenzhen Hospital CXR Dataset for Abnormality
Detection…………………………………………………………….. 55
Table 5-11 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=7:3 on NIH Chest X-Ray8 Dataset for Abnormality
Detection…………………………………………………………….. 56
Table 5-12 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=8:2 on NIH Chest X-Ray8 Dataset for Abnormality
Detection…........................................................................................... 57
Table 5-13 Valid Accuracy for Each Epoch During Training with Train/Valid
Ratio=9:1 on NIH Chest X-Ray8 Dataset…………………………... 57
Table 5-14 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=7:3 on NIH Chest X-Ray8 Dataset for TB Related Disease
Detection…………………………………………………………….. 59
Table 5-15 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=8:2 on NIH Chest X-Ray8 Dataset for TB Related Disease
Detection…………………………………………………………….. 59
Table 5-16 Valid Accuracy of Each Epoch During Training with Train/Valid
Ratio=9:1 on NIH Chest X-Ray8 Dataset for TB Related Disease
Detection…………………………………………………………….. 60
Table 5-17 Ratio Validation and Testing Accuracy Results on Montgomery
County Chest X-Ray Dataset for Abnormality Detection……………. 62
Table 5-18 Ratio Validation and Testing Accuracy Results on Shenzhen Hospital
Chest X-Ray Dataset for Abnormality Detection……………. 63
Table 5-19 Ratio Validation and Testing Accuracy Results on NIH Chest X-Ray8
Dataset for Abnormality Detection…………………………………... 64
Table 5-20 Ratio Validation and Testing Accuracy Results on NIH Chest X-Ray8
Dataset for TB Related Disease Detection…………………………… 65
Table 6-1 Ratio Validation and Testing Accuracy Results on Montgomery Chest
X-Ray Dataset……………………………………………………….. 73
Table 6-2 Statistical Model Analysis with Train/Valid Ratio = 7:3 on
Montgomery Chest X-Ray Dataset…………………………………... 74
Table 6-3 Statistical Model Analysis with Train/Valid Ratio = 8:2 on
Montgomery Chest X-Ray Dataset………………………………….. 75
Table 6-4 Statistical Model Analysis with Train/Valid Ratio = 9:1 on
Montgomery Chest X-Ray Dataset………………………………….. 76
Table 6-5 Ratio Validation and Testing Accuracy Results on Shenzhen Hospital
Chest X-Ray Dataset…………………………………………………. 77
Table 6-6 Statistical Model Analysis with Train/Valid Ratio = 7:3 on Shenzhen
Hospital Chest X-Ray Dataset……………………………………….. 78
Table 6-7 Statistical Model Analysis with Train/Valid Ratio = 8:2 on Shenzhen
Hospital Chest X-Ray Dataset………………………………………. 79
Table 6-8 Statistical Model Analysis with Train/Valid Ratio = 9:1 on Shenzhen
Hospital Chest X-Ray Dataset……………………………………….. 80
Table 6-9 Ratio Validation and Testing Accuracy Results on NIH Chest X-Ray8
Dataset……………………………………………………………….. 81
Table 6-10 Statistical Model Analysis with Train/Valid Ratio = 7:3 on Chest X-
Ray8 Dataset…………………………………………………………. 82
Table 6-11 Statistical Model Analysis with Train/Valid Ratio = 8:2 on NIH Chest
X-Ray8 Dataset………………………………………………………. 83
Table 6-12 Statistical Model Analysis with Train/Valid Ratio = 9:1 on NIH Chest
X-Ray8 Dataset………………………………………………………. 84
Table 6-13 Ratio Validation Accuracy and Testing Results on NIH Chest X-Ray8
Dataset for Specific TB Related Disease Diagnosis…………………. 85
Table 6-14 AUC Scores with Train/Valid Ratio = 7:3 on NIH Chest X-Ray8
Dataset……………………………………………………………….. 86
Table 6-15 AUC Scores with Train/Valid Ratio = 8:2 on NIH Chest X-Ray8
Dataset……………………………………………………………….. 87
Table 6-16 AUC Scores with Train/Valid Ratio = 9:1 on NIH Chest X-Ray8
Dataset……………………………………………………………….. 88
Chapter 1
Introduction
1.1 Research Background and Motivations
Tuberculosis (TB) is one of the most common and deadliest infectious diseases worldwide. In
modern society, TB has become the leading infectious killer, followed by malaria and
HIV/AIDS.
According to the World Health Organization (WHO), in 2017, 10 million people fell ill with TB
and 1.6 million died from the disease [1]. Most of the people who developed TB
come from developing countries with poor healthcare resources and medical infrastructure. Over
half of the deaths occurred because of late detection, which caused patients to miss the best
therapeutic opportunity. Research shows that with earlier diagnosis of TB and proper treatment,
the death rate can be greatly reduced. Thus, early detection of TB is critical for improving
disease prevention, mitigating disease transmission, and minimizing the death rate.
For the detection of TB, the best-known imaging method is computed tomography (CT). However,
considering the radiation dose, cost, and availability, as well as the ability to reveal unsuspected
pathologic alterations, the earliest diagnosis of TB is confirmed via chest X-rays (CXRs) in most
cases. Unlike other lung diseases, which show a single manifestation in CXRs, TB
is a more complicated disease with multiple manifestations such as consolidation [2],
effusion [3], fibrosis [4], infiltration [5], mass [6], nodule [7] and pleural thickening [8]. These large
variations in pathology increase the difficulty of TB detection and therefore influence the accuracy
of the judgment given by doctors. Besides, it is hard to distinguish lung abnormalities
from soft tissues with similar textures without professional training and long-term experience. In
resource-poor and marginalized areas, due to the lack of healthcare funding and advanced
medical infrastructure, the number of professionally trained radiologists is very
limited and the CXRs generated are of low quality, which delays TB detection
and degrades diagnostic accuracy. Moreover, the potential fatigue brought on by the
heavy workload of reading CXRs makes it even harder for human experts to finish their
task with stable efficiency and quality.
Therefore, methods that can reduce the time delay in TB diagnosis while maintaining high quality
at a lower economic cost are needed to improve treatment and minimize the spread of
the disease at an early stage.
1.2 Computer-aided Detection of Medical Images
As the main application of medical image pattern recognition technology, computer-aided
detection (CAD) of medical images is an interdisciplinary technology that employs
principles from various fields such as medical science, computer science, mathematics and
statistics. Relying on the high-speed computing power of modern computers, CAD has become a
powerful automated tool for pattern recognition, medical information processing and analysis.
In general, the application of CAD to preliminary disease diagnosis based on medical images has
three main advantages. First, data processing via computer is more effective: the automatic
processing of quantitative data ensures the quality of the task while greatly improving efficiency.
Secondly, with the development of science and technology, the cost of computing
hardware keeps decreasing while its general performance continues to improve. Therefore,
applying CAD tools for medical purposes is cost-effective, which especially benefits people
living in resource-poor areas. Moreover, different doctors may give different diagnoses for
the same CXR because of their differing medical experience and understanding of
certain diseases. Under the influence of these subjective factors, it is hard to reach an objective and
unbiased judgement. With the assistance of such an intelligent system, medical information exchange
among human experts and doctors in different regions can be achieved, which may help generate
a well-informed decision.
To build a CAD system that achieves satisfying diagnostic results, the selection and extraction
of useful pathogenic features are critical. Over the past few decades, researchers have explored
different algorithms during the development of automated systems. For example, the scale-invariant
feature transform (SIFT) algorithm [9] has been studied and implemented to detect local geometric
features of an image, and local binary patterns (LBP) [10] have been proposed and applied to
extract texture features. However, traditional feature selection algorithms mainly depend on the
manual extraction of patterns that may contain useful information. This manual selection
process is time-consuming. Moreover, as medical image data grow and the number of disease types
keeps increasing, problems such as poor transferability across datasets and
unstable performance on newly generated data have prevented CAD systems from
generating high-quality diagnostic decisions.
Nowadays, with rapid developments in deep learning, deep models have continuously surpassed
traditional recognition algorithms and achieved superhuman performance in image-based
classification and recognition problems. The superior ability to automatically extract useful
features from the inherent characteristics of data makes deep learning the first choice for solving
medical imaging problems. So far, CAD systems embedded with deep learning algorithms have been
widely studied and applied for disease prediction and for highlighting suspicious features to help
maintain the quality of diagnosis.
1.3 Medical Image Preprocessing
Different medical imaging infrastructures provide medical images of different qualities. To build
a CAD system with good transferability and a high quality of diagnosis, image preprocessing has
become an indispensable step for improving the overall quality of the data [11]. The main purpose
of image preprocessing is to enhance features of the images globally by suppressing unwanted
distortions and making the regions of interest more obvious.
Among all image preprocessing methodologies, image enhancement, which encompasses
processes for editing images, has great advantages for the general improvement of grayscale
images. It mainly focuses on highlighting the information of interest within the image, enhancing
the relative sharpness of the image by improving visual effects, and making the image more
conducive to computer processing.
Two image enhancement methods, histogram equalization and contrast limited adaptive histogram
equalization, are applied in our experiment and compared in parallel to select the better
option for the image preprocessing step. Histograms of pixel values for each method have also
been recorded for analysis.
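To make the first of these methods concrete, global histogram equalization can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the thesis implementation; in practice, library routines such as OpenCV's equalizeHist (HE) and createCLAHE (CLAHE) are the usual choices.

```python
import numpy as np

def equalize_hist(img: np.ndarray) -> np.ndarray:
    """Global histogram equalization for an 8-bit grayscale image.

    Each gray level is remapped through the normalized cumulative
    distribution function, spreading pixel intensities over [0, 255].
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-zero CDF value
    # Build a 256-entry lookup table that flattens the histogram.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]
```

CLAHE differs by applying this remapping per tile (e.g. 8×8 regions) with a clip limit on the histogram, which avoids over-amplifying noise in near-uniform lung fields.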
1.4 Image Classification and Disease Localization Techniques
In the deep learning field, disease diagnosis is achieved through medical image classification via
convolutional neural network (CNN) models. In general, a deeper neural network architecture
helps achieve better prediction accuracy. However, with increasing depth and complexity of the
network model, accuracy becomes saturated and then degrades rapidly; this is known as the
vanishing gradient problem [12, 13]. Advanced CNN models such as VGGNet, InceptionNet and
ResNet, which achieved good performance in world image classification competitions,
successfully reduce the negative influence of this problem through their unique model structures.
Thus, improved CNN models based on these advanced architectures, with parameters previously
trained on different categories of everyday objects, become the first choice for building a CAD
system for disease detection.
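The shortcut connection that lets ResNet train very deep networks can be sketched in PyTorch. This is a toy illustration of the general idea, not the thesis's model code: the block learns a residual F(x) and outputs F(x) + x, so gradients can flow through the identity path.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block with an identity shortcut connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        # The shortcut: add the input back before the final activation.
        return self.relu(out + x)
```

Stacking such blocks (with batch normalization and occasional downsampling, omitted here for brevity) yields architectures like ResNet34, ResNet50 and ResNet101.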
Deep learning algorithms require the training and testing data to share the same feature set and
distribution. However, this assumption may not hold in many real-world applications. The
performance of the learners can be enhanced by using knowledge transfer [14, 15], which
transfers knowledge from one medical image dataset to another. In our experiment,
coarse-to-fine transfer learning is used to train CXRs in a specific dataset for TB
diagnosis on different advanced CNN models. Details will be given in Chapters 4, 5 and 6.
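A rough sketch of this transfer-learning setup in PyTorch follows. The tiny backbone below is a hypothetical stand-in for a real pre-trained network (e.g. ResNet50 loaded with ImageNet weights via torchvision); the point illustrated is freezing the pre-trained layers and training only a new classification head on CXRs.

```python
import torch
import torch.nn as nn

# Hypothetical toy backbone standing in for a pre-trained CNN;
# in practice pre-trained weights would be loaded here.
backbone = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(8, 2)  # new head: 2 classes (normal vs. abnormal CXR)

# Freeze the "pre-trained" backbone; only the new head receives updates.
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(backbone, head)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(4, 1, 64, 64)  # a batch of 4 single-channel "CXRs"
logits = model(x)              # per-class scores, shape (4, 2)
```

Fine-tuning later unfreezes some or all backbone layers at a smaller learning rate, which is the coarse-to-fine idea referred to above.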
One big problem when applying CNN models for medical purposes is the “black box” working
mechanism, which makes the results hard to interpret. Moreover, for disease diagnosis via
medical images, people pay attention not only to the predicted result but also to the localization
of suspected abnormalities. In our study, class activation mapping is implemented to highlight the
area containing the diseased region, to make the predicted result easier to understand and thus
further improve the general performance of the CAD system.
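At its core, class activation mapping weights the final convolutional feature maps by the fully connected weights of the target class and sums over channels, producing a heat map over the image. A minimal NumPy sketch, assuming the feature maps and weights have already been extracted from a trained network:

```python
import numpy as np

def class_activation_map(features: np.ndarray,
                         fc_weights: np.ndarray,
                         cls: int) -> np.ndarray:
    """Compute a CAM for class `cls`.

    features   : final conv feature maps, shape (C, H, W)
    fc_weights : weights of the final fully connected layer, shape (n_classes, C)
    Returns a (H, W) map normalized to [0, 1] for overlaying on the CXR.
    """
    # Weighted sum over channels: sum_c w[cls, c] * features[c]
    cam = np.tensordot(fc_weights[cls], features, axes=1)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

The normalized map is typically upsampled to the input resolution and rendered as a colour overlay to localize the suspected manifestation.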
1.5 Thesis Objectives and Outlines
The primary objective of this thesis is to develop an efficient CAD system for the detection and
localization of TB from CXRs. As shown in Figure 1-1, the diagnosis of TB mainly comprises
four steps: CXR image preprocessing, preliminary detection of suspected TB patients through
abnormality checking, identification of the specific TB manifestations, and localization of the
suspected disease on CXRs.
Figure 1-1: TB Diagnosis Pipeline
In this study, we mainly focus on improving the accuracy of disease detection as well as the
stability of the general performance by modifying the structures of six different advanced CNN
models, adding a fine-tuning step via the artificial bee colony algorithm, and ensembling the
trained CNN models with two methods. Moreover, class activation mapping is implemented to
visualize the detection results effectively and efficiently.
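At prediction time, the two ensemble combination methods studied later (linear averaging and voting, Chapter 6) reduce to a few lines. The sketch below assumes each trained model outputs a per-class probability array of shape (samples, classes); it illustrates the combination rules, not the thesis's exact implementation.

```python
import numpy as np

def average_ensemble(prob_list):
    """Linear-average ensemble: mean of per-model class probabilities."""
    return np.mean(prob_list, axis=0)

def vote_ensemble(prob_list):
    """Majority-vote ensemble over each model's argmax predictions."""
    votes = np.argmax(prob_list, axis=2)          # (models, samples)
    n_classes = prob_list[0].shape[1]
    return np.array([
        np.bincount(votes[:, i], minlength=n_classes).argmax()
        for i in range(votes.shape[1])
    ])
```

Averaging tends to smooth out individual models' overconfident errors, while voting is robust when one model's probabilities are poorly calibrated.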
The thesis is organized as follows:
Chapter 1 gives an overall introduction to the thesis, covering the background, motivations,
research work and objectives of TB diagnosis via CXRs. A literature review of previous research
on medical image classification and disease localization is provided in Chapter 2.
The CXR datasets used in this study and the image enhancement methods for image
preprocessing will be introduced in Chapter 3.
In Chapter 4, the basic concepts of deep learning will be presented, followed by a detailed
explanation of CNN model structures and their internal working mechanisms.
Chapter 5 will propose a modified structure based on a series of selected advanced CNN models
(VGG16, VGG19, Inception V3, ResNet34, ResNet50 and ResNet101), and will introduce the
artificial bee colony algorithm, a metaheuristic optimization method used during the
fine-tuning of the trained CNN models.
The concept of ensemble learning, which focuses on improving the stability and general
performance of TB detection, will be introduced in Chapter 6.
Research work on the visualization of disease location from CXRs will be discussed in detail in
Chapter 7.
Conclusions and future work will be presented in Chapter 8.
Chapter 2
Literature Review
The main task of computer-aided diagnosis in the medical field is to assist doctors in the
interpretation of medical images. Pattern recognition technology plays a critical role in
building a disease diagnosis system. Different extraction methods for candidate
features such as texture, homogeneity, contrast, shape and outline, as well as their combinations,
have been studied and applied by researchers to various diseases. For example, Khuzi et al.
employed the gray-level co-occurrence matrix (GLCM), a texture descriptor based on the spatial
relationships between pixel pairs, to identify masses in mammograms [16]. Local binary patterns
have been discussed and implemented in [17] to produce normality/pathology decisions based on
chest X-rays (CXRs). In [18], Yang successfully applied gray-scale invariant features for the
detection of tumors in breast ultrasound images.
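As an illustration of such hand-crafted texture descriptors, the basic 3×3 LBP operator thresholds each pixel's eight neighbours against the centre pixel and packs the results into an 8-bit code. This is a generic sketch of the operator, not necessarily the exact LBP variant used in [17].

```python
import numpy as np

def lbp_3x3(img: np.ndarray) -> np.ndarray:
    """Basic local binary pattern over a 3x3 neighbourhood.

    Each interior pixel receives an 8-bit code: bit k is set when the
    k-th neighbour is >= the centre pixel. Output shape is (H-2, W-2).
    """
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    code = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Neighbours enumerated clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code += (neighbour >= centre).astype(np.uint8) * np.uint8(1 << bit)
    return code
```

A histogram of these codes over an image region then serves as the texture feature fed to a classifier.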
However, traditional feature extraction methods are problem-specific and rely mainly on the
manual processing of medical images. With the constant discovery of new diseases and the
continuously increasing number of CXRs generated every day, the inability to represent
high-level concepts and poor efficiency limit such models' ability to
generalize as the data continue to expand. These problems have been addressed by the advent
of CNNs. In the past decade, researchers have conducted many in-depth studies on the application
of CNN models to disease diagnosis from various types of medical images. In [19], a deep CNN
model was applied to identify distinctions between Alzheimer's and healthy
brains from magnetic resonance imaging (MRI) as well as functional MRI data for both clinical
and research purposes. The paper reported that this architecture provides a significant improvement
by achieving a high and reproducible accuracy rate of over 95%. Benjamin Q. Huynh presented
an improved CNN-based deep learning model in [20] and applied it to digital mammographic
tumor detection; the concept of transfer learning was also proposed and used in that paper.
[21] presented a 3D CNN model for the diagnosis of attention deficit hyperactivity disorder
(ADHD) through the recognition and analysis of local spatial patterns in brain MRIs. As for
lung disease detection, a densely structured deep CNN model was developed and employed
in [22] for the diagnosis of interstitial lung diseases from CT images. The proposed model can
distinguish six different disease manifestations, together with healthy cases, with an accuracy of
over 80%. [23] compared the performance of a CNN against a traditional SVM classifier on
the recognition of five interstitial lung diseases based on CT image data and regions
of interest provided by radiologists. Results showed that the CNN achieved significantly improved
accuracy and greater efficiency compared to the SVM classifier. In [24], Jaiswal et al. implemented
a Mask R-CNN model that combines both global and regional features for the identification
of pneumonia. Their approach achieved a diagnostic accuracy of over 90%, and a visible
localization of the disease is given using the regional information extracted from the bounding
boxes.
Apart from lung diseases with single manifestations, CNNs have also shown good ability in the
diagnosis of TB. In [25], two deep CNN models, AlexNet and GoogLeNet, were applied to
classify the chest radiographs in the Shenzhen Hospital Chest X-Ray dataset and the
Montgomery County Chest X-Ray dataset as pulmonary TB or healthy cases. In that study, the
datasets were split into training, validation and testing sets in proportions of 68.0%, 17.1% and
14.9% respectively. Receiver operating characteristic (ROC) curves and areas under the curve
(AUC) were used to analyze the overall performance statistically. In their experiments, the best
classifier achieved an impressive AUC of 0.99. Later, Pasa et al. in [26] presented the automated
diagnosis as well as the localization of TB on the same two datasets using a deep CNN with
shortcut connections. The best AUC achieved in their experiment was 0.925, not as good as the
previous work, but the localization results they provide are quite impressive. Considering that the
two previously used datasets are small and not representative enough, Hwang et al. [27]
expanded their research from these two datasets to a larger dataset containing over 10,000 images.
In their experiments on diagnosing TB, they achieved accuracies of 83.1%, 83.4% and 83.0%
on the three datasets respectively. However, none of these three works tested the models'
performance on diagnosis among multiple TB-related diseases, a more difficult and more
practical task that needs to be solved. In 2017, [28] proposed a 121-layer dense CNN architecture
and tested the model by training it on the largest publicly available chest radiography dataset at
the time, the NIH ChestX-ray14 dataset, to detect over ten different lung diseases. The
performance achieved by the CNN model was compared to that of radiologists, and the result
measured by the F1 metric was superior to the human experts: the proposed
model achieved an F1 score of 0.435, higher than the average of 0.387
given by the human experts. Yet, without the classification accuracy for each
disease, the robustness of their result is unknown.
Most of the work mentioned above focuses on introducing general applications of CNN to
disease diagnosis via direct implementation, as well as on improving detection accuracy by
employing new methods at certain stages of the classification process or by making changes to the
CNN model structures. Shin, in [29], discussed using off-the-shelf CNN features pre-trained on a
natural image dataset and then training with medical images, fine-tuning the model to complete
lung disease diagnosis tasks. The discussion of major techniques and of the process of applying
CNN to medical image recognition, as well as the demonstration of the potential advantages of
transfer learning, inspired us. In our study, we explore three important but previously
understudied aspects to improve the stability and overall performance of TB diagnosis: the inner
structure of the CNN model, the learning and fine-tuning process of CNN models pre-trained on a
natural image dataset using the provided medical image data, and the organization of the outputs
given by the trained CNN models. To evaluate the performance of classifiers, [30] recommends a
comprehensive assessment via accuracy, precision, recall, F-measure and AUC. For the diagnosis
of multiple diseases, the confusion matrix is recommended in [31] to summarize intuitive
classification results.
In image classification tasks, the first problem that needs to be considered is image processing, a
basic but critical data preprocessing step to improve the overall quality of the data. [32] introduces
various histogram-based image enhancement methods for the processing of medical images,
including histogram equalization (HE) and contrast limited adaptive histogram equalization
(CLAHE). CLAHE was applied in [33] for the detection of pulmonary tuberculosis via CNN
and achieved a superior result compared to the state of the art.
In [34], metaheuristic algorithms are introduced as modern optimization techniques. Three
metaheuristic approaches, simulated annealing, differential evolution and harmony search, were
implemented on CNN models to classify hand-written digits and pictures of daily objects.
Compared to the CNN models trained with the original optimizers, the proposed method shows an
accuracy improvement of up to 7.14%. [35] presents an introduction to the artificial bee colony
(ABC) algorithm, a metaheuristic method inspired by the foraging behavior of honeybees, and its
advantage for edge detection in CNN-based image classification. Moreover, research on applying
metaheuristic algorithms to mass detection from mammograms was done in [36] and
achieved improved classification results. From these studies, metaheuristic algorithms have
become a possible solution to improve the performance of CNN models in the medical field.
When we move our attention to further improving the overall performance and stability of
CNN models by organizing the generated outputs, Islam's paper [37] caught our attention. This paper
shows that the ensemble of multiple CNN models has an inherent advantage in constructing a non-
linear decision-making system and leads to a promising improvement in visual recognition.
Moreover, they also present lung disease localization using heat maps obtained from
occlusion sensitivity. [38] proposed an object detection method using class activation mapping
which effectively makes use of the trained parameters of the CNN model and gives the location
of the detected object on the given image, making the results of a CNN model easier to
understand.
In our study, we propose a computer-aided detection model for TB diagnosis with various
improvements based on three aspects corresponding to the model structure and the
learning process of CNN models. We cover layer modifications, model fine-tuning via the artificial
bee colony (ABC) algorithm, and overall performance improvement via a CNN model ensemble
method. An exploration of how class activation mapping works in visualizing the location of TB
manifestations for single and ensembled CNN models is provided for the disease localization task.
Chapter 3
Chest X-Ray Datasets and Image Enhancement Methods
3.1 Dataset Selection
3.1.1 Montgomery County Chest X-Ray Dataset
The Montgomery County Chest X-Ray dataset [39] was created by the National Library of Medicine
together with the Department of Health and Human Services, Montgomery County, Maryland,
USA, from patients who joined the tuberculosis (TB) screening program. This dataset contains
138 frontal posteroanterior CXR images in Portable Network Graphics (PNG) format, in which 80
images belong to normal cases and the other 58 are diagnosed as having TB manifestations. Images
in this dataset are 12-bit grey level images with a size of either 4020 × 4892 or 4892 × 4020
pixels. Some sample raw images from this dataset are given in Figure 3-1.
Figure 3-1: Sample Raw Images in Montgomery County CXR Dataset
The image file names are coded as ‘MCUCXR_****_X.png’, where ‘****’ represents a unique
ID number for each patient, and X can be either 0 or 1, representing the normal or abnormal
case respectively. The clinical readings that include the patient’s age, sex and abnormality information
are saved as text files with the same file names as the images. Some examples of clinical
readings are given below.
Figure 3-2: Sample Clinical Readings for CXR Images
Before the experiments, images with a large black background were cropped, and all
images were resized to 512 × 512 pixels.
3.1.2 Shenzhen Hospital Chest X-Ray Dataset
The Shenzhen Hospital Chest X-Ray dataset [39, 40, 41] was created by the National Library of
Medicine, Maryland, USA in collaboration with Shenzhen No.3 People’s Hospital and Guangdong
Medical College in China. This dataset contains 662 frontal posteroanterior CXR images of various
sizes, among which 326 are diagnosed as normal cases and 336 are cases with TB
manifestations. Since it was created by the same institution as the Montgomery County Chest X-
Ray dataset, the image format, file names and clinical readings follow the same rules.
Figure 3-3 presents some sample raw images in Shenzhen Hospital Chest X-Ray dataset.
Figure 3-3: Sample Raw Images in Shenzhen Hospital CXR Dataset
Unlike the Montgomery County Chest X-Ray dataset, no image in this dataset contains a large black
background. Thus, they were resized to 512 × 512 pixels without cropping before the
experiments.
3.1.3 NIH Chest X-Ray8 Dataset
The NIH ChestX-Ray8 dataset [42] is so far one of the largest public chest x-ray databases for
thorax disease detection studies. This dataset was extracted from the clinical PACS database
at the National Institutes of Health Clinical Center and contains 112,120 frontal-view chest x-ray
images of 30,805 unique patients with 14 thoracic pathologies (atelectasis, consolidation,
infiltration, pneumothorax, edema, emphysema, fibrosis, effusion, pneumonia, pleural thickening,
cardiomegaly, nodule, mass and hernia). All images in this dataset have already been preprocessed
to the same size of 1024 × 1024 pixels for convenience. Figure 3-4 presents some sample raw
images from this dataset, including normal as well as TB-related cases.
Figure 3-4: Sample Raw Images in NIH ChestX-Ray8 Dataset
(Panels: No Finding, Consolidation, Effusion, Fibrosis, Infiltration, Mass, Nodule, Pleural Thickening.)
Clinical readings containing the patient’s ID, follow-up number, age, sex, view position and
abnormality information have been organized by image name and saved in one Comma Separated
Values (CSV) file called ‘Data_Entry_2017’.
Although the NIH ChestX-Ray8 dataset has been widely used among researchers in the deep
learning area for medical purposes because of its large amount of data, detailed annotations and
the wide range of thorax diseases it covers, there are still some problems.
The first and biggest problem is that the quality of the chest x-ray images varies a lot, which greatly
increases the workload of data cleaning. Images with side views, images that do not contain much
useful information in the lung area, rotated images and images with bad pixel quality all need to
be removed at the beginning. Otherwise, these ‘bad data’ will interfere with the learning process of the
deep learning models and therefore influence the overall performance on disease diagnosis.
Sample images containing the previously mentioned problems are given in Figure 3-5.
Figure 3-5: Sample images with bad quality in NIH ChestX-Ray8 dataset
Another problem is that, according to [43], since the original radiology reports were not anticipated
to be publicly shared, the disease information and labels for the images were text-mined via natural
language processing techniques with an accuracy of over 90%. The mismatched labels and images will
also bring difficulties for researchers, especially those without any medical background.
Moreover, problems such as the greatly imbalanced number of images for each thorax disease and
the different characteristics of the two view positions (posteroanterior and anteroposterior) all
need to be considered.
According to the experimental needs, only PA images with no-finding labels and the TB-related
manifestations (consolidation, effusion, fibrosis, infiltration, mass, nodule, pleural thickening) are
selected. Among them, images with bad quality as stated above were removed, and the rest
were resized to 512 × 512 pixels before the experiment.
3.2 Image Enhancement Methods
Image enhancement plays an essential role in the image processing field. In the processes of image
formation, transmission and transformation, the image quality might be reduced and features
blurred due to various external factors, which makes image recognition and analysis
considerably more difficult. Hence, attenuating the unnecessary features while highlighting the necessary ones
based on specific needs, so as to improve the visibility of images, has become the main research content of
image enhancement.
In medical image-based disease diagnosis, doctors make judgements based on specific
features displayed in the image. Generally, the human eye is sensitive to high-frequency signals
that contain most of the detail information. However, in medical images, high-frequency signals
are often embedded in a large amount of low-frequency background signal, which reduces
their visibility. Therefore, to better facilitate disease diagnosis, it is possible to enhance the
contrast by appropriately increasing the high-frequency portion of the image via image
enhancement methods.
3.2.1 Histogram Equalization (HE)
The main idea of histogram equalization (HE) is to change the pixel histogram of the original
image from a concentrated grayscale range to a uniform distribution in the whole range [44]. This
method includes the following steps:
(1) Count the total number of pixels, $n_i$, in each grayscale level of the input image, where $i =
0, 1, \ldots, 255$ indexes the possible pixel values of a grayscale image.
(2) Calculate the cumulative distribution function $P(i) = \sum_{k=0}^{i} n_k / n$, $i = 0, 1, \ldots, 255$, where $n$
is the total number of pixels. This function is guaranteed to be monotonically increasing and preserves
the dynamic range of the grayscale values during the image transformation.
(3) Obtain the equalized pixel value for each grayscale level $j$ as $j' = P(j) \times 255$, rounded to its
closest integer.
(4) Finally, replace each pixel with its corresponding equalized value to obtain the equalized
image.
By performing the nonlinear stretch and pixel value redistribution described above, the
number of pixels in each grayscale range becomes approximately the same. This more even
distribution of pixel values over the histogram helps to increase the background contrast and
therefore improves the visual effect of the image.
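The four steps above can be sketched in a few lines of NumPy (a minimal single-channel illustration on a hypothetical toy image, not the exact implementation used in this thesis):

```python
import numpy as np

def histogram_equalization(img):
    """Equalize an 8-bit grayscale image following steps (1)-(4) above."""
    # (1) count the number of pixels in each of the 256 grayscale levels
    counts = np.bincount(img.ravel(), minlength=256)
    # (2) cumulative distribution function of the pixel values
    cdf = np.cumsum(counts) / img.size
    # (3) map each grayscale level to its equalized value and round
    lut = np.round(cdf * 255).astype(np.uint8)
    # (4) replace every pixel with its equalized value
    return lut[img]

# a low-contrast toy image concentrated in [100, 120] spreads towards [0, 255]
img = np.linspace(100, 120, 64, dtype=np.uint8).reshape(8, 8)
eq = histogram_equalization(img)
print(eq.min(), eq.max())
```

The lookup table `lut` realizes the monotonic mapping of step (3), so the relative ordering of pixel values is preserved while the histogram is stretched.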
3.2.2 Contrast Limited Adaptive Histogram Equalization (CLAHE)
As a combination of a contrast-constrained method and adaptive histogram equalization,
contrast limited adaptive histogram equalization (CLAHE) [45] overcomes a problem that exists in the
HE method: HE cannot optimize the local image contrast.
The main idea of CLAHE is to perform HE on the input image through a sliding window and
combine the histograms inside and outside the window; the height of the resulting histogram is then
controlled by the clip limit value to prevent the background noise from being excessively
amplified. However, since processing the image in sub-regions may lead to unevenly
distributed pixel values, interpolation is required across the sub-regions as the very last step.
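The contrast-limiting idea can be sketched for a single tile as follows (a simplified NumPy illustration that clips the histogram and redistributes the excess before equalizing; a full CLAHE additionally processes the image in tiles and interpolates between them, so this is not a complete CLAHE implementation):

```python
import numpy as np

def clipped_equalization(img, clip_limit=1.25):
    """Histogram equalization with a clipped histogram (single tile only).

    `clip_limit` is expressed as a multiple of the mean bin height,
    mirroring the clip-limit parameter used in CLAHE.
    """
    counts = np.bincount(img.ravel(), minlength=256).astype(float)
    limit = clip_limit * counts.mean()
    # clip bins at the limit and spread the clipped excess evenly,
    # which bounds how much any single grayscale level can be amplified
    excess = np.clip(counts - limit, 0, None).sum()
    counts = np.minimum(counts, limit)
    counts += excess / 256
    cdf = np.cumsum(counts) / counts.sum()
    lut = np.round(cdf * 255).astype(np.uint8)
    return lut[img]

out = clipped_equalization(
    np.linspace(50, 200, 256, dtype=np.uint8).reshape(16, 16))
print(out.shape, out.dtype)
```

Clipping the histogram flattens its peaks, so large uniform background regions (the main source of amplified noise in HE) receive a gentler mapping.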
Figure 3-6 presents a sample CXR and its enhanced images via HE and CLAHE; the corresponding
pixel value distributions are also provided as histograms for statistical comparison.
Figure 3-6: Sample CXR and its enhanced results together with the corresponding histogram
As shown in the figure above, the CXR image enhanced by HE has a histogram with a more
equalized distribution of pixels over the total grayscale range compared to the one from the raw CXR.
The enhanced image presents a prominent view of the overall texture and blood vessels. However,
the background noise in the CXR has also been significantly amplified in this case. Meanwhile, the
CXR enhanced by CLAHE has a histogram that not only has a more uniform distribution but also
maintains the same concentration trend as in the raw CXR. The local details in the
corresponding enhanced CXR become clearer while the background noise is suppressed, so
that the total quality of the CXR image is improved.
Therefore, all CXRs are enhanced by CLAHE with a clip limit of 1.25 before the
experiment.
Chapter 4
Deep Learning Models for Image Classification
4.1 Deep Learning
As a new research field in machine learning, the concept of deep learning was first proposed by
Geoffrey Hinton et al. in [46] in 2006 to learn data features and find better data representations
through a multi-level structure and layer-wise training. The main idea is to recognize all kinds of
data, such as images, text and sound, by simulating the way the human brain learns things
through a multi-layer network structure. Unlike traditional machine learning methods, deep
learning can integrate feature extraction and categorical regression into one model and thus greatly
reduce the work of manual feature design.
Deep learning models can be classified into 3 categories based on their structures and applications:
generative deep architectures, discriminative deep architectures and hybrid deep architectures [47].
Generative deep architectures such as the Deep Boltzmann Machine (DBM) and the Deep Belief Network
(DBN) are used to describe high-level correlations among data; the category of an observed
sample is obtained through the joint probability distribution, to better calculate both prior and
posterior probabilities. Discriminative deep architectures are generally used in classification
problems to describe the posterior probability of the data; one typical example is the Convolutional
Neural Network (CNN). Hybrid deep architectures such as Recurrent Neural Networks (RNN)
combine features from both generative and discriminative structures. While solving
classification problems, they make full use of the output from the generative architecture to simplify
as well as optimize the whole model.
The superior ability to extract global features and context information from the inherent
characteristics of data makes deep learning the first choice for solving many complicated
problems. Scholars have carried out remarkable research in this field and proposed a large
number of efficient models that can be directly used and improved according to specific needs.
4.2 Convolutional Neural Networks (CNN)
The Convolutional Neural Network (CNN) is a discriminative classifier developed from the multi-layer
perceptron. In other words, given labelled training examples, the algorithm outputs the
predicted probabilities of each existing category for new data. It is designed to recognize specific
patterns directly from image pixels with minimal pre-processing. Due to its ability to perform
shift-invariant classification based on its hierarchical structure, the CNN is also known as the Shift-
Invariant Artificial Neural Network (SIANN) [48].
The earliest study related to CNN dates back to 1959, by two neuroscientists, Hubel and Wiesel. During their
experiments on cats’ visual cortex, they found that neurons used for local sensitivity and direction
selection could effectively reduce the complexity of the feedback neural network, and thus
proposed the concept of the receptive field. Later, in 1980, inspired by their idea, the Japanese scholar
Kunihiko Fukushima proposed the neocognitron model [49], which decomposes a complex visual
pattern into many sub-patterns (also known as features) and processes them through a series of hierarchically
connected feature planes. This model can be regarded as the first implementation of CNN. It
attempts to imitate the visual system so that objects with displacement or slight deformation can
still be recognized. In 1998, a multi-layer CNN structure constructed by Yann LeCun et al. [50],
LeNet5, achieved great success in the task of recognizing handwritten digits. Since 2012, CNN
has received great attention from researchers and has been widely applied in the computer
vision field.
With the continuous improvement of the computing power of computer chips, the generalization
of GPU clusters, as well as the appearance of studies on various optimization theories and
fine-tuning technologies, new CNN models have been continuously proposed with
improved structures and better performance. Figure 4-1 illustrates the development of CNN
models.
Figure 4-1: CNN structural evolution map
4.2.1 Basic CNN Structure
A complete CNN architecture is mainly composed of convolutional layers, pooling layers, and
fully-connected layers, as shown in Figure 4-2.
(Map: Hubel & Wiesel → Neocognitron → LeNet as early attempts; AlexNet, with Dropout and ReLU, as the historical breakthrough; then two lines of development, deepening the network (VGG16, VGG19, MSRANet) and improving the convolution module (Network in Network, GoogLeNet Inception V1, V2, V3, V4); and finally the integration of the two lines in ResNet and Inception-ResNet, with the ability to train deeper networks and accelerate convergence.)
Figure 4-2: CNN structure based on LeNet-5
(1) Convolutional Layer
The convolutional layer is the core building block of a CNN. Its main idea is to extract patterns
found within the local regions of the input image that are common throughout the dataset [51].
Figure 4-3: Process of convolution
As illustrated in Figure 4-3, the convolution operation is processed by sliding the kernel across the
input from left to right and top to bottom, generating the feature maps by performing the following
calculation:
$$(x * W)_{m,n} = \sum_{p=1}^{h} \sum_{q=1}^{w} x_{m+p-1,\,n+q-1}\, W_{p,q}$$

$$h_i = \sigma(x * W_i + b_i)$$

where $W_i$ is the shared weight matrix of size $w \times h$ in layer $i$, and $b_i$ is the shared bias. ‘$*$’
denotes the convolution operation, that is, summing the elementwise products between $W_i$ and the
corresponding pixel values from the output of layer $i-1$. $\sigma$ represents the activation function,
which mainly adds non-linear factors to the model to better solve more complex
problems. The most commonly used activation functions are Sigmoid, Tanh and the Rectified Linear Unit
(ReLU).
In essence, the outputs obtained from the convolutional layer are combinations of features extracted
from the receptive field with their relative positions preserved. These outputs may be further
processed by another layer with higher-level weight vectors to detect larger patterns in the
original image. During the convolution process, the shared weight vector gives a strong
response to short snippets of data with specific patterns.
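The convolution defined above can be sketched as follows (a minimal single-channel, stride-1 NumPy illustration; like most CNN frameworks, it computes the cross-correlation form of the equation, without flipping the kernel):

```python
import numpy as np

def conv2d(x, W):
    """Slide kernel W over input x (stride 1, no padding), computing
    out[m, n] = sum_{p,q} x[m+p, n+q] * W[p, q] as in the equation above."""
    h, w = W.shape
    H, Wd = x.shape
    out = np.zeros((H - h + 1, Wd - w + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            # elementwise product of the kernel and the receptive field
            out[m, n] = np.sum(x[m:m + h, n:n + w] * W)
    return out

x = np.arange(9, dtype=float).reshape(3, 3)   # 3x3 input
W = np.array([[1.0, 0.0], [0.0, 1.0]])        # 2x2 kernel
print(conv2d(x, W))                           # 2x2 feature map
```

A 3 × 3 input convolved with a 2 × 2 kernel yields a 2 × 2 feature map, matching the sliding-window example of Figure 4-3.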
(2) Pooling Layer
Another important concept in CNN is the pooling layer, which is normally placed after the
convolutional layer and provides a method of non-linear downsampling. It divides the output of
the convolutional layer into disjoint regions and provides a single summary for each region to
obtain the characteristics of the convolution.
Classic pooling methods include max pooling, mean pooling and stochastic pooling. Figure 4-4
illustrates their working principles.
Figure 4-4: Classic pooling working principles
Max pooling takes the maximum value in the local receptive region, while mean pooling averages
all those values. Stochastic pooling [52] assigns each sample point in the local receptive region
a probability value and then selects a value randomly based on these probabilities. Additional
pooling methods such as adaptive pooling, mixed pooling, spectral pooling, spatial pyramid
pooling, etc. have also been proposed based on specific needs [53].
By extracting the desired features from the local area, the pooling operation increases the tolerance to
distortion and displacement, improving fault tolerance. Moreover, the use of pooling greatly
reduces the spatial size of the data and thus improves the computational efficiency.
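Max pooling over the 4 × 4 example of Figure 4-4 can be sketched as follows (a minimal NumPy illustration):

```python
import numpy as np

def max_pool(x, k=2, stride=2):
    """Take the maximum over each k x k region of x with the given stride."""
    H, W = x.shape
    out = np.empty(((H - k) // stride + 1, (W - k) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + k, c:c + k].max()
    return out

# the 4x4 input from Figure 4-4, pooled with a 2x2 kernel and stride 2
x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool(x))   # maxima of the four regions: [[6, 8], [3, 4]]
```

Replacing `.max()` with `.mean()` yields mean pooling (for this input, [[3.25, 5.25], [2.0, 2.0]], as in the figure).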
(3) Fully Connected Layer
Before getting the classification result, there are usually one or more fully connected layers placed
at the very end of a CNN model. As with layer structures in a neural network, the fully connected layer
is composed of a certain number of neurons with no connections within a layer, while neurons between
adjacent layers are fully connected (see Figure 4-5).
Figure 4-5: Fully connected layer neuron schematic diagram
This fully connected layer structure forms a shallow multi-layer perceptron which aims at
integrating the previously extracted local feature information with categorical discrimination to
classify the input data.
4.2.2 CNN Working Mechanism
CNN uses existing samples and their corresponding labels to train the model, so that parameters
such as weights and biases can be adjusted through backpropagation of the loss during the training
process to improve the classification accuracy.
The training process is as follows:
(1) Initialize all parameters of the model to small random numbers.
(2) Randomly pick $n$ samples from the training data, $((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$, and feed
them into the model, where $x_i$, $i = 1, 2, \ldots, n$ represents the sample data and $y_i \in \{1, 2, \ldots, k\}$,
$i = 1, 2, \ldots, n$ is the one of the $k$ categories corresponding to the sample, also known as the
expected label of the $i$-th sample.
(3) Propagate the input data forward through the model layer by layer and obtain the predicted classification
results given by the model. In CNN, the most commonly used classifier for generating the result
is the SoftMax classifier. The calculation is defined as follows:

$$h_\theta(x_i) = \begin{bmatrix} P(y_i = 1 \mid x_i, \theta) \\ P(y_i = 2 \mid x_i, \theta) \\ \vdots \\ P(y_i = k \mid x_i, \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x_i}} \begin{bmatrix} e^{\theta_1^T x_i} \\ e^{\theta_2^T x_i} \\ \vdots \\ e^{\theta_k^T x_i} \end{bmatrix}$$

where $\theta$ denotes the parameters involved in the CNN model. This equation scales the resulting
$k$-dimensional vector to numbers between 0 and 1 with a total sum equal to 1. Each element of this
vector is therefore the probability that the input belongs to the corresponding class. The element
with the highest probability is selected, and the class corresponding to this probability is assigned
to the input image as the final decision of the model.
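The SoftMax calculation above can be sketched as follows (a minimal NumPy illustration on hypothetical raw scores for k = 3 classes):

```python
import numpy as np

def softmax(z):
    """Scale a k-dimensional score vector to probabilities summing to 1."""
    e = np.exp(z - z.max())   # subtracting the max improves numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])       # hypothetical raw outputs, k = 3
probs = softmax(scores)
predicted_class = int(np.argmax(probs))  # class with the highest probability
print(probs, predicted_class)
```

Subtracting the maximum before exponentiating leaves the result unchanged (the factor cancels in the ratio) while avoiding overflow for large scores.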
(4) Compare the predicted class label, $\hat{y}_i$, with the expected label, $y_i$, and calculate the cross-
entropy cost function using the following equation:

$$J(W, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where $W$ indicates the weights and $b$ represents the biases.
For all training samples, if the predicted values are close to the expected values, the cross-entropy
will be close to 0.
(5) Compute the gradients of the cost function with respect to $W$ and $b$ using the following formulas, so that
the weights and biases that contributed most to the loss are obtained:

$$\frac{\partial J}{\partial W_i} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sigma'(z)\, x_i}{\sigma(z)\bigl(1 - \sigma(z)\bigr)} \bigl(\sigma(z) - y\bigr)$$

$$\frac{\partial J}{\partial b_i} = \frac{1}{n} \sum_{i=1}^{n} \bigl(\sigma(z) - y\bigr)$$
where $\sigma(z)$ is the activation function and $\sigma(z) - y$ indicates the error between the predicted and
expected values. Therefore, as the error gets larger, the gradient keeps increasing and
the parameters are adjusted at a faster speed.
Once the gradients have been computed, the weights and biases are updated as follows so that the total
cost decreases:

$$W_i = W_i - \eta \frac{\partial J}{\partial W_i}, \qquad b_i = b_i - \eta \frac{\partial J}{\partial b_i}$$
$\eta$ in the above equations is the learning rate, a hyperparameter that is normally set manually.
A high learning rate means taking larger steps when updating the parameters, so the model
takes less time to converge to an optimal value. However, if the step is too
large, the updated values will jump too far each time, and the resulting optimum will not be
accurate enough. On the contrary, if the learning rate is too low, the model will take a long
time to converge. Therefore, the learning rate should be chosen neither too high nor
too low.
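Steps (4) and (5), together with the update rule, can be sketched for a single sigmoid output as follows (a NumPy illustration on hypothetical toy data; note that for the sigmoid activation the gradient formulas above simplify, since $\sigma'(z) = \sigma(z)(1 - \sigma(z))$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical toy binary problem: 4 samples, 2 features
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 1.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
W, b, eta = np.zeros(2), 0.0, 0.5          # parameters and learning rate

for _ in range(200):
    p = sigmoid(X @ W + b)                 # predicted probabilities
    # step (4): cross-entropy cost J(W, b)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # step (5): gradients (sigmoid case, so the sigma' factor cancels)
    grad_W = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    W -= eta * grad_W                      # update step with learning rate eta
    b -= eta * grad_b

print(loss)
```

Starting from the untrained loss of $\log 2 \approx 0.693$, repeated updates drive the cross-entropy down on this separable toy problem.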
4.2.3 Explicit Training of CNN
During the training of a CNN, the input data volume and the parameter updating method are
the main factors that influence the whole process. The most commonly used training methods are
batch training, stochastic gradient descent and mini-batch training.
In batch training, all data are fed into the model to calculate the total gradient increment, which is
then used to update the parameters. This method reduces the number of updates of the
model parameters and thus decreases the calculation cost of the model as well as shortens the
training time. Since each update involves the gradient from all input data, the parameters
move faster towards the direction in which the cost function drops. Moreover, it helps to avoid
over-fitting caused by training on a small number of samples. However, averaging over the entire
set of samples to get the gradient decreases the impact of changes provided by individual parameters,
and thus more training epochs are required. This problem is catastrophic for large datasets.
The stochastic gradient descent method feeds randomly shuffled data into the model one sample at a
time during training, generating a gradient and updating the parameters for each sample individually.
Compared to batch training, this algorithm greatly reduces the number of training iterations.
However, since each parameter update relies on a single data sample, the model
tends to optimize for individual samples rather than the general data and is thus
easily affected by over-fitting.
Mini-batch training combines the advantages of both batch training and stochastic
gradient descent. This method divides the randomly shuffled data into small batches and
calculates the gradient for each batch to carry out the parameter update. Since the datasets used for
deep learning tend to be relatively large, it has become the most commonly used training method.
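The mini-batch scheme can be sketched as follows (a minimal NumPy illustration; the data and batch size are hypothetical):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield randomly shuffled (X, y) batches, as in mini-batch training."""
    order = rng.permutation(len(X))          # randomly shuffle the data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]                 # one batch per gradient update

rng = np.random.default_rng(0)
X, y = np.arange(20).reshape(10, 2), np.arange(10)
for xb, yb in minibatches(X, y, batch_size=4, rng=rng):
    print(xb.shape, yb.shape)                # batches of 4, 4, and 2 samples
```

Each yielded batch would drive one gradient computation and parameter update; one full pass over all batches constitutes a training epoch.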
4.3 Significance of Applying CNN in Medical Image Recognition
In recent years, with the improvement of radiological medical equipment and the increasing
number of daily diagnostics, thousands of medical images are produced in hospitals every day,
which greatly increases the workload of film-reading doctors.
Traditional computer-aided detection (CAD) systems use machine learning methods such as
support vector machines (SVM) and K nearest neighbors (KNN) to help radiologists improve
diagnostic accuracy. However, most of these methods need manually extracted disease features.
With the varied and ever-changing features of lesions, previously extracted features may not
apply to new patient data. Therefore, traditional machine learning methods are not suitable as a
long-term effective solution.
With its ability to extract complex pathological features automatically from the data, given
the intrinsically required large dataset, the application of CNN in various diagnostic modalities
turns out to be an efficient solution.
4.4 Advanced CNN Models Used in The Experiment
4.4.1 VGGNet
VGGNet is a deep CNN model proposed by researchers from the Visual Geometry Group at the
University of Oxford and Google DeepMind, which aims at exploring the relationship between the
depth of a CNN and its performance.
Compared to classic CNN structures, the most prominent feature of VGGNet is the increase of
model capacity and complexity by repeatedly stacking small 3 × 3 convolution kernels
and 2 × 2 maximum pooling kernels. Moreover, [54] demonstrated that superimposing
multi-level convolution kernels reduces the number of parameters in the model and thus
the amount of calculation. For example, for a layer that has both input and output with
C channels, the number of parameters required using a 7 × 7 convolution kernel is
7 × 7 × C × C = 49C². However, by stacking three layers of 3 × 3 convolution kernels, the total number
of parameters needed is reduced to 3 × 3 × 3 × C × C = 27C².
With a deeper and more complex structure that better extracts visual data representations in a
hierarchical way, VGGNet significantly reduced the number of iterations required for convergence
as well as the error rate.
The VGGNet models used in our experiment are VGG16 and VGG19 pre-trained on the ImageNet
dataset; the model structures and parameters are provided by Torch 0.3.1.post2.
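The parameter comparison above can be checked with a few lines of arithmetic (C = 64 is a hypothetical channel count; biases are ignored):

```python
def conv_params(k, channels, layers=1):
    """Parameters of `layers` stacked k x k convolutions with `channels`
    input and output channels (biases ignored)."""
    return layers * k * k * channels * channels

C = 64
print(conv_params(7, C))      # one 7x7 layer:     49 * C^2
print(conv_params(3, C, 3))   # three 3x3 layers:  27 * C^2
```

The stacked 3 × 3 layers cover the same 7 × 7 receptive field with roughly 45% fewer parameters, while the interleaved non-linearities add representational power.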
4.4.2 GoogLeNet Inception Model
The GoogLeNet Inception model was first proposed by Christian Szegedy et al. in [55] and later
improved in [56]. The highlight of this model is the introduction of inception modules, which use a
dense structure to approximate a sparse one, improving efficiency by extracting more
features under the same amount of computation.
Before that, CNN models tended to achieve better training results simply by increasing the depth of
the network. However, problems such as overfitting on small datasets and escalating
computational complexity arise as the number of layers keeps increasing. The inception
modules (see Figure 4-6) address these problems by using multiple kernels that capture
receptive fields of various sizes, together with a 1×1 bottleneck layer that shrinks the number
of channels.
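The computational saving from the 1×1 bottleneck can be illustrated with a rough multiply-accumulate count (a sketch; the channel sizes below are illustrative assumptions, not necessarily GoogLeNet's exact configuration):

```python
def conv_macs(kernel, c_in, c_out, h, w):
    """Approximate multiply-accumulate operations of a conv layer on an h x w feature map."""
    return kernel * kernel * c_in * c_out * h * w

H = W = 28  # feature-map size (illustrative)

# direct 5x5 convolution: 192 -> 32 channels
direct = conv_macs(5, 192, 32, H, W)

# bottleneck: 1x1 first reduces 192 -> 16 channels, then 5x5 expands 16 -> 32
bottleneck = conv_macs(1, 192, 16, H, W) + conv_macs(5, 16, 32, H, W)

print(direct, bottleneck)  # the bottleneck path costs roughly an order of magnitude less
```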
Figure 4-6: Inception model with dimension reduction
The GoogLeNet Inception model used in our experiment is Inception V3 pre-trained on the
ImageNet dataset; model structure and parameters are provided by PyTorch 0.3.1.post2.
4.4.3 ResNet
As one of the most popular CNN models used across computer vision, ResNet was proposed in
[57] in 2015. It adds residual blocks (see Figure 4-7) to the original CNN structure,
which keeps the architecture simple yet deep while increasing
accuracy.
Figure 4-7: Shortcut connection of the residual block
The residual block shown above contains two mappings: an identity mapping and a residual mapping.
By establishing a direct connection between the input and output through these two mappings, each
later layer only needs to learn new features on top of the input, so even if
the residual shrinks to 0 as the model goes deeper, the identity mapping remains and the
network stays in an optimal state without losing the gradient. This effectively alleviates the
vanishing-gradient problem that exists in deep CNN models.
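The identity shortcut can be sketched in NumPy (a simplified fully connected residual block standing in for the convolutional one; weights here are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x), where F(x) is two weight layers with a ReLU between."""
    f = W2 @ relu(W1 @ x)  # residual mapping F(x)
    return relu(f + x)     # shortcut adds the identity mapping

x = np.array([1.0, 2.0, 3.0])
# if the residual mapping collapses to zero, the block passes x through unchanged
W_zero = np.zeros((3, 3))
print(residual_block(x, W_zero, W_zero))
```

This is exactly the property described above: when the residual goes to 0, the block degenerates to the identity rather than destroying the signal.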
The ResNet models used in our experiment are ResNet34, ResNet50 and ResNet101 pre-trained
on the ImageNet dataset; model structures and parameters are provided by PyTorch
0.3.1.post2.
Chapter 5
TB Detection via Improved CNN Models and Artificial Bee Colony
Fine-Tuning
5.1 Transfer Learning
Transfer learning is a process that focuses on storing the knowledge learned in solving one problem
and applying it to a correlated task [58]. This method aims at leveraging what has already been
learned to build accurate models for specific tasks more efficiently [59].
In the computer vision field, models with deep and complicated structures are expensive to train
because of their requirement for large datasets and costly hardware such as GPUs. Moreover, it
can take several weeks or longer to train a model from scratch. Thus, using a pre-trained
model, with its developed internal parameters and well-trained feature extractors, helps
increase the overall performance of the model when solving similar problems with relatively
smaller datasets.
Considering these advantages of transfer learning, the CNN models that we implemented
during the experiment have all been pre-trained on the ImageNet dataset [60], a very large dataset
containing millions of images in over one thousand categories. The models with these
pre-trained parameters are then modified and trained on CXR datasets for TB diagnosis.
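In PyTorch this typically amounts to loading a pre-trained model, freezing its parameters, and replacing the final classification layer for the new task. A minimal sketch (with a small stand-in network instead of a real ImageNet backbone, so no weights need to be downloaded; layer sizes are illustrative):

```python
import torch.nn as nn

# stand-in for a pre-trained backbone; in practice this would be something like
# torchvision.models.resnet34(pretrained=True)
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 1000),  # original 1000-way ImageNet classifier head
)

# freeze every pre-trained parameter
for p in model.parameters():
    p.requires_grad = False

# replace the head with a fresh 2-way classifier (normal vs. abnormal with TB);
# newly constructed layers are trainable by default
model[-1] = nn.Linear(16, 2)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the new head's weight and bias remain trainable
```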
5.2 Modifications of Advanced CNN Models
5.2.1 Modifications on CNN Architectures
To further boost the performance of the pre-trained models and better utilize the developed internal
parameters for TB diagnosis, some slight modifications are made on the last few layers of the
original advanced CNN models.
Figure 5-1 presents the diagram of the modifications that are made on the original CNN
architecture.
Figure 5-1: Modifications on CNN architecture
First, the final pooling layer is changed from the default max or average
pooling into a parallel connection of adaptive max and average pooling, a pooling
method provided by PyTorch that automatically controls the output size based
on the input parameters. Collecting both maximized and averaged feature maps preserves
more of the high-level information learned from the task dataset and thus provides more useful
and comprehensive details for the final prediction. After adaptive pooling, two fully connected
layers are added before the final output, forming a deeper classifier that better captures and
organizes the high-level information and improves overall performance. Moreover, batch
normalization [61] is applied in the fully connected layers to improve efficiency and
reduce the internal covariate shift of the activation values in the feature maps, so that the
distribution of the activations remains stable during training. Dropout is also added so that a
certain fraction of neurons in each layer, along with all their incoming and
outgoing connections, is randomly dropped during training, which helps avoid the overfitting
that this deeper, more complicated structure might otherwise cause.
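Taken together, the replacement head could look like the following PyTorch sketch (layer sizes, dropout rates, and the 512-channel input are illustrative assumptions, not the thesis's exact configuration):

```python
import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    """Parallel adaptive max and average pooling, concatenated channel-wise."""
    def __init__(self):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        return torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)

n_feat = 512  # channels leaving the CNN's main part (illustrative)
head = nn.Sequential(
    AdaptiveConcatPool2d(),      # doubles the channels: 512 -> 1024
    nn.Flatten(),
    nn.BatchNorm1d(2 * n_feat),
    nn.Dropout(0.25),
    nn.Linear(2 * n_feat, 256),  # fully connected layer I
    nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Dropout(0.5),
    nn.Linear(256, 2),           # fully connected layer II -> normal / TB
)

x = torch.randn(4, n_feat, 7, 7)  # a dummy batch of feature maps
print(head(x).shape)
```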
5.2.2 Model Division with Different Learning Rates
Model division with different learning rates divides the layers of a CNN model into
several groups (three in our experiment) and assigns each group a different learning rate
for training. The general idea is illustrated in Figure 5-2.
Figure 5-2: CNN division with different learning rates
The first few layers of a CNN model mainly extract generic features (edges, shapes,
textures, etc.) that identify the basic information present in every image, and thus need very
little training during the transfer learning process. Layers in the middle part learn
more complex features and details specific to the dataset on which the model is trained.
Knowledge learned at this stage has a direct impact on the result of the specific task
related to the training dataset, so these layers are trained with a slightly higher learning rate. The
last few layers combine all the features extracted by the previous layers to recognize the target
objects (animals, vehicles, tumors, etc.) and generate the final result. Since these layers correlate
most strongly with the target training dataset and will trade the previous feature maps for
representations more aligned with the target task, this is where the bulk of the training should
happen. This part of the CNN model is therefore given the highest learning rate of the three.
In general, during transfer learning, model division with different learning rates helps a CNN
model adapt to the target problem: since it has already learned general features, it can
perform well on the new problem both effectively and efficiently. How much smaller the
learning rates for the first and middle parts of the model should be depends on the correlation
between the data the model was pre-trained on and the target data.
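In PyTorch, this division maps naturally onto optimizer parameter groups. A sketch, assuming the model has already been split into three layer groups (the stand-in layers and the 1e-3/25 and 1e-3/5 ratios mirror the schedule used later in the experiment):

```python
import torch
import torch.nn as nn

# three stand-in layer groups: early, middle, and head layers of a divided CNN
group1 = nn.Conv2d(3, 8, 3)
group2 = nn.Conv2d(8, 16, 3)
group3 = nn.Linear(16, 2)

base_lr = 1e-3
optimizer = torch.optim.SGD(
    [
        {"params": group1.parameters(), "lr": base_lr / 25},  # generic features: smallest lr
        {"params": group2.parameters(), "lr": base_lr / 5},   # mid-level features
        {"params": group3.parameters(), "lr": base_lr},       # task-specific head: largest lr
    ],
    lr=base_lr,  # default, overridden by every group above
)

print([g["lr"] for g in optimizer.param_groups])
```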
5.3 Fine-Tuning the CNN Model via Artificial Bee Colony Algorithm
5.3.1 Artificial Bee Colony
Artificial Bee Colony (ABC) is a metaheuristic algorithm proposed by Karaboga in [62]
in 2005. Inspired by the foraging behaviour of honeybees, it abstracts that behaviour into a
mathematical model for solving multidimensional optimization problems [63]. The algorithm
represents solutions in the multidimensional search space as food sources (nectar) and maintains a
population of three types of bees (scout, employed, and onlooker) that search for the best food source
[64].
In the early stage of food collecting, scout bees go out to find food sources, either exploring
with prior knowledge or searching randomly. A scout bee turns into an employed bee
after finishing its search. Employed bees are mainly in charge of locating the food source
and collecting nectar back to the hive. After that, depending on the situation, each employed bee
chooses among continuing to collect nectar, dancing to recruit more peers, or abandoning
the current food source and changing its role to scout or onlooker. Onlooker bees
decide whether to participate in nectar collection based on the dances performed by
the employed bees.
Assume that the problem domain has a dimension of D. The position of a food source 𝑖, which is
the parameter vector that needs to be optimized, is represented as:
𝑋𝑖 = [𝑥𝑖1, 𝑥𝑖2, … , 𝑥𝑖𝐷], 𝑥𝑖𝑑 ∊ (𝐿𝑑 , 𝑈𝑑)
where 𝐿𝑑 and 𝑈𝑑 represent the lower and upper limits of the search space respectively, and 𝑑 =
1, 2, 3, … , 𝐷 indicates the dimension index.
The location of the food source will be initialized according to the following equation:
𝑥𝑖𝑑 = 𝐿𝑑 + 𝑟𝑎𝑛𝑑(0,1)(𝑈𝑑 − 𝐿𝑑)
During the searching step, nearby food locations, 𝑉𝑖 = [𝑣𝑖1, 𝑣𝑖2, … , 𝑣𝑖𝐷], are generated around
the existing food source through the employed bees' exploration via:
𝑣𝑖𝑑 = 𝑥𝑖𝑑 + 𝜑(𝑥𝑖𝑑 − 𝑥𝑗𝑑), 𝑗 ≠ 𝑖
where 𝜑 is a uniformly distributed random number in the interval [0, 1].
A fitness value is then obtained by a non-linear transformation of the target
function 𝑓𝑖(𝑋𝑖), as follows, to decide whether to replace the original food source 𝑋𝑖 with the
newly explored location 𝑉𝑖 through greedy selection:
𝑓𝑖𝑡𝑖(𝑋𝑖) = 1 / (1 + 𝑓𝑖(𝑋𝑖)),   if 𝑓𝑖(𝑋𝑖) ≥ 0
𝑓𝑖𝑡𝑖(𝑋𝑖) = 1 + 𝑎𝑏𝑠(𝑓𝑖(𝑋𝑖)),   if 𝑓𝑖(𝑋𝑖) < 0
After that, the employed bees fly back to the hive and share information about their food sources.
Onlooker bees then decide whether to follow an employed bee and collect food based on the
calculated probability:
𝑝𝑖 = 𝑓𝑖𝑡𝑖 / (𝑓𝑖𝑡1 + 𝑓𝑖𝑡2 + ⋯ + 𝑓𝑖𝑡𝑁𝑃)
where 𝑁𝑃 indicates the total number of discovered food sources.
The selection of an employed bee is executed via roulette-wheel selection: a random number between 0
and 1 is generated, and if it is less than 𝑝𝑖, the onlooker bee follows the employed bee,
repeating its job of generating a new food source around 𝑋𝑖 and then proceeding with the greedy
selection.
If the number of rounds in which a food source has been mined without improvement reaches the
threshold after 𝑡 iterations, the food source is abandoned. The corresponding employed bee then
turns into a scout bee and continues exploring new food sources to replace the old one.
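The loop described above can be condensed into a short sketch, here minimizing a simple sphere function standing in for a real objective (colony size, abandonment limit, and iteration count are arbitrary; the perturbation factor follows the common symmetric [-1, 1] convention):

```python
import random

def abc_minimize(f, dim, bounds, n_food=10, limit=20, iters=200, seed=0):
    """Artificial Bee Colony sketch minimizing f over a box-bounded search space."""
    rng = random.Random(seed)
    lo, hi = bounds
    init = lambda: [lo + rng.random() * (hi - lo) for _ in range(dim)]
    foods = [init() for _ in range(n_food)]  # one food source per employed bee
    trials = [0] * n_food                    # rounds without improvement

    def fitness(x):
        v = f(x)
        return 1.0 / (1.0 + v) if v >= 0 else 1.0 + abs(v)

    def neighbour(i):
        """Perturb one dimension of food i toward/away from a random peer j."""
        j = rng.choice([k for k in range(n_food) if k != i])
        d = rng.randrange(dim)
        v = foods[i][:]
        v[d] = min(hi, max(lo, v[d] + rng.uniform(-1, 1) * (v[d] - foods[j][d])))
        return v

    def greedy(i, v):
        if fitness(v) > fitness(foods[i]):
            foods[i], trials[i] = v, 0
        else:
            trials[i] += 1

    for _ in range(iters):
        for i in range(n_food):                      # employed bee phase
            greedy(i, neighbour(i))
        fits = [fitness(x) for x in foods]
        total = sum(fits)
        for i in range(n_food):                      # onlooker bee phase (roulette)
            if rng.random() < fits[i] / total:
                greedy(i, neighbour(i))
        for i in range(n_food):                      # scout bee phase
            if trials[i] >= limit:
                foods[i], trials[i] = init(), 0
    return min(foods, key=f)

best = abc_minimize(lambda x: sum(v * v for v in x), dim=3, bounds=(-5.0, 5.0))
print(best)  # should be near the origin after convergence
```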
Figure 5-3 presents a flowchart of how the ABC algorithm works to solve optimization problems.
Figure 5-3: Flowchart of artificial bee colony algorithm
5.3.2 CNN Fine-Tuning via Artificial Bee Colony
Considering the ABC algorithm's advantage in finding a globally optimal solution, it has been
used in our experiment to fine-tune the fully connected layers of the trained CNN models on the
CXR datasets to improve their general performance.
The whole fine-tuning process can be regarded as searching for parameters that
further minimize the total loss of the CNN model. Starting with randomly generated solutions,
ABC keeps looking for better solutions by searching the regions near the current best solution
and abandoning poor solutions.
In the beginning, a solution vector containing a specific number of candidate solutions is created.
To make full use of the previous training results, the first element of the solution vector is
initialized with the weights and biases obtained from the trained CNN model. The remaining elements
are generated near the obtained weights and biases within the given space by multiplying the first
solution by a random number between 0 and 1:
𝑠𝑜𝑙_𝑣𝑒𝑐 = (𝑤(𝑡)1, 𝑤(𝑡)2, … , 𝑤(𝑡)𝑛)
𝑤(0)1 = (𝑛𝑛.𝑊, 𝑛𝑛. 𝑏)
𝑤(0)𝑖 = 𝑟𝑎𝑛𝑑(0, 1)𝑤(0)1, 𝑖 = 2,3, … , 𝑛
where 𝑡 represents the total number of iterations needed during the whole fine-tuning process.
Generating multiple solutions not only takes advantage of the parameters from the
trained model but also prevents the optimization from falling into local optima.
The search for nearby solutions then starts from the initialized vectors:
𝑔𝑒𝑛_𝑣𝑒𝑐 = (𝑣(𝑡)1, 𝑣(𝑡)2, … , 𝑣(𝑡)𝑛)
𝑣(𝑘)𝑖 = 𝑤(𝑘 − 1)1 + 𝛷𝑖(𝑤(𝑘 − 1)𝑖 − 𝑤(𝑘 − 1)𝑗), 𝑖 ≠ 𝑗
where 𝑘 represents the 𝑘-th iteration of the optimization process.
Once a new solution is found near an initialized one, a fitness value measuring the quality
of solutions is computed to compare the old and new solutions, according to the
following equation:
𝑓𝑖𝑡(𝑤(𝑘)𝑖) = 1 / (1 + 𝐸(𝑤(𝑘)𝑖))
𝐸(𝑤(𝑘)𝑖) in the above equation is the loss function of the CNN model, the target function
to be optimized, and it always yields a non-negative value. The loss function used in
our experiment is the cross-entropy loss:
𝐸(𝑤(𝑘)𝑖) = −(1/𝑛) ∑𝑖=1..𝑛 [𝑦𝑖 ln(𝑜(𝑘)𝑖) + (1 − 𝑦𝑖) ln(1 − 𝑜(𝑘)𝑖)]
where 𝑦𝑖 represents the expected output of the 𝑖-th sample within the training batch and 𝑜(𝑘)𝑖 is
the actual output of this sample from the 𝑘-th iteration.
The selection of a better solution proceeds based on the probability calculated from the
fitness values:
𝑝(𝑘)𝑖 = 𝑓𝑖𝑡(𝑤(𝑘)𝑖) / ∑𝑖 𝑓𝑖𝑡(𝑤(𝑘)𝑖)
According to the above equations, the smaller the loss of a generated solution, the larger its
fitness value, and the greater its probability of being selected as the final solution.
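The fitness and selection-probability computations can be checked with a few example loss values (hypothetical numbers, purely illustrative):

```python
# hypothetical cross-entropy losses E(w(k)_i) of four candidate solutions
losses = [0.9, 0.4, 0.1, 2.0]

# fitness: fit(w) = 1 / (1 + E(w)); cross-entropy is non-negative, so this branch suffices
fits = [1.0 / (1.0 + e) for e in losses]

# selection probability: p_i = fit_i / sum of all fitness values
total = sum(fits)
probs = [f / total for f in fits]

print([round(p, 3) for p in probs])
```

As expected, the candidate with the smallest loss (0.1) ends up with the largest selection probability.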
Figure 5-4 presents a flowchart of how the ABC algorithm works on fine-tuning a trained CNN
model.
Figure 5-4: Fine-tune the trained CNN model via artificial bee colony algorithm
5.4 Experiment Settings
5.4.1 Experiment Descriptions
In our experiment, binary classification of CXR images for TB diagnosis, which identifies lung
abnormalities, is performed by six different CNN models (VGG16, VGG19, Inception V3,
ResNet34, ResNet50 and ResNet101) on three open-source CXR datasets (the Montgomery County
Chest X-Ray dataset, the Shenzhen Hospital Chest X-Ray dataset and the NIH Chest X-Ray8 dataset)
respectively.
For each CNN model and each dataset, classification involves three stages:
training the original CNN architecture; training the modified CNN architecture with
differential learning rates; and fine-tuning the trained modified CNN model via ABC.
In the first stage, all parameters of the CNN model with the original architecture are
frozen except for the last layer, and that last layer is trained on the target CXR dataset for 3
epochs with a learning rate of 1e-3. Since the CNN models used in our experiment were all
pre-trained on the ImageNet dataset to classify everyday objects into 1000 categories,
the features learned in earlier layers must, according to transfer learning, transition
from general to specific by the last layer, which directly influences the final output;
training the last layer on the new dataset for a few epochs at the start therefore reduces the time
the model needs to converge on the new task. Then, with the last layer precomputed, all
parameters of the CNN model are unfrozen and the entire model is trained on the target
dataset for 12 epochs with a learning rate of 1e-4. During the second stage, the last few
layers of the original CNN model are modified according to 5.2.1. These modified layers are first
trained on the target CXR dataset for 3 epochs with a learning rate of 1e-3. After the first
step of training, the modified CNN model is divided into 3 parts, all parameters are unfrozen,
and the entire model is trained on the target dataset for 12 epochs with a different learning rate
(1e-3/25, 1e-3/5 and 1e-3) assigned to each part.
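This staged freeze/unfreeze schedule can be sketched as follows (a toy two-layer model and random tensors stand in for the real CNN and CXR data, and the epoch counts are shortened; only illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(             # stand-in for a pre-trained CNN
    nn.Linear(32, 16), nn.ReLU(),  # earlier (pre-trained) layers
    nn.Linear(16, 2),              # final / modified head
)
x, y = torch.randn(8, 32), torch.randint(0, 2, (8,))
loss_fn = nn.CrossEntropyLoss()

# stage 1a: freeze everything but the head, train the head at lr = 1e-3
for p in model[:2].parameters():
    p.requires_grad = False
opt = torch.optim.SGD(model[2].parameters(), lr=1e-3)
for _ in range(3):  # "3 epochs"
    opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()

# stage 1b/2: unfreeze all parameters and train the whole model,
# here with discriminative learning rates per layer group
for p in model.parameters():
    p.requires_grad = True
opt = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-3 / 25},  # early layers
        {"params": model[2].parameters(), "lr": 1e-3},       # head
    ],
    lr=1e-3,
)
for _ in range(2):  # shortened stand-in for 12 epochs
    opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
print("training schedule complete")
```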
In the above two stages, throughout the training process with its fixed number of epochs,
the model that achieves the highest classification accuracy on the validation set is saved for
further evaluation. In the last stage, the fully connected layers of the trained CNN models with
modified structures are fine-tuned via the ABC algorithm to further improve the models'
overall performance.
Moreover, multi-class classification is performed following the same procedure for the
further diagnosis of specific TB manifestations (consolidation, effusion, fibrosis, infiltration, mass,
nodule and pleural thickening) from the CXR images in the NIH Chest X-Ray8 Dataset.
5.4.2 Ratio Comparison
The ratio comparison method is used in our experiment to compare the performance of different
CNN models on each dataset.
The training process of a CNN model constantly adjusts parameters according to the
training set and evaluates performance on the validation set, both to estimate how the model
might perform on unseen data and to decide whether to continue tuning the parameters. A testing
set independent of both is therefore necessary to measure the performance of a trained
model in an unbiased way. For each classification task, a certain number of CXR images
from the dataset is set aside for testing the final performance of the trained CNN models, and the
rest is split into train/valid sets with ratios of 90%-10%, 80%-20% and 70%-30% respectively.
It is worth noting that the images reserved for testing in each dataset remain the same
regardless of the train/valid distribution, allowing parallel comparison of the trained CNN
models on the same dataset.
5.4.3 Parameter Settings
Our experiment is implemented in Python 3.5 with two deep learning libraries, PyTorch and
FastAI. The whole process runs on the Windows 10 operating system with the following hardware
deployment:
Table 5-1: Hardware Deployments
CPU: Intel Xeon E5-2623 V4, 2.60 GHz base frequency, 8 cores
GPU: NVidia Quadro P4000, 8 GB GDDR5 memory, 1792 CUDA cores
RAM: 30 GB
SSD: 250 GB
The detailed separation of CXRs with different train/valid ratios within each dataset for the basic
abnormality detection is given in Table 5-2.
Table 5-2: Image Separations for Abnormality Diagnosis

CXR Dataset | Class | Train/Valid = 9:1 | Train/Valid = 8:2 | Train/Valid = 7:3 | Testing Set
Montgomery County CXR Dataset | Normal | 65 / 10 | 60 / 15 | 52 / 23 | 5
Montgomery County CXR Dataset | Abnormal with TB | 50 / 5 | 45 / 10 | 38 / 17 | 3
Shenzhen Hospital CXR Dataset | Normal | 270 / 30 | 240 / 60 | 210 / 90 | 6
Shenzhen Hospital CXR Dataset | Abnormal with TB | 285 / 35 | 250 / 70 | 225 / 95 | 16
Chest X-Ray8 Dataset | Normal | 29250 / 3250 | 26000 / 6500 | 22750 / 9750 | 1831
Chest X-Ray8 Dataset | Abnormal with TB | 18760 / 2090 | 16680 / 4170 | 14590 / 6260 | 464

(Each ratio cell lists the training / validation counts; the testing set is shared across all three train/valid ratios.)
The original distribution of CXRs in the NIH Chest X-Ray8 Dataset for the diagnosis of specific
TB manifestations is given in Table 5-3.
Table 5-3: Original CXR Distribution with Specific TB Manifestations in Chest X-Ray8

Manifestation | Consolidation | Effusion | Fibrosis | Infiltration | Mass | Nodule | Pleural Thickening
Images | 324 | 2035 | 641 | 5133 | 1313 | 1888 | 851
Since the distribution of CXRs across the TB manifestation classes is strongly imbalanced,
models trained on this dataset would show a strong bias in their predictions. Therefore,
data augmentation has been implemented to increase the number of images in the under-represented
classes and thereby create an evenly distributed dataset that eliminates this interference.
In our experiment, common augmentation methods such as horizontal flipping, rotation, contrast
adjustment and translation are used to generate new images. After data augmentation, the
distribution of CXRs in the NIH Chest X-Ray8 Dataset for TB manifestation diagnosis is as presented
in Table 5-4. It is worth noting that images in the testing set have not been augmented, to ensure
the quality of the results when testing the models' transfer-learning performance.
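The augmentation operations mentioned can be sketched in NumPy (simplified versions; rotation is omitted here for brevity, and the probabilities and ranges are illustrative assumptions, not the thesis's actual pipeline):

```python
import numpy as np

def augment(img, rng):
    """Randomly apply horizontal flip, contrast adjustment, and translation."""
    out = img.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                # horizontal flip
    if rng.random() < 0.5:
        factor = rng.uniform(0.8, 1.2)      # contrast adjustment around the mean
        out = np.clip(out.mean() + factor * (out - out.mean()), 0, 255)
    if rng.random() < 0.5:
        shift = int(rng.integers(-4, 5))    # small horizontal position translation
        out = np.roll(out, shift, axis=1)
    return out

rng = np.random.default_rng(0)
cxr = rng.integers(0, 256, size=(64, 64)).astype(float)  # dummy grayscale CXR
augmented = [augment(cxr, rng) for _ in range(5)]
print(len(augmented), augmented[0].shape)
```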
Table 5-4: Augmented CXR Distribution in Chest X-Ray8 for TB Manifestations Diagnosis

TB Manifestation | Train/Valid = 9:1 | Train/Valid = 8:2 | Train/Valid = 7:3 | Testing Set
Consolidation | 4460 / 500 | 3970 / 990 | 3470 / 1490 | 14
Effusion | 4500 / 500 | 4000 / 1000 | 3500 / 1500 | 85
Fibrosis | 4460 / 500 | 3970 / 990 | 3470 / 1490 | 21
Infiltration | 4500 / 500 | 4000 / 1000 | 3500 / 1500 | 133
Mass | 4500 / 500 | 4000 / 1000 | 3500 / 1500 | 63
Nodule | 4500 / 500 | 4000 / 1000 | 3500 / 1500 | 88
Pleural Thickening | 4500 / 500 | 3980 / 1000 | 3480 / 1500 | 21

(Each ratio cell lists the training / validation counts; the testing set is shared across all three train/valid ratios.)
5.5 Results and Discussion
5.5.1 CNN Modification and Division with Different Learning Rates
We modified the original CNN models by changing the structure of the last
few layers, separating each model into 3 parts, and assigning each part a different learning rate
during training to improve the models' performance on the provided CXR datasets.
The validation accuracies achieved at each training epoch by the six CNN models introduced in
Chapter 4 and their modified versions, on the three public CXR datasets with
different train/valid ratios, are given below for TB diagnosis. The accuracy averaged over all six
CNN models is also calculated at each stage to compare and analyze the overall
trend on the same dataset across train/valid ratios.
Table 5-5, Table 5-6 and Table 5-7 present the validation accuracy on the Montgomery County CXR
Dataset, the smallest of the three datasets used in our experiment. Among the raw
models, VGG19 and ResNet50 show the best and most stable performance, achieving 90%,
88% and 93.33% validation accuracy with train/valid ratios of 7:3, 8:2 and 9:1
respectively. In their modified versions, VGG19 and ResNet50 maintain this stability and
further improve accuracy, reaching 97.5% (VGG19) and 95% (ResNet50) at 7:3, 96% at 8:2, and
100% at 9:1 for both models. The highest accuracies achieved by
the other raw and modified CNN models are less than or equal to those of VGG19 and
ResNet50, and their performance varies more with the train/valid ratio. Throughout the
training process, the modified CNN models tend to perform better, with an obvious
epoch-by-epoch increase in accuracy compared to the raw models.
Figure 5-5 illustrates the overall TB-diagnosis performance of both the raw and modified CNN
models on the Montgomery County CXR Dataset at train/valid ratios of 7:3, 8:2 and 9:1,
using the accuracies averaged over all six models. The line chart shows a clear
accuracy improvement by the modified CNN models over the raw models in all train/valid
ratio cases. Moreover, both raw and modified models achieve their
best accuracy at a 9:1 ratio. For the 7:3 and 8:2 ratios, the nearly overlapping trend lines
indicate similar performance in both the raw and modified cases, though models trained at
8:2 perform slightly better than those at 7:3 in general.
Table 5-5: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=7:3 on Montgomery County
CXR Dataset for Abnormality Detection
CNN
Model
Train/Valid = 7:3
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 65.00 70.00 67.50 72.50 75.00 80.00 82.50 77.50 82.50 80.00 82.50 77.50
VGG19 67.50 65.00 72.50 70.00 75.00 82.50 80.00 82.50 85.00 90.00 87.50 85.00
Incep V3 65.00 67.50 72.50 72.50 77.50 75.00 80.00 85.00 82.50 85.00 80.00 82.50
ResNet34 72.50 70.00 75.00 80.00 80.00 85.00 77.50 85.00 87.50 82.50 87.50 85.00
ResNet50 70.00 75.00 80.00 77.50 85.00 82.50 82.50 80.00 87.50 85.00 90.00 82.50
ResNet101 70.00 75.00 77.50 72.50 80.00 82.50 80.00 87.50 85.00 80.00 87.50 85.00
Average 68.30 70.42 74.17 74.17 78.75 81.25 80.42 82.92 85.00 83.75 85.80 82.92
Modified
VGG16 70.00 75.00 82.50 87.50 82.50 85.00 82.50 90.00 85.00 87.50 92.50 85.00
VGG19 77.50 85.00 87.50 82.50 77.50 82.50 90.00 92.50 97.50 92.50 95.00 95.00
Incep V3 75.00 72.50 80.00 75.00 80.00 82.50 82.50 85.00 87.50 85.00 90.00 87.50
ResNet34 75.00 82.50 87.50 92.50 90.00 95.00 92.50 87.50 90.00 82.50 85.00 87.50
ResNet50 72.50 72.50 77.50 80.00 87.50 90.00 87.50 85.00 90.00 95.00 90.00 92.50
ResNet101 75.00 80.00 77.50 80.00 82.50 87.50 90.00 82.50 85.00 92.50 95.00 90.00
Average 74.17 77.92 82.08 82.92 83.33 87.08 87.50 87.08 89.17 89.17 91.25 89.58
Table 5-6: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=8:2 on Montgomery County
CXR Dataset for Abnormality Detection
CNN
Model
Train/Valid = 8:2
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 68.00 76.00 84.00 80.00 80.00 76.00 80.00 76.00 84.00 84.00 80.00 80.00
VGG19 72.00 68.00 76.00 80.00 84.00 80.00 84.00 80.00 88.00 84.00 80.00 84.00
Incep V3 68.00 72.00 80.00 76.00 72.00 80.00 84.00 80.00 80.00 80.00 84.00 84.00
ResNet34 76.00 80.00 72.00 76.00 80.00 84.00 76.00 88.00 84.00 80.00 88.00 84.00
ResNet50 72.00 76.00 80.00 76.00 84.00 84.00 80.00 80.00 84.00 88.00 84.00 88.00
ResNet101 76.00 72.00 80.00 84.00 80.00 88.00 80.00 84.00 88.00 80.00 88.00 84.00
Average 72.00 74.00 78.67 78.67 80.00 82.00 80.67 81.33 84.67 82.67 84.00 84.00
Modified
VGG16 76.00 80.00 76.00 84.00 88.00 84.00 80.00 88.00 84.00 88.00 92.00 88.00
VGG19 76.00 88.00 92.00 88.00 88.00 96.00 92.00 88.00 88.00 92.00 96.00 96.00
Incep V3 72.00 76.00 76.00 80.00 80.00 84.00 88.00 92.00 92.00 84.00 84.00 88.00
ResNet34 80.00 88.00 88.00 88.00 84.00 80.00 84.00 88.00 92.00 84.00 88.00 92.00
ResNet50 84.00 76.00 80.00 84.00 84.00 88.00 88.00 76.00 92.00 92.00 96.00 92.00
ResNet101 80.00 84.00 76.00 88.00 92.00 88.00 84.00 88.00 84.00 92.00 96.00 88.00
Average 78.00 82.00 81.33 85.33 86.00 86.67 86.00 86.67 88.67 88.67 92.00 90.67
Table 5-7: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=9:1 on Montgomery County
CXR Dataset for Abnormality Detection
CNN
Model
Train/Valid = 9:1
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 60.00 73.33 73.33 80.00 73.33 80.00 80.00 73.33 86.67 80.00 86.67 80.00
VGG19 66.67 73.33 73.33 80.00 86.67 93.33 80.00 86.67 86.67 93.33 80.00 86.67
Incep V3 60.00 66.67 66.67 73.33 80.00 80.00 86.67 86.67 73.33 80.00 86.67 80.00
ResNet34 73.33 66.67 73.33 80.00 86.67 86.67 73.33 80.00 80.00 86.67 80.00 86.67
ResNet50 66.67 80.00 73.33 86.67 80.00 86.67 73.33 80.00 93.33 86.67 93.33 86.67
ResNet101 73.33 66.67 80.00 86.67 73.33 80.00 86.67 93.33 86.67 93.33 80.00 86.67
Average 66.67 71.10 73.30 81.11 80.00 84.45 80.00 83.33 84.45 86.67 84.45 84.45
Modified
VGG16 86.67 86.67 80.00 93.33 93.33 93.33 86.67 86.67 80.00 86.67 93.33 86.67
VGG19 93.33 86.67 93.33 80.00 93.33 86.67 80.00 93.33 93.33 100 86.67 100
Incep V3 73.33 66.67 80.00 73.33 86.67 80.00 93.33 86.67 86.67 80.00 93.33 86.67
ResNet34 80.00 86.67 93.33 93.33 80.00 86.67 86.67 86.67 93.33 93.33 93.33 93.33
ResNet50 86.67 80 86.67 86.67 93.33 93.33 100 93.33 86.67 93.33 93.33 100
ResNet101 86.67 93.33 86.67 80 93.33 100 86.67 93.33 100 86.67 86.67 93.33
Average 84.45 83.34 86.67 84.44 90.00 90.00 88.89 90.00 90.00 90.00 91.11 93.33
Figure 5-5: Averaged Accuracy Comparison on Montgomery County CXR Dataset for Abnormality Detection
Table 5-8, Table 5-9 and Table 5-10 give the validation accuracy for TB diagnosis by the different
CNN models on the Shenzhen Hospital CXR Dataset. In training both the raw and modified
CNN models, the residual CNN series, ResNet34, ResNet50 and ResNet101, delivered stable,
outstanding performance, achieving the top three validation accuracies at train/valid
ratios of 7:3, 8:2 and 9:1 respectively. Moreover, the modified CNN models again tend to reach
generally higher accuracy than the raw models.
The line chart in Figure 5-6 shows that on the Shenzhen Hospital CXR Dataset, the modified
models generally outperform the raw models in TB diagnosis across all three train/valid ratio
cases. Furthermore, for both raw and modified models, the overall validation accuracy
at a 9:1 ratio is better than at 8:2, which in turn is better than at 7:3
in general.
Table 5-8: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=7:3 on Shenzhen Hospital CXR
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 7:3
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 61.62 70.27 78.38 82.70 83.78 81.62 85.95 87.03 84.86 87.03 85.41 85.41
VGG19 70.81 76.76 80.54 84.32 82.16 83.78 80.54 86.49 85.41 86.49 87.03 85.41
Incep V3 74.05 71.35 81.08 80.00 83.24 84.32 84.86 87.03 86.49 87.03 84.32 85.95
ResNet34 77.30 76.22 80 82.16 85.95 87.57 85.41 88.11 87.03 87.57 86.49 84.32
ResNet50 74.59 78.38 81.62 83.78 86.49 81.08 84.32 83.78 87.03 85.95 87.03 86.49
ResNet101 74.59 80 82.70 83.24 85.41 80 86.49 84.32 88.11 87.57 86.49 87.03
Average 72.16 75.50 80.72 82.70 84.51 83.00 84.60 86.13 86.49 86.94 86.13 85.77
Modified
VGG16 77.84 80.00 82.70 89.73 87.57 88.11 83.24 85.41 87.03 84.32 88.11 87.03
VGG19 79.46 82.70 85.41 82.16 88.11 89.73 87.03 90.27 91.35 90.81 91.89 91.89
Incep V3 77.84 82.70 80.00 83.24 88.65 87.57 89.73 91.35 90.27 91.35 89.73 89.73
ResNet34 80.00 76.76 86.49 83.78 87.03 90.81 91.89 90.81 91.35 88.65 90.27 90.81
ResNet50 81.62 85.41 87.57 88.11 87.03 77.84 90.81 90.27 91.35 89.73 89.19 90.27
ResNet101 80.00 82.16 83.24 83.78 86.49 84.86 88.65 91.89 90.81 91.35 89.73 90.81
Average 79.46 81.62 84.24 85.13 87.48 86.49 88.56 90.00 90.36 89.37 89.82 90.09
Table 5-9: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=8:2 on Shenzhen Hospital CXR
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 8:2
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 75.38 80.00 79.23 81.54 83.08 86.15 85.38 88.46 86.92 89.23 87.69 86.92
VGG19 78.46 80.77 80.00 83.85 85.38 87.69 89.23 88.46 86.92 89.23 87.69 87.69
Incep V3 75.38 82.31 83.08 80.00 84.62 86.92 88.46 90.00 88.46 90.77 90.00 87.69
ResNet34 80.77 82.31 81.54 84.62 86.15 88.46 90.00 89.23 91.54 88.46 91.54 90.77
ResNet50 80.00 81.54 83.85 86.92 83.85 87.69 86.92 90.00 90.77 86.92 90.00 90.77
ResNet101 79.23 81.54 82.31 83.85 87.69 84.62 88.46 90.77 89.23 91.54 90.77 91.54
Average 78.20 81.41 81.67 83.46 85.13 86.92 88.08 89.49 88.97 89.36 89.62 89.23
Modified
VGG16 81.54 82.31 83.08 84.62 85.38 90.00 89.23 91.54 88.46 92.31 90.77 90.00
VGG19 82.31 83.08 85.38 87.69 84.62 88.46 91.54 89.23 92.31 87.69 90.77 91.54
Incep V3 83.85 86.15 88.46 87.69 85.38 89.23 90.77 92.31 92.31 90.00 91.54 89.23
ResNet34 81.54 86.15 84.62 87.69 90.77 93.85 92.31 94.62 95.38 89.23 91.54 93.08
ResNet50 80.77 85.38 86.92 88.46 88.46 91.54 90.00 93.85 92.31 94.62 89.23 90.77
ResNet101 84.62 88.46 83.85 89.23 91.54 93.08 93.85 96.15 94.62 95.38 91.54 92.31
Average 82.44 85.26 85.39 87.56 87.69 91.03 91.28 92.95 92.57 91.54 90.90 91.16
Table 5-10: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=9:1 on Shenzhen Hospital CXR
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 9:1
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 76.92 78.46 83.08 86.15 83.08 87.69 87.69 86.15 84.62 89.23 90.77 87.69
VGG19 79.23 81.54 82.31 86.15 87.69 84.62 89.23 90.77 87.69 86.15 89.23 89.23
Incep V3 78.46 81.54 84.62 83.08 86.15 87.69 89.23 90.77 89.23 90.77 87.69 89.23
ResNet34 81.54 83.08 84.62 82.31 87.69 86.15 89.23 90.77 90.77 92.31 89.23 90.77
ResNet50 80.77 83.08 86.15 84.62 84.62 87.69 89.23 90.77 89.23 92.31 90.77 92.31
ResNet101 81.54 80.00 84.62 86.15 87.69 89.23 88.46 92.31 90.77 93.85 92.31 90.77
Average 79.74 81.28 84.23 84.74 86.15 87.18 88.85 90.26 88.72 90.77 90.00 90.00
Modified
VGG16 82.31 83.85 84.62 88.46 86.15 87.69 89.23 89.23 90.77 90.77 92.31 90.77
VGG19 83.08 86.15 84.62 89.23 88.46 87.69 90.77 93.85 92.31 89.23 91.54 92.31
Incep V3 83.08 85.38 87.69 86.15 87.69 88.46 89.23 90.77 91.54 93.08 91.54 92.31
ResNet34 83.08 87.69 89.23 90.00 88.46 91.54 93.85 95.38 92.31 92.31 95.38 93.85
ResNet50 84.62 86.15 87.69 89.23 91.54 93.85 88.46 92.31 95.38 93.85 90.77 92.31
ResNet101 83.08 86.15 87.69 90.77 89.23 92.31 93.85 90.77 95.38 96.92 92.31 95.38
Average 83.21 85.90 86.92 88.97 88.59 90.26 90.90 92.05 92.95 92.69 92.31 92.82
Figure 5-6: Averaged Accuracy Comparison on Shenzhen Hospital CXR Dataset for Abnormality Detection
Table 5-11, Table 5-12 and Table 5-13 give the validation accuracy for TB diagnosis via different
CNN models on the NIH Chest X-Ray8 Dataset with different train/valid ratios. During the training
of both the raw and modified CNN models, the validation accuracy increases with less oscillation
compared to the models trained on the previous two smaller datasets. In general, the residual CNN
series delivers relatively stable and outstanding accuracy compared to the other models in both
the raw and modified cases.
Figure 5-7 illustrates a parallel comparison of the averaged validation accuracy on the NIH Chest
X-Ray8 Dataset with different train/valid ratios in both the raw and modified cases. Each mildly
upward line in the figure indicates a smooth increase of the validation accuracy during training.
As with the previous two datasets, the modified models show a significant improvement in accuracy.
Moreover, the overall accuracy on the validation set is generally best with a train/valid ratio of
9:1, followed by 8:2 and then 7:3.
Table 5-11: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=7:3 on NIH Chest X-Ray8
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 7:3
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 72.50 74.77 75.33 76.38 76.98 77.26 77.61 78.93 79.17 78.20 79.18 80.31
VGG19 76.16 76.94 77.84 78.54 78.36 79.06 79.91 80.46 80.77 79.95 80.62 81.17
Incep V3 77.11 78.06 78.44 79.35 79.63 79.58 79.87 79.78 80.74 81.34 81.64 81.36
ResNet34 77.46 77.95 79.23 79.76 80.15 80.64 80.62 81.79 82.24 82.54 82.40 82.48
ResNet50 75.06 76.25 77.18 77.48 78.41 79.50 79.94 80.21 80.58 80.50 79.52 80.92
ResNet101 72.64 73.77 74.70 77.45 77.58 79.76 80.29 80.73 81.53 81.88 82.20 81.17
Average 75.16 76.29 77.12 78.16 78.52 79.30 79.71 80.32 80.84 80.74 80.93 81.24
Modified
VGG16 78.93 79.36 80.26 82.24 82.48 83.24 84.45 85.33 86.57 86.61 87.29 87.07
VGG19 85.33 86.75 87.14 85.95 84.88 86.99 86.57 87.29 87.17 87.68 86.33 87.01
Incep V3 82.48 83.53 84.43 85.72 86.12 86.85 86.71 87.46 87.20 87.98 87.98 87.17
ResNet34 83.75 86.85 87.24 88.00 86.84 87.41 87.25 87.98 87.66 88.19 87.95 88.06
ResNet50 84.64 84.92 85.73 86.57 86.99 86.50 87.91 87.87 88.03 87.76 87.95 87.17
ResNet101 87.09 85.61 87.58 86.34 84.68 87.88 88.44 86.91 86.18 87.65 87.54 88.23
Average 83.70 84.50 85.40 85.80 85.33 86.48 86.89 87.14 87.14 87.65 87.51 87.45
Table 5-12: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=8:2 on NIH Chest X-Ray8
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 8:2
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 78.80 79.98 80.05 80.86 80.62 81.01 81.47 81.79 82.49 82.91 83.09 83.06
VGG19 77.39 78.34 79.28 79.63 79.92 80.97 80.14 80.70 80.86 81.96 82.54 82.66
Incep V3 78.75 79.75 80.20 81.15 81.32 81.19 82.36 83.69 84.19 84.09 84.69 85.38
ResNet34 81.20 80.13 80.98 81.40 80.54 82.68 82.86 83.69 84.19 84.09 84.69 85.38
ResNet50 77.64 78.80 79.63 80.34 79.37 80.97 81.70 82.06 82.35 83.35 82.76 81.89
ResNet101 74.97 75.94 77.20 77.68 78.93 79.48 81.49 81.53 81.74 83.40 83.53 83.81
Average 78.13 78.82 79.56 80.18 80.12 81.05 81.67 82.24 82.64 83.30 83.55 83.70
Modified
VGG16 80.84 81.60 82.00 82.32 84.21 84.93 87.01 88.49 89.79 90.22 89.53 89.07
VGG19 81.38 81.90 82.38 83.79 85.07 85.87 87.17 88.08 89.48 90.38 89.73 90.28
Incep V3 82.16 83.74 84.25 84.87 86.57 87.53 88.54 88.40 89.24 90.14 91.09 90.62
ResNet34 82.49 85.67 87.28 88.05 89.35 88.85 89.60 89.75 90.63 91.31 90.82 91.26
ResNet50 82.16 83.04 83.60 84.78 87.82 88.37 88.83 89.86 90.39 90.23 90.53 89.71
ResNet101 83.86 84.70 86.86 87.72 88.81 89.87 89.35 90.15 89.74 90.31 90.87 90.15
Average 82.15 83.44 84.40 85.26 86.97 87.57 88.42 89.12 89.88 90.43 90.43 90.18
Table 5-13: Valid Accuracy for Each Epoch During Training with Train/Valid Ratio=9:1 on NIH Chest X-Ray8
Dataset for Abnormality Detection
CNN
Model
Train/Valid = 9:1
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 81.93 83.58 83.88 85.43 84.98 85.15 86.85 87.39 86.76 87.88 88.16 87.34
VGG19 81.17 82.79 83.50 84.57 84.79 84.98 85.99 85.94 86.37 87.36 87.34 86.61
Incep V3 81.01 82.85 82.79 83.87 84.10 85.19 86.18 86.31 86.20 88.24 88.54 88.07
ResNet34 82.25 83.22 84.63 84.79 85.19 86.05 86.74 86.24 86.10 86.29 87.32 88.07
ResNet50 79.87 81.87 82.68 85.02 86.03 87.32 87.58 88.50 88.63 88.80 89.78 89.08
ResNet101 77.47 79.44 81.24 82.58 82.90 85.58 80.82 85.13 88.35 88.99 88.71 88.48
Average 80.62 82.29 83.12 84.38 84.67 85.71 85.69 86.59 87.07 87.93 88.31 87.94
Modified
VGG16 84.81 88.56 88.07 90.51 90.75 91.52 91.39 90.39 92.02 93.54 93.69 92.15
VGG19 88.40 89.53 88.31 90.34 91.05 89.98 88.16 90.97 91.01 92.17 93.22 91.22
Incep V3 89.66 88.07 90.52 91.33 90.88 91.93 92.47 92.15 93.60 92.32 91.18 93.26
ResNet34 89.55 88.84 90.73 89.83 91.27 90.69 91.76 92.79 93.86 93.65 92.59 92.37
ResNet50 90.19 91.27 90.77 92.30 92.94 93.46 92.92 93.91 94.38 94.23 94.42 93.22
ResNet101 87.58 89.10 93.16 92.75 91.69 92.45 93.45 94.06 93.26 92.06 90.69 93.09
Average 88.37 89.23 90.26 91.18 91.43 91.67 91.69 92.38 93.02 93.00 92.63 92.55
Figure 5-7: Averaged Accuracy Comparison on NIH Chest X-Ray8 Dataset for Abnormality Detection
Table 5-14, Table 5-15 and Table 5-16 show the validation accuracy for the detection of specific
TB manifestations among seven TB-related diseases on the NIH Chest X-Ray8 Dataset with different
train/valid ratios. Among all CNN models, ResNet34 delivers outstanding performance in the
multi-class classification task in both the raw and modified conditions.
Figure 5-8 illustrates a parallel comparison of the averaged validation accuracies achieved by the
raw and modified models for the detection of TB-related diseases. The plot shows that both the
accuracy and its growth rate during training are clearly higher for the modified models than for
the raw models. In addition, accuracies achieved on the validation set are generally highest with
a train/valid ratio of 9:1, followed by 8:2 and then 7:3.
In general, the overall accuracies achieved by the CNN models with modified structures are
significantly higher than those achieved by the models in their original structures.
Table 5-14: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=7:3 on NIH Chest X-Ray8
Dataset for TB Related Disease Detection
CNN
Model
Train/Valid = 7:3
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 41.68 43.01 44.71 46.29 47.17 48.01 48.92 48.33 49.27 49.95 50.88 51.29
VGG19 34.51 36.95 37.79 38.80 41.26 42.10 44.05 45.75 47.55 48.33 48.68 50.15
Incep V3 41.30 41.83 43.04 44.72 45.27 46.72 47.63 48.44 48.77 49.02 50.06 51.23
ResNet34 39.88 40.74 42.30 44.12 45.13 46.07 48.06 48.92 49.40 50.88 51.06 52.22
ResNet50 40.37 42.54 44.66 45.78 46.53 47.24 48.08 49.73 49.88 51.36 52.14 52.42
ResNet101 37.02 40.31 41.85 43.10 46.15 47.60 48.23 48.38 49.27 50.06 51.07 51.75
Average 39.13 40.90 42.39 43.80 45.34 46.29 47.50 48.26 49.02 49.93 50.65 51.51
Modified
VGG16 52.78 55.31 61.66 69.73 71.06 76.98 78.39 79.78 81.10 80.52 81.85 81.51
VGG19 54.53 57.40 64.66 70.85 72.20 76.69 78.10 80.52 81.13 81.97 81.68 81.85
Incep V3 52.22 54.53 57.24 61.66 66.56 72.66 76.69 78.68 80.30 81.14 81.45 80.52
ResNet34 51.06 55.31 59.32 64.66 66.17 71.74 73.38 75.24 77.70 79.78 81.89 82.42
ResNet50 51.75 54.46 57.24 61.37 65.36 67.02 71.45 73.19 75.24 79.37 81.10 80.30
ResNet101 53.37 59.32 61.06 66.83 69.27 73.33 74.38 76.66 78.36 79.75 82.71 82.47
Average 52.62 56.06 60.20 65.85 68.44 73.07 75.40 77.34 78.97 80.42 81.78 81.51
Table 5-15: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=8:2 on NIH Chest X-Ray8
Dataset for TB Related Disease Detection
CNN
Model
Train/Valid = 8:2
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 43.95 44.21 45.79 46.26 47.58 47.81 50.04 52.06 52.41 52.88 52.77 54.46
VGG19 41.76 43.41 45.34 46.70 48.40 49.67 50.14 50.83 51.83 52.36 52.69 53.58
Incep V3 43.94 45.35 46.07 47.41 48.48 49.18 50.36 51.69 52.45 52.69 53.17 53.60
ResNet34 45.73 46.03 47.32 48.45 49.58 50.14 51.09 52.41 52.77 52.91 54.24 54.89
ResNet50 42.62 44.14 44.96 45.70 47.48 48.52 49.61 50.70 51.70 51.93 53.41 54.46
ResNet101 39.01 42.61 43.55 42.75 46.07 48.55 48.83 49.33 52.52 52.11 53.58 55.86
Average 42.84 44.29 45.51 46.21 47.93 48.98 50.01 51.17 52.28 52.48 53.31 54.48
Modified
VGG16 62.14 70.75 73.94 76.32 79.43 80.56 81.48 82.13 84.17 84.99 85.96 85.19
VGG19 59.90 67.15 73.15 75.34 79.10 81.78 82.26 82.68 82.49 83.84 84.11 86.12
Incep V3 55.79 63.88 68.91 70.89 76.02 78.04 80.70 81.13 82.09 82.65 84.97 83.94
ResNet34 57.99 66.58 69.43 76.96 78.87 81.63 83.71 84.67 85.13 85.10 85.44 85.07
ResNet50 60.70 65.11 69.79 72.22 76.81 78.40 81.55 81.47 81.95 81.79 82.41 84.01
ResNet101 56.68 61.95 67.59 71.89 74.87 78.61 80.20 81.94 82.69 83.50 83.87 83.08
Average 58.87 65.90 70.47 73.94 77.52 79.84 81.65 82.34 83.09 83.65 84.46 84.57
Table 5-16: Valid Accuracy of Each Epoch During Training with Train/Valid Ratio=9:1 on NIH Chest X-Ray8
Dataset for TB Related Disease Detection
CNN
Model
Train/Valid = 9:1
Accuracy on Validation Set for Each Epoch During the Training Process (%)
E 1 E 2 E 3 E 4 E 5 E 6 E 7 E 8 E 9 E 10 E 11 E 12
Raw
VGG16 45.11 45.94 47.34 48.29 49.51 50.54 51.69 53.80 54.80 56.29 56.29 56.80
VGG19 46.69 50.57 51.23 51.49 53.09 52.23 53.66 54.94 53.86 55.57 56.14 56.83
Incep V3 46.43 47.11 48.46 49.80 51.11 51.46 52.43 52.69 55.40 55.94 57.17 58.54
ResNet34 46.11 46.40 48.43 50.06 50.34 51.09 52.80 53.60 55.37 56.06 58.20 58.17
ResNet50 43.54 45.66 46.57 48.37 48.71 50.29 51.69 52.40 53.09 55.03 55.66 57.71
ResNet101 43.26 44.63 46.20 47.49 48.43 49.74 50.71 51.86 53.17 55.09 56.91 58.97
Average 45.19 46.72 48.04 49.25 50.20 50.89 52.16 53.22 54.28 55.66 56.73 57.84
Modified
VGG16 65.52 71.43 72.31 76.46 80.29 80.20 81.82 83.22 83.63 84.02 85.17 86.40
VGG19 67.06 70.74 77.77 77.94 80.26 83.31 83.94 84.97 84.94 86.03 86.46 85.31
Incep V3 67.00 73.71 75.63 78.60 81.40 82.94 84.57 84.51 86.03 87.40 86.63 86.66
ResNet34 64.31 69.00 70.46 76.74 80.77 81.43 84.40 84.74 85.17 85.71 86.25 86.43
ResNet50 60.14 66.49 69.57 75.45 76.14 78.14 82.49 83.49 84.43 81.69 85.74 84.49
ResNet101 61.06 66.49 69.57 75.06 75.46 78.14 76.14 82.49 83.49 84.43 84.94 81.69
Average 64.18 69.64 72.55 76.71 79.05 80.69 82.23 83.90 84.62 84.88 85.87 85.16
Figure 5-8: Averaged Accuracy Comparison on NIH Chest X-Ray8 Dataset for TB Related Disease Detection
5.5.2 Fine-Tune the Modified CNN Models Via ABC
After the model modifications, we further improved the classification accuracy by fine-tuning the
fully connected layers of the modified CNN models, already trained on the target CXR dataset, via
the ABC algorithm.
For train/valid ratios of 9:1, 8:2 and 7:3, the accuracies on both the validation and testing sets
achieved by the raw, modified and fine-tuned models are presented for comparison. Statistics that
measure detailed model performance, such as recall, precision and Area Under the ROC Curve (AUC),
are given in Chapter 6.
Accuracy, the ratio of correctly labeled images to the total number of images, is calculated via
the following equation:

$$\text{accuracy} = \frac{\sum_{i=1}^{n} TC_i}{N}$$

where $n$ represents the total number of classes, $TC_i$ indicates the number of correctly
classified instances within class $i$, and $N$ is the total number of instances in the validation
set.
Table 5-17 presents the validation and testing accuracy for different CNN models on the Montgomery
County CXR Dataset. For each train/valid ratio, the best validation accuracy increases from the
raw models to the modified models, and again to the models fine-tuned via ABC. This overall
improvement across the stages also holds for the testing set.
However, there are some inconsistencies between the validation and testing accuracy achieved by
the same model. For example, among the raw models with a train/valid ratio of 8:2, ResNet34
achieves 88% accuracy on the validation set but only 62.5% on the testing set, and VGG19, which
has the best validation accuracy of 93.33%, reaches only 75% testing accuracy. The same problem
also exists for the modified and fine-tuned CNN models. The reasons for this inconsistent
performance vary. First, since the validation set is involved in the training and fine-tuning
process to evaluate the model after each epoch of parameter updating, the model may have
overfitted, yielding high accuracy on the validation set but poor accuracy on the testing set.
Moreover, as this is the smallest CXR dataset used in our experiment, there are only 8 images in
the testing set; such a small number of test cases is not representative and may lead to biased
results. This is also the main cause of the fluctuations in the testing accuracy.
Table 5-17: Validation and Testing Accuracy Results on Montgomery County Chest X-Ray Dataset for
Abnormality Detection
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 82.50% 75.00% 84.00% 100% 86.67% 87.50%
VGG19 90.00% 75.00% 88.00% 87.50% 93.33% 75.00%
Inception V3 85.00% 87.50% 84.00% 87.50% 86.67% 87.50%
ResNet34 87.50% 87.50% 88.00% 62.50% 86.67% 100%
ResNet50 90.00% 87.50% 88.00% 75.00% 93.33% 100%
ResNet101 87.50% 75.00% 88.00% 75.00% 93.33% 87.50%
Modified
Model
VGG16 92.50% 100% 92.00% 87.50% 93.33% 100%
VGG19 97.50% 100% 96.00% 87.50% 100% 87.50%
Inception V3 90.00% 100% 92.00% 100% 93.33% 62.50%
ResNet34 95.00% 62.50% 92.00% 100% 93.33% 75.00%
ResNet50 95.00% 87.50% 96.00% 50.00% 100% 75.00%
ResNet101 95.00% 87.50% 96.00% 87.50% 100% 87.50%
Modified
Model
Fine-tuned
by ABC
VGG16 95.00% 87.50% 92.00% 87.50% 100% 75.00%
VGG19 100 % 100% 100% 75.00% 100% 87.50%
Inception V3 95.00% 62.50% 92.00% 100% 93.33% 87.50%
ResNet34 97.50% 75.00% 96.00% 75.00% 100% 100%
ResNet50 97.50% 87.50% 100% 87.50% 100% 87.50%
ResNet101 100% 100% 100% 100% 100% 87.50%
Table 5-18 gives the validation and testing accuracy achieved by different CNN models on the
Shenzhen Hospital CXR Dataset. For train/valid ratios of 7:3, 8:2 and 9:1, the accuracy on both
the validation and testing sets increases across the three stages of our experiment.
The inconsistency between validation and testing performance also appears on this dataset. Since
the testing set contains only 22 images, which is still relatively small and not representative
enough, some fluctuation and instability between the validation and testing performance is to be
expected.
Table 5-18: Validation and Testing Accuracy Results on Shenzhen Hospital Chest X-Ray Dataset for
Abnormality Detection
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 87.03% 81.82% 89.23% 77.27% 90.77% 68.18%
VGG19 87.03% 77.27% 89.23% 50.00% 90.77% 63.64%
Inception V3 87.03% 81.82% 90.77% 72.73% 90.77% 72.73%
ResNet34 88.11% 77.27% 91.54% 72.73% 92.31% 86.36%
ResNet50 87.03% 68.18% 90.77% 77.27% 92.31% 86.36%
ResNet101 88.11% 72.73% 91.54% 81.82% 93.85% 77.27%
Modified
Model
VGG16 89.73% 86.36% 92.31% 81.82% 92.31% 86.36%
VGG19 91.89% 86.36% 92.31% 90.91% 93.85% 81.82%
Inception V3 91.35% 86.36% 92.31% 86.36% 93.85% 77.27%
ResNet34 91.89% 86.36% 95.38% 72.73% 95.38% 77.27%
ResNet50 91.35% 86.36% 94.62% 68.18% 95.38% 90.91%
ResNet101 91.89% 77.27% 96.15% 86.36% 96.92% 77.27%
Modified
Model
Fine-tuned
by ABC
VGG16 92.43% 86.36% 92.31% 81.82% 93.85% 72.73%
VGG19 93.51% 90.91% 93.08% 81.82% 95.38% 86.36%
Inception V3 91.89% 77.27% 93.08% 86.36% 93.85% 77.27%
ResNet34 92.43% 72.73% 96.15% 81.82% 96.92% 81.82%
ResNet50 94.05% 86.36% 95.39% 90.90% 96.92% 86.36%
ResNet101 92.97% 81.82% 96.92% 90.90% 98.46% 95.45%
Table 5-19 presents the validation and testing accuracy results on the NIH Chest X-Ray8 Dataset
for lung abnormality detection. As with the previous two datasets, the accuracy on both the
validation and testing sets keeps increasing from the raw models to the modified models, and again
to the fine-tuned models, for all train/valid ratios. Moreover, the inconsistency between
validation and testing performance no longer appears, since the testing set has been increased to
over 2,000 images. This larger testing set greatly reduces the fluctuations in the models'
performance on test cases and thus provides a more unbiased and accurate measure of a model for
practical applications. Furthermore, as this is the largest dataset used in our experiment, the
training data is also large and comprehensive. Therefore, even the original CNN models, despite
relatively low accuracy on the validation set, still perform well on the testing set.
Table 5-19: Validation and Testing Accuracy Results on NIH Chest X-Ray8 Dataset for Abnormality
Detection
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 80.31% 90.11% 83.09% 93.03% 88.16% 92.33%
VGG19 81.17% 90.76% 82.66% 92.37% 87.36% 91.72%
Inception V3 81.64% 91.94% 84.06% 91.50% 88.54% 93.86%
ResNet34 82.54% 92.46% 85.38% 94.77% 88.07% 93.68%
ResNet50 80.92% 92.37% 83.35% 91.24% 89.78% 92.98%
ResNet101 82.20% 92.64% 83.81% 89.32% 89.48% 94.60%
Modified
Model
VGG16 87.29% 96.69% 90.22% 97.69% 93.69% 97.91%
VGG19 87.68% 97.69% 90.38% 97.30% 93.22% 98.00%
Inception V3 87.99% 96.51% 91.09% 98.82% 93.60% 97.65%
ResNet34 88.19% 97.21% 91.31% 97.21% 93.86% 97.82%
ResNet50 88.03% 97.86% 90.53% 97.86% 94.42% 98.17%
ResNet101 88.44% 97.99% 90.87% 97.78% 94.06% 98.61%
Modified
Model
Fine-tuned
by ABC
VGG16 87.97% 97.60% 91.15% 98.56% 93.82% 98.00%
VGG19 88.06% 96.73% 90.86% 98.39% 94.16% 98.04%
Inception V3 88.74% 98.21% 91.49% 97.91% 94.48% 97.65%
ResNet34 88.42% 97.91% 91.40% 97.08% 94.81% 98.26%
ResNet50 88.84% 98.26% 90.96% 98.61% 94.61% 98.87%
ResNet101 88.69% 97.39% 90.97% 98.78% 94.12% 96.82%
Table 5-20 presents the accuracy of different CNN models on both the validation and testing sets
of the NIH Chest X-Ray8 Dataset for the diagnosis of specific lung diseases. Compared to lung
abnormality detection, TB-related disease detection is a more complicated task, requiring the
model to recognize the patterns of multiple diseases and make predictions. Therefore, the overall
performance on this task is not as good as on abnormality detection. Following the same trend
observed in the lung abnormality detection experiments, the accuracy on both the validation and
testing sets keeps increasing from the raw models to the modified models, and again to the
fine-tuned models, for all train/valid ratios. Moreover, compared to the fine-tuning process, the
modification of the CNN structure contributes more to improving the classification accuracy.
Table 5-20: Validation and Testing Accuracy Results on NIH Chest X-Ray8 Dataset for TB Related Disease
Detection
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 51.29% 54.05% 54.46% 53.88% 56.80% 58.82%
VGG19 50.15% 54.00% 53.58% 52.71% 56.83% 57.41%
Inception V3 51.23% 48.24% 53.60% 52.71% 58.54% 57.65%
ResNet34 52.22% 53.41% 54.89% 56.47% 58.20% 59.53%
ResNet50 52.42% 48.00% 54.46% 48.00% 57.71% 59.06%
ResNet101 51.75% 53.18% 55.86% 54.35% 58.97% 61.18%
Modified
Model
VGG16 81.85% 81.65% 85.96% 88.94% 86.40% 88.71%
VGG19 81.97% 77.65% 86.12% 83.29% 86.46% 86.35%
Inception V3 81.45% 80.47% 84.97% 85.41% 87.40% 90.59%
ResNet34 82.42% 84.24% 85.44% 86.35% 86.43% 88.70%
ResNet50 81.10% 80.24% 84.01% 84.24% 85.74% 88.24%
ResNet101 82.70% 80.71% 83.87% 83.29% 84.94% 82.59%
Modified
Model
Fine-tuned
by ABC
VGG16 82.63% 83.06% 86.26% 87.29% 87.40% 87.53%
VGG19 82.47% 80.94% 86.68% 84.94% 86.94% 87.53%
Inception V3 82.81% 85.18% 85.62% 87.53% 88.86% 88.24%
ResNet34 83.62% 84.00% 86.07% 86.59% 86.54% 85.65%
ResNet50 82.66% 85.41% 84.11% 83.76% 86.43% 86.59%
ResNet101 83.94% 84.00% 84.54% 78.59% 85.00% 84.71%
5.5.3 Discussion and Conclusion
In this chapter, the performance of the CNN models for the diagnosis of TB and its specific
manifestations has been compared on the three public CXR datasets. Results have been given and
discussed in Sections 5.5.1 and 5.5.2.
From the result analysis, the modified CNN models generally show a significant improvement over
the whole training process and thereby achieve higher accuracy on the validation sets of all three
datasets, across the different train/valid ratios, compared to the raw models. In addition, the
modified CNN models fine-tuned via the ABC algorithm improve performance slightly further and
generally achieve the highest validation accuracy across all models.
However, inconsistent performance between the validation and testing sets of the same model
appears in the experiments on the two smaller datasets, the Montgomery County Chest X-Ray Dataset
and the Shenzhen Hospital Chest X-Ray Dataset: the testing sets contain so few images that the
testing data is not representative enough, which causes fluctuations in the testing accuracy. This
inconsistency does not appear on the NIH Chest X-Ray8 Dataset, since its large number of testing
images eliminates the fluctuations and provides a more unbiased and accurate measure of a CNN
model on previously unseen cases. Moreover, the performance of each participating CNN model varies
across datasets and train/valid ratios. Both the amount of data and the model itself introduce
unstable and unpredictable factors that influence a model's final performance, which makes it
harder to identify a single best model for TB diagnosis.
Therefore, the concept of an ensemble CNN structure is proposed in Chapter 6 to help generate a
stable model with consistent and better performance regardless of these external conditions.
Chapter 6
Increasing Accuracy of TB Detection via Ensemble Model
6.1 Ensemble Learning
Ensemble learning [65] is a machine learning method that integrates multiple learners and
generates the final output from the results of the integrated learners to achieve better
performance. Learners used in an ensemble need to maintain sufficient diversity so that they
capture different features of the same target. Therefore, two factors need to be considered during
ensemble learning: the selection and training of each learner, and how to combine the results of
the different learners into the final output.
So far, various ensemble learning methods, such as bagging, boosting and stacking, have been
proposed and are widely used for specific tasks.
The bagging method [66] starts with the training data. It randomly generates multiple subsets of
the training set and trains one classifier on each generated subset. All of these weakly trained
classifiers are then integrated by a combination algorithm to form the ensemble architecture. This
method uses different training subsets to ensure the diversity of the classifiers.
The boosting method [67] mainly focuses on reducing the bias among base models. Initially, every
sample in the training set is assigned the same weight. During repeated training and validation,
the weights of misclassified samples are increased while the others remain the same. Multiple
classifiers generated through these different weight-updating processes maintain diversity and are
combined to produce the ensemble architecture.
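One round of such a weight update can be sketched as follows (an AdaBoost-style illustration under assumed conventions, not the exact scheme of [67]; after up-weighting, the weights are renormalized):

```python
def boosting_reweight(weights, misclassified):
    """One boosting-style weight update (AdaBoost-flavoured sketch).

    weights: current per-sample weights; misclassified: parallel booleans.
    Misclassified samples are up-weighted by a factor derived from the
    weighted error rate, then all weights are renormalized to sum to 1.
    """
    total = sum(weights)
    error = sum(w for w, m in zip(weights, misclassified) if m) / total
    error = min(max(error, 1e-10), 1 - 1e-10)   # guard against error of 0 or 1
    factor = (1 - error) / error                # > 1 when better than chance
    updated = [w * factor if m else w for w, m in zip(weights, misclassified)]
    norm = sum(updated)
    return [w / norm for w in updated]
```

For example, with four equally weighted samples of which one is misclassified, the misclassified sample's weight rises from 0.25 to 0.5 after renormalization, so the next learner focuses on it.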
The stacking method [68] trains different classifiers on the target dataset, stacking one on top
of the other, so that each newly trained classifier corrects the errors of the previous ones.
Theoretically, this method subsumes the ensemble methods mentioned above.
In general, all ensemble methods consist of two steps: generating a set of simple models from the
original data, and combining them into one aggregated model. Figure 6-1 presents the basic
structure of an ensemble model.
Figure 6-1: Ensemble Model Structure
As illustrated, the manipulated training data is used to train a series of classifiers (tier-1
classifiers); the outputs of these classifiers are then fed into a tier-2 classifier (also known
as a meta-classifier) and reorganized to produce a more unbiased output.
The idea behind this structure is to learn from the data in a more unbiased way, based on the
knowledge learned by the different classifiers. For example, if one classifier learns a wrong
feature pattern from the dataset, it may misclassify new data with similar features; the tier-2
classifier of the ensemble model, however, may still learn correctly by organizing the knowledge
of all classifiers in an unbiased way, compensating for the weaknesses of individual classifiers
and producing the right classification result.
In general, the ability to trade off bias and variance among the base models, and to reduce the
risk of overfitting, makes an ensemble model superior to any single model [69]. Ensembles have
been applied successfully not only in supervised learning but also in many unsupervised learning
settings, such as probability density estimation.
6.2 Ensemble Combination Methods Used for TB Detection
Our experiment implements a neural network ensemble [70], an ensemble learning paradigm that
jointly uses the six CNN models with different structures to improve the overall performance of TB
detection. Two algebraic combiners, a linear average based ensemble and a voting based ensemble,
are implemented as the tier-2 classifier to generate a final classification result with higher
accuracy and more stable performance.
6.2.1 Linear Average Based Ensemble
The idea of a linear average based ensemble is simple and intuitive: calculate the linear average
of the outputs of all component classifiers. This method can efficiently mitigate the overfitting
present in any single model, so the ensemble model generalizes better during classification.
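For class-probability outputs, this combiner reduces to an element-wise mean (a minimal sketch assuming each component model emits a probability vector over the same ordered set of classes):

```python
def average_ensemble(prob_outputs):
    """Element-wise mean of the class-probability vectors produced by the
    component classifiers; the argmax of the result is the ensemble label.

    prob_outputs: one probability vector per model, all over the same
    ordered set of classes.
    """
    n_models = len(prob_outputs)
    n_classes = len(prob_outputs[0])
    return [sum(p[c] for p in prob_outputs) / n_models
            for c in range(n_classes)]
```

For instance, three models outputting [0.9, 0.1], [0.6, 0.4] and [0.3, 0.7] average to [0.6, 0.4], so the ensemble predicts the first class even though one model disagreed.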
An example is shown in Figure 6-2. The green line represents the decision boundary of a single
classifier trained on a binary classification task (blue dots vs. red dots). It shows an obvious
trend of overfitting: the overly complex model has picked up noise and random fluctuations in the
training data. Such an over-complicated model performs poorly on new data, limiting its ability to
generalize. After averaging the results of different classifiers trained on the same data, a
smooth black curve emerges as the final boundary separating the dots. This simpler black curve
cancels the negative impact of the noise and increases the classification accuracy on new data.
Figure 6-2: Overfitted model and linear averaged model
6.2.2 Voting Based Ensemble
A voting based ensemble generates the final result from the output agreed on by the majority of
the component classifiers. When multiple classes tie for the majority, the output is instead
obtained by averaging the probabilities calculated by the individual models. This method requires
diverse classifiers so that the errors of any single model are not amplified by the ensemble.
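The voting rule described above, including the tie-breaking fallback to averaged probabilities, can be sketched as follows (illustrative; assumes probability-vector outputs as in the previous combiner):

```python
from collections import Counter

def voting_ensemble(prob_outputs):
    """Majority vote over the component classifiers' predicted labels;
    when classes tie for the majority, fall back to averaging the
    probability vectors and taking the argmax.
    """
    preds = [max(range(len(p)), key=p.__getitem__) for p in prob_outputs]
    ranked = Counter(preds).most_common()
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]                    # clear majority winner
    # tie between majority groups: average the probabilities instead
    n = len(prob_outputs)
    avg = [sum(p[c] for p in prob_outputs) / n
           for c in range(len(prob_outputs[0]))]
    return max(range(len(avg)), key=avg.__getitem__)
```

With two models predicting different classes ([0.9, 0.1] vs. [0.2, 0.8]), the vote is tied, so the averaged probabilities [0.55, 0.45] decide in favor of the first class.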
6.3 Experiment Descriptions
In this chapter, the performance of the two ensemble models and their component CNN models on both
abnormality detection and TB-related disease detection is compared and analyzed in the context of
classification. The quality of the performance is evaluated statistically using the following
measures: accuracy, specificity, recall, precision, F1-score and AUC. The experimental settings
for the train/valid separation and the parameters remain the same as in Chapter 5.
6.4 Evaluation Metrics
Evaluation metrics provide an objective, statistical assessment of deep learning models. For
classification tasks, the quality of detection is measured by accuracy, recall, specificity,
precision, F1-score, AUC and the confusion matrix. Among these, recall, specificity, precision and
F1-score are mainly used to assess binary classification; the rest can be used for both binary and
multi-class classification.
Accuracy, the most commonly used measure, is the ratio of the number of correct predictions to the
total number of input samples. In abnormality detection, the number of pathological samples
correctly classified is called true positives (TP), and the number of correctly classified normal
samples is called true negatives (TN). The number of pathological samples incorrectly classified
as normal is called false negatives (FN), and the number of incorrectly classified normal samples
is called false positives (FP).
Recall, also called the true positive rate (TPR), measures how many people with TB are correctly
identified as having the TB-related manifestation. Specificity, also called the true negative rate
(TNR), evaluates how many healthy people are correctly identified as not having the TB-related
manifestation. Precision measures how many people with TB have been correctly identified among all
samples identified as having TB. The F1-score, the harmonic mean of precision and recall, measures
how precise and how robust the classifier is. Recall, specificity, precision and F1-score are
calculated as follows:
\[ \text{recall} = TPR = \frac{TP}{TP + FN} \]
\[ \text{specificity} = TNR = \frac{TN}{TN + FP} \]
\[ \text{precision} = \frac{TP}{TP + FP} \]
\[ \text{F1-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
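The four measures above follow directly from the confusion-matrix counts. A minimal sketch in plain Python (the counts in the example are illustrative, not taken from the experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute recall, specificity, precision and F1-score
    from binary confusion-matrix counts."""
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, specificity, precision, f1

# Example: 16 TB cases found, 1 missed; 22 normals kept, 2 flagged
recall, specificity, precision, f1 = classification_metrics(tp=16, tn=22, fp=2, fn=1)
```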
The statistical interpretation of the AUC is the probability that a randomly chosen case of a given
class is ranked above cases of the other classes by the classifier. This value is independent of the
classification threshold because it considers only the rank of each prediction.
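This rank-based interpretation can be computed directly by counting, over all positive–negative pairs of scores, how often the positive case outranks the negative one. A small illustrative sketch (the scores are made up for demonstration):

```python
def auc_from_ranks(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive case
    outranks a randomly chosen negative case; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# 8 of the 9 positive-negative pairs are ranked correctly -> AUC = 8/9
auc = auc_from_ranks([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Because only the ordering of the scores matters, any monotone rescaling of the classifier's outputs leaves this value unchanged, which is exactly the threshold independence noted above.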
An ideal classifier attains high values on all of the evaluation metrics introduced above.
6.5 Results and Discussion
6.5.1 Lung Abnormality Detection
Table 6-1, Table 6-2, Table 6-3 and Table 6-4 present the validation and testing accuracy as well
as the evaluation metrics introduced in Section 6.4 for the different CNN models with train/valid
ratios of 7:3, 8:2 and 9:1 respectively on the Montgomery County Chest X-Ray Dataset.
Table 6-1: Ratio Validation and Testing Accuracy Results on Montgomery Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 82.50% 75.00% 84.00% 100% 86.67% 87.50%
VGG19 90.00% 75.00% 88.00% 87.50% 93.33% 75.00%
Inception V3 85.00% 87.50% 84.00% 87.50% 86.67% 87.50%
ResNet34 87.50% 87.50% 88.00% 62.50% 86.67% 100%
ResNet50 90.00% 87.50% 88.00% 75.00% 93.33% 100%
ResNet101 87.50% 75.00% 88.00% 75.00% 93.33% 87.50%
Ensemble-Linear 92.50% 87.50% 92.00% 100% 93.33% 100%
Ensemble-Voting 92.50% 87.50% 92.00% 100% 93.33% 100%
Modified
Model
VGG16 92.50% 100% 92.00% 87.50% 93.33% 100%
VGG19 97.50% 100% 96.00% 87.50% 100% 87.50%
Inception V3 90.00% 100% 92.00% 100% 93.33% 62.50%
ResNet34 95.00% 62.50% 92.00% 100% 93.33% 75.00%
ResNet50 95.00% 87.50% 96.00% 50.00% 100% 75.00%
ResNet101 95.00% 87.50% 96.00% 87.50% 100% 87.50%
Ensemble-Linear 97.50% 100% 100% 100% 100% 100%
Ensemble-Voting 97.50% 100% 100% 100% 100% 100%
Modified
Model
Fine-tuned
by ABC
VGG16 95.00% 87.50% 92.00% 87.50% 100% 75.00%
VGG19 100% 100% 100% 75.00% 100% 87.50%
Inception V3 95.00% 62.50% 92.00% 100% 93.33% 87.50%
ResNet34 97.50% 75.00% 96.00% 75.00% 100% 100%
ResNet50 97.50% 87.50% 100% 87.50% 100% 87.50%
ResNet101 100% 100% 100% 100% 100% 87.50%
Ensemble-Linear 100% 100% 100% 100% 100% 100%
Ensemble-Voting 100% 100% 100% 100% 100% 100%
Table 6-2: Statistical Model Analysis with Train/Valid Ratio = 7:3 on Montgomery Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 7:3
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.913 0.706 0.857 0.774 0.926
VGG19 0.870 0.941 0.842 0.889 0.916
InceptionV3 0.783 0.941 0.762 0.842 0.882
ResNet34 0.913 0.824 0.875 0.848 0.941
ResNet50 0.957 0.824 0.933 0.875 0.982
ResNet101 0.783 1.000 0.773 0.872 0.923
Ensemble-Linear 0.913 0.941 0.889 0.914 0.972
Ensemble-Voting 0.913 0.941 0.889 0.914 0.941
Modified
Model
VGG16 0.957 0.882 0.938 0.909 0.980
VGG19 1.000 0.941 1.000 0.970 1.000
Inception V3 0.913 0.882 0.882 0.882 0.967
ResNet34 1.000 0.882 1.000 0.938 0.972
ResNet50 0.913 0.941 0.889 0.914 0.987
ResNet101 0.913 1.000 0.895 0.944 0.969
Ensemble-Linear 0.957 1.000 0.944 0.971 1.000
Ensemble-Voting 0.957 1.000 0.944 0.971 1.000
Modified
Model Fine-
tuned by
ABC
VGG16 0.957 0.941 0.941 0.941 0.992
VGG19 1.000 1.000 1.000 1.000 1.000
Inception V3 0.957 0.941 0.941 0.941 0.990
ResNet34 1.000 0.941 1.000 0.970 1.000
ResNet50 1.000 0.941 1.000 0.970 0.985
ResNet101 1.000 1.000 1.000 1.000 1.000
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
Table 6-3: Statistical Model Analysis with Train/Valid Ratio = 8:2 on Montgomery Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 8:2
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 1.000 0.600 1.000 0.750 0.940
VGG19 0.933 0.800 0.889 0.842 0.927
InceptionV3 0.733 1.000 0.714 0.833 0.900
ResNet34 0.867 0.900 0.818 0.857 0.920
ResNet50 0.800 1.000 0.769 0.870 0.900
ResNet101 0.867 0.900 0.818 0.857 0.900
Ensemble-Linear 0.867 1.000 0.833 0.909 0.973
Ensemble-Voting 0.867 1.000 0.833 0.909 0.967
Modified
Model
VGG16 0.867 1.000 0.833 0.909 0.987
VGG19 0.933 1.000 0.909 0.952 1.000
Inception V3 0.867 1.000 0.833 0.909 0.953
ResNet34 0.867 1.000 0.833 0.909 0.940
ResNet50 1.000 0.900 1.000 0.947 0.973
ResNet101 1.000 0.900 1.000 0.947 0.973
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
Modified
Model Fine-
tuned by
ABC
VGG16 0.867 1.000 0.833 0.909 0.987
VGG19 1.000 1.000 1.000 1.000 1.000
Inception V3 0.867 1.000 0.833 0.909 0.953
ResNet34 1.000 0.900 1.000 0.947 0.947
ResNet50 1.000 1.000 1.000 1.000 1.000
ResNet101 1.000 1.000 1.000 1.000 1.000
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
Table 6-4: Statistical Model Analysis with Train/Valid Ratio = 9:1 on Montgomery Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 9:1
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.900 0.800 0.800 0.800 0.860
VGG19 0.900 1.000 0.833 0.909 1.000
InceptionV3 0.800 1.000 0.714 0.833 0.980
ResNet34 0.900 0.800 0.800 0.800 0.940
ResNet50 0.900 1.000 0.833 0.909 0.980
ResNet101 1.000 0.800 1.000 0.889 0.880
Ensemble-Linear 1.000 0.800 1.000 0.889 0.940
Ensemble-Voting 1.000 0.800 1.000 0.889 0.980
Modified
Model
VGG16 0.900 1.000 0.833 0.909 0.980
VGG19 1.000 1.000 1.000 1.000 1.000
Inception V3 0.900 1.000 0.833 0.909 1.000
ResNet34 0.900 1.000 0.833 0.909 1.000
ResNet50 1.000 1.000 1.000 1.000 1.000
ResNet101 1.000 1.000 1.000 1.000 1.000
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
Modified
Model Fine-
tuned by
ABC
VGG16 1.000 1.000 1.000 1.000 1.000
VGG19 1.000 1.000 1.000 1.000 1.000
Inception V3 0.900 1.000 0.833 0.909 1.000
ResNet34 1.000 1.000 1.000 1.000 1.000
ResNet50 1.000 1.000 1.000 1.000 1.000
ResNet101 1.000 1.000 1.000 1.000 1.000
Ensemble-Linear 1.000 1.000 1.000 1.000 1.000
Ensemble-Voting 1.000 1.000 1.000 1.000 1.000
From the above tables, it is clear that the performance of the single CNN models is inconsistent
between the validation set and the testing set. Some models achieve a higher detection accuracy on
the validation set than others while performing poorly on the testing set. For example, during the
fine-tuning process with a train/valid ratio of 7:3, Inception V3 and ResNet34 achieve accuracies
of 95% and 97.5% on the validation set respectively, but only 62.5% and 75% on the testing set. The
two ensemble models greatly improve stability, providing consistent performance on both sets with
higher accuracy.
For both the linear-average and voting-based ensemble models, the classification accuracy reaches
100% in almost all validation and testing cases under all train/valid ratios during the model
modification and fine-tuning steps.
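As a sketch of the two combination rules (not the thesis code; the probability vectors are illustrative), the linear-average ensemble averages the models' softmax outputs before taking the arg-max, while the voting ensemble takes the majority of the models' individual arg-max decisions:

```python
import numpy as np

def ensemble_linear(probs):
    """Linear-average ensemble: average the class-probability
    vectors of all models, then take the arg-max class."""
    return int(np.argmax(probs.mean(axis=0)))

def ensemble_voting(probs):
    """Voting ensemble: each model votes for its arg-max class;
    the class with the most votes wins."""
    votes = np.argmax(probs, axis=1)
    return int(np.bincount(votes).argmax())

# Three models' softmax outputs over two classes (normal, abnormal).
probs = np.array([[0.10, 0.90],
                  [0.55, 0.45],
                  [0.60, 0.40]])
linear_pred = ensemble_linear(probs)   # averaging is swayed by the confident model
voting_pred = ensemble_voting(probs)   # majority vote follows the other two
```

The example is deliberately chosen so the two rules disagree: one highly confident model dominates the average but contributes only a single vote. This is consistent with the two ensembles performing almost, but not exactly, identically.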
Table 6-5, Table 6-6, Table 6-7 and Table 6-8 present the validation and testing accuracy as well
as the other statistical measures for the different CNN models on the Shenzhen Hospital Chest X-Ray
Dataset under different train/valid ratios.
Table 6-5: Ratio Validation and Testing Accuracy Results on Shenzhen Hospital Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 87.03% 81.82% 89.23% 77.27% 90.77% 68.18%
VGG19 87.03% 77.27% 89.23% 50.00% 90.77% 63.64%
Inception V3 87.03% 81.82% 90.77% 72.73% 90.77% 72.73%
ResNet34 88.11% 77.27% 91.54% 72.73% 92.31% 86.36%
ResNet50 87.03% 68.18% 90.77% 77.27% 92.31% 86.36%
ResNet101 88.11% 72.73% 91.54% 81.82% 93.85% 77.27%
Ensemble-Linear 89.19% 81.82% 93.08% 81.82% 93.85% 86.36%
Ensemble-Voting 89.19% 81.82% 93.08% 81.82% 93.85% 86.36%
Modified
Model
VGG16 89.73% 86.36% 92.31% 81.82% 92.31% 86.36%
VGG19 91.89% 86.36% 92.31% 90.91% 93.85% 81.82%
Inception V3 91.35% 86.36% 92.31% 86.36% 93.85% 77.27%
ResNet34 91.89% 86.36% 95.38% 72.73% 95.38% 77.27%
ResNet50 91.35% 86.36% 94.62% 68.18% 95.38% 90.91%
ResNet101 91.89% 77.27% 96.15% 86.36% 96.92% 77.27%
Ensemble-Linear 92.43% 90.91% 96.92% 90.91% 96.92% 90.91%
Ensemble-Voting 93.51% 90.91% 96.92% 90.91% 96.92% 90.91%
Modified
Model
Fine-tuned
by ABC
VGG16 92.43% 86.36% 92.31% 81.82% 93.85% 72.73%
VGG19 93.51% 90.91% 93.08% 81.82% 95.38% 86.36%
Inception V3 91.89% 77.27% 93.08% 86.36% 93.85% 77.27%
ResNet34 92.43% 72.73% 96.15% 81.82% 96.92% 81.82%
ResNet50 94.05% 86.36% 95.39% 90.90% 96.92% 86.36%
ResNet101 92.97% 81.82% 96.92% 90.90% 98.46% 95.45%
Ensemble-Linear 94.59% 90.91% 97.69% 95.45% 98.46% 95.45%
Ensemble-Voting 94.59% 90.91% 97.69% 95.45% 98.46% 95.45%
Table 6-6: Statistical Model Analysis with Train/Valid Ratio = 7:3 on Shenzhen Hospital Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 7:3
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.911 0.832 0.908 0.868 0.914
VGG19 0.933 0.811 0.928 0.865 0.932
InceptionV3 0.889 0.853 0.890 0.871 0.915
ResNet34 0.878 0.884 0.884 0.884 0.936
ResNet50 0.911 0.832 0.908 0.868 0.922
ResNet101 0.900 0.863 0.901 0.882 0.923
Ensemble-Linear 0.911 0.874 0.912 0.892 0.940
Ensemble-Voting 0.911 0.874 0.912 0.892 0.936
Modified
Model
VGG16 0.956 0.842 0.952 0.894 0.944
VGG19 0.900 0.937 0.908 0.922 0.976
Inception V3 0.933 0.895 0.934 0.914 0.961
ResNet34 0.889 0.947 0.900 0.923 0.969
ResNet50 0.900 0.926 0.907 0.917 0.974
ResNet101 0.967 0.874 0.965 0.917 0.965
Ensemble-Linear 0.922 0.926 0.926 0.926 0.985
Ensemble-Voting 0.944 0.926 0.946 0.936 0.978
Modified
Model Fine-
tuned by
ABC
VGG16 0.911 0.937 0.918 0.927 0.975
VGG19 0.933 0.937 0.937 0.937 0.973
Inception V3 0.900 0.937 0.908 0.922 0.963
ResNet34 0.944 0.905 0.945 0.925 0.974
ResNet50 0.944 0.937 0.947 0.942 0.964
ResNet101 0.911 0.947 0.918 0.933 0.979
Ensemble-Linear 0.956 0.937 0.957 0.947 0.986
Ensemble-Voting 0.956 0.937 0.957 0.947 0.986
Table 6-7: Statistical Model Analysis with Train/Valid Ratio = 8:2 on Shenzhen Hospital Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 8:2
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.900 0.886 0.912 0.899 0.965
VGG19 0.967 0.829 0.967 0.892 0.938
InceptionV3 0.917 0.900 0.926 0.913 0.951
ResNet34 0.900 0.929 0.915 0.922 0.942
ResNet50 0.900 0.914 0.914 0.914 0.965
ResNet101 0.900 0.929 0.915 0.922 0.960
Ensemble-Linear 0.933 0.929 0.942 0.935 0.975
Ensemble-Voting 0.933 0.929 0.942 0.935 0.974
Modified
Model
VGG16 0.933 0.914 0.941 0.928 0.975
VGG19 0.950 0.900 0.955 0.926 0.978
Inception V3 0.933 0.914 0.941 0.928 0.971
ResNet34 0.983 0.929 0.985 0.956 0.983
ResNet50 0.933 0.957 0.985 0.956 0.980
ResNet101 0.983 0.943 0.985 0.964 0.988
Ensemble-Linear 0.983 0.957 0.985 0.971 0.990
Ensemble-Voting 0.983 0.957 0.985 0.971 0.990
Modified
Model Fine-
tuned by
ABC
VGG16 0.933 0.914 0.941 0.928 0.964
VGG19 0.933 0.929 0.942 0.935 0.978
Inception V3 0.917 0.943 0.930 0.936 0.973
ResNet34 0.967 0.957 0.971 0.964 0.985
ResNet50 0.933 0.971 0.944 0.958 0.986
ResNet101 0.950 0.986 0.958 0.972 0.988
Ensemble-Linear 0.967 0.986 0.972 0.979 0.991
Ensemble-Voting 0.967 0.986 0.972 0.979 0.990
Table 6-8: Statistical Model Analysis with Train/Valid Ratio = 9:1 on Shenzhen Hospital Chest X-Ray Dataset
CNN
Model
Train/Valid Ratio = 9:1
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.933 0.886 0.939 0.912 0.949
VGG19 0.967 0.857 0.968 0.909 0.945
InceptionV3 0.867 0.943 0.892 0.917 0.970
ResNet34 0.933 0.914 0.941 0.928 0.956
ResNet50 0.933 0.914 0.941 0.928 0.956
ResNet101 0.933 0.943 0.943 0.943 0.969
Ensemble-Linear 0.933 0.943 0.943 0.943 0.974
Ensemble-Voting 0.933 0.943 0.943 0.943 0.960
Modified
Model
VGG16 0.967 0.886 0.969 0.925 0.969
VGG19 0.933 0.943 0.943 0.943 0.982
Inception V3 0.900 0.971 0.919 0.944 0.980
ResNet34 0.967 0.943 0.971 0.957 0.980
ResNet50 0.933 0.971 0.944 0.958 0.979
ResNet101 0.967 0.971 0.971 0.971 0.982
Ensemble-Linear 0.967 0.971 0.971 0.971 0.984
Ensemble-Voting 0.967 0.971 0.971 0.971 0.984
Modified
Model Fine-
tuned by
ABC
VGG16 0.967 0.914 0.970 0.941 0.976
VGG19 0.967 0.943 0.971 0.957 0.976
Inception V3 0.900 0.971 0.919 0.944 0.980
ResNet34 0.933 1.000 0.946 0.972 0.990
ResNet50 1.000 0.943 1.000 0.971 0.991
ResNet101 1.000 0.971 1.000 0.986 0.999
Ensemble-Linear 1.000 0.971 1.000 0.986 0.994
Ensemble-Voting 1.000 0.971 1.000 0.986 0.990
The results in the above four tables show that the instability problem persists during the
classification of CXR images. The two ensemble models address it by providing stable performance
with the highest accuracy and evaluation metric values on the target dataset across all three
improvement steps and all train/valid ratios. The difference in performance between the
linear-average and voting-based ensemble models is very small, though in general the linear-average
ensemble performs slightly better than the voting-based one.
Table 6-9, Table 6-10, Table 6-11 and Table 6-12 present the validation and testing accuracy as
well as the other statistical measures for the different CNN models on the NIH Chest X-Ray8 Dataset
under different train/valid ratios.
Table 6-9: Ratio Validation and Testing Accuracy Results on NIH Chest X-Ray8 Dataset
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 80.31% 90.11% 83.09% 93.03% 88.16% 92.33%
VGG19 81.17% 90.76% 82.66% 92.37% 87.36% 91.72%
Inception V3 81.64% 91.94% 84.06% 91.50% 88.54% 93.86%
ResNet34 82.54% 92.46% 85.38% 94.77% 88.07% 93.68%
ResNet50 80.92% 92.37% 83.35% 91.24% 89.78% 92.98%
ResNet101 82.20% 92.64% 83.81% 89.32% 89.48% 94.60%
Ensemble-Linear 83.28% 94.25% 85.76% 94.82% 90.60% 95.73%
Ensemble-Voting 83.09% 94.38% 85.74% 94.90% 90.60% 95.89%
Modified
Model
VGG16 87.29% 96.69% 90.22% 97.69% 93.69% 97.91%
VGG19 87.68% 97.69% 90.38% 97.30% 93.22% 98.00%
Inception V3 87.99% 96.51% 91.09% 98.82% 93.60% 97.65%
ResNet34 88.19% 97.21% 91.31% 97.21% 93.86% 97.82%
ResNet50 88.03% 97.86% 90.53% 97.86% 94.42% 98.17%
ResNet101 88.44% 97.99% 90.87% 97.78% 94.06% 98.61%
Ensemble-Linear 89.19% 98.87% 91.87% 99.08% 95.07% 99.35%
Ensemble-Voting 89.11% 98.78% 91.79% 98.95% 94.96% 99.30%
Modified
Model
Fine-tuned
by ABC
VGG16 87.97% 97.60% 91.15% 98.56% 93.82% 98.00%
VGG19 88.06% 96.73% 90.86% 98.39% 94.16% 98.04%
Inception V3 88.74% 98.21% 91.49% 97.91% 94.48% 97.65%
ResNet34 88.42% 97.91% 91.40% 97.08% 94.81% 98.26%
ResNet50 88.84% 98.26% 90.96% 98.61% 94.61% 98.87%
ResNet101 88.69% 97.39% 90.97% 98.78% 94.12% 96.82%
Ensemble-Linear 89.56% 98.78% 92.07% 99.13% 95.49% 99.43%
Ensemble-Voting 89.44% 98.87% 91.93% 99.22% 95.43% 99.52%
Table 6-10: Statistical Model Analysis with Train/Valid Ratio = 7:3 on Chest X-Ray8 Dataset
CNN
Model
Train/Valid Ratio = 7:3
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.911 0.635 0.821 0.716 0.848
VGG19 0.892 0.687 0.803 0.741 0.855
InceptionV3 0.906 0.677 0.822 0.743 0.862
ResNet34 0.929 0.665 0.857 0.749 0.871
ResNet50 0.929 0.623 0.849 0.719 0.855
ResNet101 0.932 0.651 0.859 0.741 0.869
Ensemble-Linear 0.940 0.666 0.877 0.757 0.882
Ensemble-Voting 0.940 0.662 0.875 0.754 0.878
Modified
Model
VGG16 0.953 0.749 0.910 0.822 0.912
VGG19 0.970 0.732 0.939 0.823 0.914
Inception V3 0.939 0.787 0.893 0.837 0.923
ResNet34 0.956 0.766 0.918 0.835 0.923
ResNet50 0.955 0.765 0.915 0.833 0.923
ResNet101 0.958 0.770 0.922 0.839 0.924
Ensemble-Linear 0.969 0.771 0.942 0.848 0.931
Ensemble-Voting 0.969 0.769 0.942 0.847 0.930
Modified
Model Fine-
tuned by
ABC
VGG16 0.950 0.770 0.908 0.834 0.920
VGG19 0.939 0.789 0.893 0.838 0.924
Inception V3 0.964 0.769 0.931 0.842 0.924
ResNet34 0.962 0.763 0.928 0.873 0.927
ResNet50 0.961 0.776 0.927 0.845 0.928
ResNet101 0.948 0.792 0.907 0.846 0.929
Ensemble-Linear 0.967 0.785 0.938 0.855 0.934
Ensemble-Voting 0.966 0.783 0.937 0.853 0.933
Table 6-11: Statistical Model Analysis with Train/Valid Ratio = 8:2 on NIH Chest X-Ray8 Dataset
CNN
Model
Train/Valid Ratio = 8:2
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.935 0.669 0.868 0.756 0.884
VGG19 0.941 0.649 0.875 0.745 0.875
InceptionV3 0.909 0.734 0.838 0.783 0.892
ResNet34 0.952 0.700 0.904 0.789 0.899
ResNet50 0.898 0.733 0.822 0.775 0.884
ResNet101 0.894 0.752 0.819 0.784 0.891
Ensemble-Linear 0.947 0.718 0.897 0.798 0.909
Ensemble-Voting 0.946 0.718 0.896 0.797 0.905
Modified
Model
VGG16 0.962 0.809 0.932 0.866 0.942
VGG19 0.945 0.839 0.908 0.872 0.944
Inception V3 0.969 0.820 0.944 0.878 0.951
ResNet34 0.944 0.865 0.908 0.886 0.954
ResNet50 0.958 0.824 0.926 0.872 0.945
ResNet101 0.970 0.816 0.945 0.876 0.948
Ensemble-Linear 0.966 0.845 0.941 0.890 0.955
Ensemble-Voting 0.969 0.837 0.946 0.888 0.954
Modified
Model Fine-
tuned by
ABC
VGG16 0.971 0.819 0.948 0.879 0.951
VGG19 0.950 0.844 0.916 0.878 0.950
Inception V3 0.955 0.852 0.924 0.887 0.952
ResNet34 0.948 0.860 0.914 0.887 0.955
ResNet50 0.963 0.827 0.934 0.877 0.948
ResNet101 0.970 0.816 0.945 0.876 0.948
Ensemble-Linear 0.970 0.844 0.947 0.893 0.958
Ensemble-Voting 0.969 0.841 0.946 0.891 0.957
Table 6-12: Statistical Model Analysis with Train/Valid Ratio = 9:1 on NIH Chest X-Ray8 Dataset
CNN
Model
Train/Valid Ratio = 9:1
Specificity Recall Precision F1-Score AUC
Raw Model
VGG16 0.935 0.799 0.887 0.841 0.929
VGG19 0.925 0.793 0.872 0.831 0.922
InceptionV3 0.949 0.787 0.908 0.843 0.932
ResNet34 0.936 0.794 0.889 0.839 0.931
ResNet50 0.935 0.840 0.893 0.865 0.945
ResNet101 0.962 0.789 0.931 0.854 0.936
Ensemble-Linear 0.960 0.822 0.929 0.873 0.947
Ensemble-Voting 0.962 0.818 0.933 0.872 0.946
Modified
Model
VGG16 0.975 0.877 0.958 0.916 0.965
VGG19 0.985 0.851 0.973 0.908 0.965
Inception V3 0.965 0.890 0.943 0.916 0.966
ResNet34 0.974 0.883 0.956 0.918 0.965
ResNet50 0.982 0.886 0.969 0.926 0.970
ResNet101 0.973 0.891 0.954 0.922 0.970
Ensemble-Linear 0.987 0.894 0.979 0.934 0.972
Ensemble-Voting 0.987 0.892 0.977 0.933 0.972
Modified
Model Fine-
tuned by
ABC
VGG16 0.976 0.879 0.960 0.918 0.965
VGG19 0.977 0.887 0.961 0.922 0.968
Inception V3 0.970 0.906 0.950 0.928 0.974
ResNet34 0.978 0.901 0.964 0.931 0.972
ResNet50 0.978 0.896 0.964 0.929 0.970
ResNet101 0.961 0.911 0.937 0.924 0.969
Ensemble-Linear 0.985 0.909 0.974 0.940 0.976
Ensemble-Voting 0.984 0.908 0.974 0.940 0.975
The statistical analysis in the above four tables shows that at each improvement step, the
differences in performance between the CNN models are smaller than on the first two datasets. This
is because the dataset used here is the largest publicly available CXR dataset to date; the much
larger number of images improves the quality of the training process and therefore has a positive
influence on model stability. The ensemble models further improve both stability and the overall
performance of lung abnormality detection. However, since the complexity of the diagnoses increases
with the amount of data, the prediction accuracy is lower than on the first two datasets. As
before, the linear-average ensemble model performs slightly better than the voting-based ensemble
model across the train/valid ratios.
6.5.2 TB Related Disease Diagnosis
Table 6-13 shows the validation and testing accuracy of the different CNN models on the NIH Chest
X-Ray8 Dataset under different train/valid ratios for the diagnosis of specific TB manifestations
among seven TB-related diseases.
Table 6-13: Ratio Validation Accuracy and Testing Results on NIH Chest X-Ray8 Dataset for Specific TB Related
Disease Diagnosis
CNN
Model
Train/Valid Ratio = 7:3 Train/Valid Ratio = 8:2 Train/Valid Ratio = 9:1
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Valid
Accuracy
Test
Accuracy
Raw
Model
VGG16 51.29% 54.05% 54.46% 53.88% 56.80% 58.82%
VGG19 50.15% 54.00% 53.58% 52.71% 56.83% 57.41%
Inception V3 51.23% 48.24% 53.60% 52.71% 58.54% 57.65%
ResNet34 52.22% 53.41% 54.89% 56.47% 58.20% 59.53%
ResNet50 52.42% 48.00% 54.46% 48.00% 57.71% 59.06%
ResNet101 51.75% 53.18% 55.86% 54.35% 58.97% 61.18%
Ensemble-Linear 55.70% 54.59% 58.73% 57.65% 63.57% 62.35%
Ensemble-Voting 55.50% 55.76% 58.65% 56.94% 62.66% 63.29%
Modified
Model
VGG16 81.85% 81.65% 85.96% 88.94% 86.40% 88.71%
VGG19 81.97% 77.65% 86.12% 83.29% 86.46% 86.35%
Inception V3 81.45% 80.47% 84.97% 85.41% 87.40% 90.59%
ResNet34 82.42% 84.24% 85.44% 86.35% 86.43% 88.70%
ResNet50 81.10% 80.24% 84.01% 84.24% 85.74% 88.24%
ResNet101 82.70% 80.71% 83.87% 83.29% 84.94% 82.59%
Ensemble-Linear 88.74% 89.41% 90.19% 91.06% 91.11% 93.88%
Ensemble-Voting 88.53% 89.65% 90.11% 91.29% 90.94% 93.18%
Modified
Model
Fine-tuned
by ABC
VGG16 82.63% 83.06% 86.26% 87.29% 87.40% 87.53%
VGG19 82.47% 80.94% 86.68% 84.94% 86.94% 87.53%
Inception V3 82.81% 85.18% 85.62% 87.53% 88.86% 88.24%
ResNet34 83.62% 84.00% 86.07% 86.59% 86.54% 85.65%
ResNet50 82.66% 85.41% 84.11% 83.76% 86.43% 86.59%
ResNet101 83.94% 84.00% 84.54% 78.59% 85.00% 84.71%
Ensemble-Linear 89.27% 90.12% 90.54% 92.47% 91.29% 95.06%
Ensemble-Voting 89.01% 90.59% 90.44% 91.76% 91.14% 94.82%
Table 6-14, Table 6-15 and Table 6-16 present the AUC scores for the detection of each disease
across all CNN models with train/valid ratios of 7:3, 8:2 and 9:1 respectively.
Table 6-14: AUC Scores with Train/Valid Ratio = 7:3 on NIH Chest X-Ray8 Dataset
AUC Score
CNN Models
Train/Valid Ratio = 7:3
Consolidation Effusion Fibrosis Infiltration Mass Nodule Pleural
Thickening
Raw
Model
VGG16 0.866 0.899 0.839 0.900 0.832 0.801 0.810
VGG19 0.851 0.892 0.865 0.829 0.892 0.824 0.834
Inception V3 0.874 0.903 0.834 0.897 0.790 0.816 0.810
ResNet34 0.877 0.905 0.841 0.901 0.834 0.813 0.824
ResNet50 0.887 0.905 0.837 0.896 0.823 0.801 0.823
ResNet101 0.883 0.899 0.838 0.891 0.805 0.811 0.811
Ensemble-Linear 0.902 0.921 0.863 0.915 0.856 0.833 0.838
Ensemble-Voting 0.899 0.920 0.857 0.909 0.848 0.830 0.836
Modified
Model
VGG16 0.958 0.981 0.965 0.975 0.979 0.975 0.961
VGG19 0.968 0.976 0.961 0.967 0.982 0.982 0.964
Inception V3 0.952 0.979 0.963 0.969 0.976 0.974 0.965
ResNet34 0.977 0.981 0.963 0.971 0.982 0.970 0.965
ResNet50 0.959 0.980 0.958 0.967 0.975 0.973 0.960
ResNet101 0.969 0.981 0.965 0.972 0.982 0.974 0.970
Ensemble-Linear 0.978 0.990 0.979 0.985 0.992 0.991 0.981
Ensemble-Voting 0.975 0.989 0.975 0.983 0.989 0.988 0.979
Modified
Model
Fine-
tuned by
ABC
VGG16 0.959 0.979 0.963 0.978 0.984 0.981 0.965
VGG19 0.962 0.980 0.964 0.977 0.983 0.982 0.965
Inception V3 0.950 0.982 0.968 0.973 0.982 0.979 0.962
ResNet34 0.967 0.980 0.971 0.975 0.986 0.979 0.966
ResNet50 0.964 0.981 0.968 0.975 0.980 0.978 0.969
ResNet101 0.954 0.982 0.967 0.976 0.985 0.981 0.972
Ensemble-Linear 0.975 0.990 0.979 0.988 0.993 0.993 0.982
Ensemble-Voting 0.971 0.989 0.977 0.987 0.991 0.990 0.980
Table 6-15: AUC Scores with Train/Valid Ratio = 8:2 on NIH Chest X-Ray8 Dataset
AUC Score
CNN Models
Train/Valid Ratio = 8:2
Consolidation Effusion Fibrosis Infiltration Mass Nodule Pleural
Thickening
Raw
Model
VGG16 0.912 0.916 0.852 0.897 0.853 0.804 0.844
VGG19 0.886 0.913 0.846 0.904 0.859 0.803 0.838
Inception V3 0.904 0.907 0.843 0.913 0.827 0.812 0.837
ResNet34 0.901 0.914 0.864 0.900 0.856 0.813 0.848
ResNet50 0.913 0.905 0.851 0.898 0.839 0.815 0.856
ResNet101 0.914 0.923 0.862 0.907 0.853 0.824 0.855
Ensemble-Linear 0.932 0.933 0.876 0.920 0.885 0.839 0.873
Ensemble-Voting 0.929 0.931 0.874 0.917 0.883 0.835 0.872
Modified
Model
VGG16 0.962 0.986 0.969 0.980 0.987 0.987 0.975
VGG19 0.970 0.985 0.976 0.980 0.986 0.988 0.966
Inception V3 0.967 0.983 0.968 0.979 0.977 0.985 0.972
ResNet34 0.962 0.984 0.976 0.979 0.982 0.986 0.971
ResNet50 0.963 0.977 0.967 0.979 0.979 0.984 0.972
ResNet101 0.967 0.982 0.971 0.978 0.980 0.985 0.968
Ensemble-Linear 0.978 0.991 0.983 0.990 0.993 0.995 0.982
Ensemble-Voting 0.971 0.990 0.981 0.989 0.989 0.993 0.981
Modified
Model
Fine-
tuned by
ABC
VGG16 0.957 0.987 0.978 0.981 0.987 0.989 0.973
VGG19 0.973 0.986 0.978 0.978 0.987 0.990 0.972
Inception V3 0.970 0.986 0.978 0.979 0.984 0.988 0.972
ResNet34 0.962 0.987 0.971 0.983 0.988 0.988 0.974
ResNet50 0.946 0.981 0.968 0.977 0.958 0.974 0.971
ResNet101 0.970 0.984 0.976 0.979 0.985 0.987 0.977
Ensemble-Linear 0.979 0.993 0.984 0.990 0.993 0.996 0.985
Ensemble-Voting 0.974 0.991 0.981 0.989 0.990 0.994 0.982
Table 6-16: AUC Scores with Train/Valid Ratio = 9:1 on NIH Chest X-Ray8 Dataset
AUC Score
CNN Models
Train/Valid Ratio = 9:1
Consolidation Effusion Fibrosis Infiltration Mass Nodule Pleural
Thickening
Raw
Model
VGG16 0.935 0.927 0.855 0.903 0.863 0.831 0.869
VGG19 0.921 0.907 0.865 0.918 0.869 0.824 0.868
Inception V3 0.936 0.917 0.870 0.921 0.859 0.841 0.864
ResNet34 0.935 0.920 0.881 0.918 0.862 0.840 0.875
ResNet50 0.922 0.917 0.865 0.911 0.873 0.839 0.867
ResNet101 0.932 0.921 0.882 0.916 0.880 0.832 0.891
Ensemble-Linear 0.956 0.938 0.900 0.931 0.907 0.866 0.901
Ensemble-Voting 0.952 0.935 0.899 0.929 0.899 0.860 0.898
Modified
Model
VGG16 0.968 0.987 0.971 0.987 0.984 0.989 0.972
VGG19 0.966 0.984 0.972 0.978 0.985 0.989 0.966
Inception V3 0.960 0.982 0.971 0.986 0.984 0.984 0.963
ResNet34 0.960 0.986 0.973 0.985 0.991 0.984 0.966
ResNet50 0.974 0.985 0.974 0.981 0.986 0.986 0.964
ResNet101 0.968 0.985 0.967 0.977 0.989 0.989 0.967
Ensemble-Linear 0.976 0.992 0.983 0.992 0.994 0.996 0.978
Ensemble-Voting 0.973 0.990 0.982 0.991 0.990 0.994 0.972
Modified
Model
Fine-
tuned by
ABC
VGG16 0.960 0.985 0.972 0.984 0.988 0.990 0.963
VGG19 0.945 0.984 0.970 0.987 0.987 0.988 0.974
Inception V3 0.976 0.987 0.970 0.988 0.989 0.992 0.969
ResNet34 0.964 0.987 0.970 0.983 0.985 0.989 0.969
ResNet50 0.963 0.984 0.970 0.983 0.983 0.985 0.974
ResNet101 0.948 0.982 0.966 0.975 0.982 0.983 0.967
Ensemble-Linear 0.973 0.991 0.985 0.994 0.992 0.996 0.979
Ensemble-Voting 0.969 0.990 0.982 0.993 0.990 0.993 0.974
The accuracy results in Table 6-13 indicate that the accuracy achieved by each model for diagnosing
specific TB manifestations among the seven TB-related diseases increases continuously through the
improvement steps. Within each step, the two proposed ensemble models provide stable performance on
both the validation and testing sets, with higher accuracy than any of the base CNN models. The
accuracies achieved by the ensemble models are significantly higher than those of the single CNN
models under all train/valid ratios. The highest accuracies, 91.29% on the validation set and
95.06% on the testing set, are achieved by the linear-average ensemble model with a train/valid
ratio of 9:1.
The AUC scores in Table 6-14, Table 6-15 and Table 6-16 show that among the seven TB-related
diseases, consolidation, infiltration and pleural thickening are less likely to be correctly
detected by the CNN models than the other diseases. The ensemble models still provide stable
performance and a better probability of ranking the correct disease above the others. The
linear-average ensemble model again performs slightly better than the voting-based one.
6.6 Conclusion
In this chapter, the concept and working mechanism of the ensemble model were presented and
implemented to address the instability of individual CNN models during TB detection. In the
experiments, both linear-average and voting-based ensemble models were employed at each improvement
step and compared with the individual models, for both the binary classification of abnormality
detection and the multi-class classification of specific TB diseases. Evaluation metrics were used
to measure the overall performance of each model on the given task.
The experiments show that the ensemble models not only improve detection accuracy but also provide
consistent, stable performance on both the validation and testing sets under all conditions.
Chapter 7
Disease Localization via Class Activation Mapping
The statistical results of the experiments in Chapters 5 and 6 show that the CNN architectures
perform well, with high accuracy, in both TB abnormality detection and the diagnosis of specific TB
manifestations.
However, a CNN is a “black box” model: its opaqueness makes its results hard to interpret, which
greatly limits the application of CNNs to medical image detection. When a model is deployed as a
computer-aided detection system, doctors and radiologists focus not only on the predicted result
but also on which part of the input led the model to its judgment, so that they can understand the
result from the model’s point of view and reach a more accurate conclusion.
Therefore, to ensure the reliability of the results and to better resemble human decision-making,
this chapter discusses and implements Class Activation Mapping (CAM), which reveals the features
extracted by a CNN in an interpretable form. The method is used mainly to visualize the
localization of TB manifestations; results from the different CNN models (VGG16, VGG19, Inception
V3, ResNet34, ResNet50, ResNet101 and the ensemble CNN models) are displayed on CXRs from the NIH
Chest X-Ray8 dataset.
7.1 Class Activation Mapping
The concept of class activation mapping was proposed by Zhou et al. in [71]. The method exploits
the remarkable pattern-recognition and localization ability of CNN models to expose their implicit
attention on the target image. Moreover, with simple processing of a CNN’s internal parameters, it
integrates two different functions, image classification and object localization, in the same
model. The generated attention map identifies the regions of the input image that were the model’s
main criteria during the classification process.
In a CNN model that contains a global average pooling layer, the feature maps produced by the last
convolutional layer are reduced by global average pooling to a vector. This vector is then combined
with the weights of the fully connected layer in a weighted summation to produce the output used
for classification. The weights of the last layer before the output can therefore be projected back
onto the feature maps, through the pooling layer, to identify the areas that the model treated as
containing the important information.
Figure 7-1 illustrates the working mechanism of class activation mapping: the weighted sum of the
final-layer weights and their corresponding feature maps generates the attention map of the input
image.
Figure 7-1: Working mechanism of class activation mapping
For a given input, let \(f_k(x, y)\) denote the activation of the \(k\)-th feature map of the last
convolutional layer at spatial position \((x, y)\). After global average pooling, each feature map
is reduced to:
\[ F_k = \sum_{x,y} f_k(x, y) \]
If the input image belongs to class \(c\), the score that the layer before the softmax assigns to
class \(c\) is:
\[ S_c = \sum_k w_k^c F_k = \sum_k w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y} \sum_k w_k^c f_k(x, y) \]
where \(w_k^c\) is the weight of the \(k\)-th feature map for class \(c\). Since the bias has
little influence on the generation of the attention map, it is ignored in the calculation.
Therefore, from the above calculations, the pixel values of the attention map for class c can be denoted as:

M_c(x, y) = Σ_k w_k^c f_k(x, y)
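This map can be sketched as a tensor contraction over the feature-map axis; the shapes and values below are illustrative, and the final assertion checks the identity S_c = Σ_{x,y} M_c(x, y) implied by the derivation above:

```python
import numpy as np

# Hypothetical activations f_k(x, y): K = 8 feature maps on a 7x7 grid,
# and FC weights w_k^c for one class c (all values illustrative).
f = np.random.rand(7, 7, 8)      # f[x, y, k]
w_c = np.random.rand(8)          # w_k^c

# Attention map: M_c(x, y) = sum_k w_k^c * f_k(x, y)
M_c = np.tensordot(f, w_c, axes=([2], [0]))   # shape: (7, 7)

# Consistency check with the class score S_c = sum_k w_k^c * F_k,
# where F_k = sum_{x,y} f_k(x, y): summing the CAM recovers S_c.
F = f.sum(axis=(0, 1))
S_c = float(w_c @ F)
assert np.isclose(M_c.sum(), S_c)
```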
By upsampling the attention map to the same size as the input image and overlaying one on the other, we obtain a visualization of how much each region of the input contributes to the classification result. This localization of detected objects makes the predictions of the "black box" model more interpretable and helps researchers understand the classification process.
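A minimal sketch of this zoom-and-overlay step, assuming an illustrative 7x7 map and a 224x224 grayscale image, and using nearest-neighbour upsampling in place of the smoother interpolation a real pipeline would typically use:

```python
import numpy as np

# Hypothetical 7x7 attention map and a 224x224 grayscale CXR (illustrative).
cam = np.random.rand(7, 7)
image = np.random.rand(224, 224)

# Nearest-neighbour upsampling to the input size (224 = 7 * 32); a real
# pipeline would typically use bilinear interpolation instead.
scale = image.shape[0] // cam.shape[0]
cam_up = np.kron(cam, np.ones((scale, scale)))        # shape: (224, 224)

# Normalize to [0, 1] and alpha-blend over the image to form the overlay.
cam_norm = (cam_up - cam_up.min()) / (cam_up.max() - cam_up.min())
overlay = 0.6 * image + 0.4 * cam_norm
```

Rendering `overlay` (or `cam_norm` with a heat colormap on top of the image) produces the heat maps shown in the figures below.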
7.2 Experiment Setup
In our experiment, class activation mapping is combined with multi-class classification to localize specific TB manifestations. For each TB-related manifestation (consolidation, effusion, fibrosis, infiltration, mass, nodule and pleural thickening), we run the prediction together with class activation mapping on the six trained CNN models and the two ensemble models introduced in Chapter 6, allowing a parallel comparison of their overall performance on disease prediction and localization.
7.3 Results and Analysis
Figures 7-2 to 7-8 present the prediction and localization results on various test cases of TB manifestations given by the CNN models trained on the NIH ChestX-Ray8 dataset.
Figure 7-2: Diagnosis and localization of consolidation (input image with true label Consolidation; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-3: Diagnosis and localization of effusion (input image with true label Effusion; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-4: Diagnosis and localization of fibrosis (input image with true label Fibrosis; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-5: Diagnosis and localization of infiltration (input image with true label Infiltration; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-6: Diagnosis and localization of mass (input image with true label Mass; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-7: Diagnosis and localization of nodule (input image with true label Nodule; disease prediction and localization by the single CNN models and by the ensembled CNN models)
Figure 7-8: Diagnosis and localization of pleural thickening (input image with true label Pleural Thickening; disease prediction and localization by the single CNN models and by the ensembled CNN models)
From the test cases shown above, comparing the six individual CNN models and their ensemble models on the diagnosis and localization of TB manifestations, we can observe the following:
1. Not all single models provide the correct diagnosis for an input CXR image. For example, when predicting infiltration, two CNN models, VGG16 and ResNet101, both wrongly predict fibrosis. This makes it difficult to reach a diagnostic decision by relying on an individual model, and the choice of which CNN model to use during disease detection also becomes a problem. With the employment of the ensemble models, however, the wrong predictions of individual models are balanced out, so the overall detection accuracy is improved.
2. The localization of disease manifestations by an individual CNN model can be unstable and inaccurate even when the disease diagnosis itself is correct. During the localization of effusion, the manifestation is expected at the bilateral lung tips; ResNet34, however, produces an inaccurate heat map that covers almost the whole lung field of the patient while missing the lower left lung tip. Similar problems occur with ResNet101 on the localization of fibrosis, and with InceptionV3 and ResNet34 on the nodule case shown above. Such partial coverage or omission of the diseased lung regions on a CXR image can greatly confuse users of the computer-aided detection system, and the problem of selecting among different CNN models remains. Ensemble models solve these problems by considering the results of all CNN models and properly integrating them. The disease locations provided by the ensemble models are therefore far more accurate and cover almost all the suspected abnormal regions related to the different TB manifestations.
3. Even when some single CNN models produce mis-predictions, the ensemble models maintain a stable, highly accurate performance on both the diagnosis and the localization of the different TB manifestations.
4. The diagnosis and localization results of the different TB manifestations provided by the two ensemble models, based on the linear average and on the voting mechanism respectively, are identical.
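The two ensemble rules compared here can be sketched as follows, with hypothetical per-class probabilities standing in for the six models' actual outputs:

```python
import numpy as np

# Hypothetical per-class probabilities from six models for one CXR
# (rows: models, columns: 7 TB manifestations; values illustrative).
probs = np.random.rand(6, 7)
probs /= probs.sum(axis=1, keepdims=True)

# Linear-average ensemble: mean the probabilities, then take the argmax.
avg_pred = int(probs.mean(axis=0).argmax())

# Voting ensemble: each model votes for its own argmax class; the class
# with the most votes wins (ties broken by the lowest class index here).
votes = probs.argmax(axis=1)
vote_pred = int(np.bincount(votes, minlength=7).argmax())

print(avg_pred, vote_pred)   # the two rules often, though not always, agree
```

The agreement observed in the experiments is a property of these particular models and test cases; in general the two rules can disagree when the models' confidences are spread unevenly.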
7.4 Conclusion
In this section, we compared the disease localization performance of six single CNN models and two ensemble models on CXR images of seven different TB manifestations.

During the experiment, the predicted results are given with a confidence measure and compared with the true labels. The quality of the class activation mapping during the localization of TB manifestations is evaluated by visually comparing the high-energy regions displayed in the heat map with our prior knowledge of the disease.

The results show a great improvement in the overall performance of the two ensemble models proposed in Chapter 6 compared with the six individual models, for both the diagnosis and the localization of the seven TB manifestations. Moreover, the implementation of class activation mapping provides an effective way of visualizing the location of the suspected abnormality on the CXR and therefore reduces the complexity of visually understanding the CNN.
Chapter 8
Conclusions and Future Work
8.1 Conclusions
The main objective of our study was to create a computer-aided detection system for medical purposes using CNN models. We explored different deep CNN models (VGGNet, the GoogLeNet Inception models and ResNet), which vary in module structure as well as in number of layers. We presented a unified modification to the structure of the last few layers before the output and added an extra fine-tuning step to the training process. Ensemble models based on these improved CNN models were then implemented to further improve the diagnostic accuracy and the overall stability of the computer-aided detection system.
Accuracy, specificity, recall, precision, F1-score and AUC are used to measure the abnormality detection performance of the six deep CNN models and the two ensemble models on CXR images. For the identification of specific TB manifestations, accuracy and AUC are presented and compared. Finally, class activation mapping is implemented on the CNN models to visualize the suspected disease locations on the CXRs.
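For the binary abnormality detection task, most of these metrics follow directly from the confusion-matrix counts; the labels and predictions below are illustrative only (AUC, which requires ranking the prediction scores, is omitted from this sketch):

```python
import numpy as np

# Illustrative binary labels (1 = abnormal) and predictions for 10 CXRs.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

# Confusion-matrix counts.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

accuracy    = (tp + tn) / len(y_true)
recall      = tp / (tp + fn)          # sensitivity
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
f1          = 2 * precision * recall / (precision + recall)
print(accuracy, recall, specificity, precision, f1)
```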
The experiments were run on three publicly available CXR datasets. The two small datasets are mainly used for abnormality detection, predicting whether the patient is TB positive from the input CXR. The largest dataset is used both for abnormality detection and for the diagnosis of specific diseases among the seven TB manifestations.
Our results show that as the improvement steps are superimposed, the overall performance of the CNN models keeps getting better. Among the three steps, the structure modifications produce the largest increase in prediction accuracy for the single CNN models, while the fine-tuning step yields a slight further improvement. Taking the ensemble of the individual CNN models improves the classification accuracy on CXRs further still. Moreover, while each individual model performs inconsistently across datasets and classification tasks, the ensemble models reach the highest classification accuracy and a greatly improved stability of performance. For the disease localization task as well, the ensemble models give the best results compared with the single CNN models. In conclusion, the combination of the three improvement steps greatly improves the overall performance of our proposed computer-aided detection system for TB diagnosis and localization.
The contributions of the thesis can be summarized as follows:
1) We selected three CXR datasets of various sizes from various sources to mitigate the potential problems of a single dataset, such as unrepresentative data or data limited to abnormality detection tasks only.
2) A unified standard data preprocessing procedure was adopted, involving the removal of poor-quality images, cropping of images with large black backgrounds, image enhancement using the CLAHE algorithm, and image resizing, thereby improving image quality and eliminating unnecessary errors caused by variation in image quality.
3) Unified modifications to the CNN models were made and different learning rates were applied to different layer groups inside each model, which provided improvements on all 3 datasets for both lung abnormality detection and the detection of TB-related manifestations in the NIH chest X-ray dataset.
4) To maximize the performance of the CNN models on each diagnostic task, Artificial Bee
Colony (ABC) optimization was implemented as an additional fine-tuning step.
5) Linear average based and voting based ensemble learning methods were used to combine the
results from each model into an aggregated result to prevent the overfitting problem and to
improve the stability of the models’ performance.
6) A class activation mapping algorithm using the CNN's built-in attention mechanism was implemented to localize the suspected area of the detected disease on the CXR, for better interpretation of the diagnostic results. The visualization of the suspected disease area can help clinicians confirm the disease and catch information that unsuspecting eyes might miss.
7) The proposed system achieves an accuracy of 100% and an AUC of 1.0 on the Montgomery CXR dataset for the abnormality detection task with all 3 training/validation ratios, which is the best performance compared with the similar work on these datasets reviewed in Chapter 2. For the abnormality detection task on the Shenzhen Hospital dataset, an overall accuracy of over 94% and an average AUC of over 0.99 were achieved with all 3 training/validation ratios, again the best performance among the experiments reported by other researchers. On the NIH Chest X-Ray dataset, the abnormality detection accuracy ranges from 89.56% to 95.49% for the training/validation ratios 7:3, 8:2 and 9:1, with an average AUC of over 0.93. For the detection of the 7 TB-related lung diseases on the same dataset, an overall accuracy of 90% was achieved with all 3 training/validation ratios, and an AUC of 0.97 was achieved for each TB-related lung disease. This performance is so far the best compared with similar work done either on the same dataset or on other large CXR datasets.
8.2 Future work
Some aspects were not covered, owing to lack of time and of support from experts in the medical field. During the fine-tuning process, the computational complexity and the computation time were not measured, and little was done to decrease the computing time of the models' training process. In addition, for disease localization, we did not compare class activation mapping in CNN models against other object detection methods such as single-shot detection [72] and Faster R-CNN [73], since the latter algorithms require the specific disease manifestation regions on the CXR images to be annotated by radiologists.
References
[1] World Health Organization, 2018. Global tuberculosis report 2018. Geneva: World Health
Organization.
[2] Adler D, Richards WF. Consolidation in primary pulmonary tuberculosis. Thorax. 1953
Sep;8(3):223.
[3] Vorster MJ, Allwood BW, Diacon AH, Koegelenberg CF. Tuberculous pleural effusions:
advances and controversies. Journal of thoracic disease. 2015 Jun;7(6):981.
[4] Chung MJ, Goo JM, Im JG. Pulmonary tuberculosis in patients with idiopathic pulmonary
fibrosis. European journal of radiology. 2004 Nov 1;52(2):175-9.
[5] Mishin V, Nazarova NV, Kononen AS, Miakishev TV, Sadovski AI. Infiltrative pulmonary
tuberculosis: course and efficiency of treatment. Problemy tuberkuleza i boleznei legkikh.
2006(10):7-12.
[6] Cherian MJ, Dahniya MH, Al‐Marzouk NF, Abel A, Bader S, Buerki K, Mahdi OZ.
Pulmonary tuberculosis presenting as mass lesions and simulating neoplasms in adults.
Australasian radiology. 1998 Nov;42(4):303-8.
[7] Kant S, Kushwaha R, Verma SK. Bilateral nodular pulmonary tuberculosis simulating
metastatic lung cancer. The Internet Journal of Pulmonary Medicine. 2007;8.
[8] Gil V, Soler JJ, Cordero PJ. Pleural Thickening in Patients With Pleural Tuberculosis.
Chest. 1994 Apr 1;105(4):1296.
[9] Lowe, D.G., 1999, September. Object recognition from local scale-invariant features.
In Proceedings of the International Conference on Computer Vision (ICCV) (Vol. 2, pp. 1150-1157).
[10] Ojala, T., Pietikäinen, M. and Mäenpää, T., 2002. Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. IEEE Transactions on Pattern
Analysis & Machine Intelligence, (7), pp.971-987.
[11] Basavaprasad, B. and Ravi, M., 2014. A study on the importance of image processing and
its applications. IJRET: International Journal of Research in Engineering and
Technology, 3.
[12] Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and
problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based
Systems. 1998 Apr;6(02):107-16.
[13] Pascanu R, Mikolov T, Bengio Y. On the difficulty of training recurrent neural networks.
In International Conference on Machine Learning 2013 Feb 13 (pp. 1310-1318).
[14] Huang Z, Pan Z, Lei B. Transfer learning with deep convolutional neural network for SAR
target classification with limited labeled data. Remote Sensing. 2017 Aug 31;9(9):907.
[15] Pan SJ, Yang Q. A survey on transfer learning. IEEE Transactions on knowledge and data
engineering. 2010 Oct 1;22(10):1345-59.
[16] Khuzi, A.M., Besar, R., Zaki, W.W. and Ahmad, N.N., 2009. Identification of masses in
digital mammogram using gray level co-occurrence matrices. Biomedical imaging and
intervention journal, 5(3).
[17] Carrillo-de-Gea, J.M., García-Mateos, G., Fernández-Alemán, J.L. and Hernández-
Hernández, J.L., 2016. A computer-aided detection system for digital chest
radiographs. Journal of healthcare engineering, 2016.
[18] Yang, M.C., Moon, W.K., Wang, Y.C.F., Bae, M.S., Huang, C.S., Chen, J.H. and Chang,
R.F., 2013. Robust texture analysis using multi-resolution gray-scale invariant features for
breast sonographic tumor diagnosis. IEEE Transactions on Medical Imaging, 32(12),
pp.2262-2273.
[19] Sarraf, S. and Tofighi, G., 2016. DeepAD: Alzheimer's disease classification via deep
convolutional neural networks using MRI and fMRI. BioRxiv, p.070441.
[20] Huynh, B.Q., Li, H. and Giger, M.L., 2016. Digital mammographic tumor classification
using transfer learning from deep convolutional neural networks. Journal of Medical
Imaging, 3(3), p.034501.
[21] Zou, L., Zheng, J., Miao, C., Mckeown, M.J. and Wang, Z.J., 2017. 3D CNN based
automatic diagnosis of attention deficit hyperactivity disorder using functional and
structural MRI. IEEE Access, 5, pp.23626-23636.
[22] Anthimopoulos, M., Christodoulidis, S., Ebner, L., Christe, A. and Mougiakakou, S., 2016.
Lung pattern classification for interstitial lung diseases using a deep convolutional neural
network. IEEE transactions on medical imaging, 35(5), pp.1207-1216.
[23] Kim, G.B., Jung, K.H., Lee, Y., Kim, H.J., Kim, N., Jun, S., Seo, J.B. and Lynch, D.A.,
2018. Comparison of shallow and deep learning methods on classifying the regional pattern
of diffuse lung disease. Journal of digital imaging, 31(4), pp.415-424.
[24] Jaiswal, A.K., Tiwari, P., Kumar, S., Gupta, D., Khanna, A. and Rodrigues, J.J., 2019.
Identifying Pneumonia in Chest X-Rays: A Deep Learning Approach. Measurement.
[25] Lakhani, P. and Sundaram, B., 2017. Deep learning at chest radiography: automated
classification of pulmonary tuberculosis by using convolutional neural
networks. Radiology, 284(2), pp.574-582.
[26] Pasa, F., Golkov, V., Pfeiffer, F., Cremers, D. and Pfeiffer, D., 2019. Efficient Deep
Network Architectures for Fast Chest X-Ray Tuberculosis Screening and
Visualization. Scientific reports, 9(1), p.6268.
[27] Hwang, S., Kim, H.E., Jeong, J. and Kim, H.J., 2016, March. A novel approach for
tuberculosis screening based on deep convolutional neural networks. In Medical imaging
2016: computer-aided diagnosis (Vol. 9785, p. 97852W). International Society for Optics
and Photonics.
[28] Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A.,
Langlotz, C., Shpanskaya, K. and Lungren, M.P., 2017. Chexnet: Radiologist-level
pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225.
[29] Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D. and
Summers, R.M., 2016. Deep convolutional neural networks for computer-aided detection:
CNN architectures, dataset characteristics and transfer learning. IEEE transactions on
medical imaging, 35(5), pp.1285-1298.
[30] Flach, P.A., 2003. The geometry of ROC space: understanding machine learning metrics
through ROC isometrics. In Proceedings of the 20th international conference on machine
learning (ICML-03) (pp. 194-201).
[31] Stallkamp, J., Schlipsing, M., Salmen, J. and Igel, C., 2012. Man vs. computer:
Benchmarking machine learning algorithms for traffic sign recognition. Neural
networks, 32, pp.323-332.
[32] Vidyasaraswathi HN, Hanumantharaju MC. Review of Various Histogram Based Medical
Image Enhancement Techniques. In Proceedings of the 2015 International Conference on
Advanced Research in Computer Science Engineering & Technology (ICARCSET 2015)
2015 Mar 6 (p. 48). ACM.
[33] Rajaraman, S., Candemir, S., Xue, Z., Alderson, P.O., Kohli, M., Abuya, J., Thoma, G.R.
and Antani, S., 2018, July. A novel stacked generalization of models for improved TB
detection in chest radiographs. In 2018 40th Annual International Conference of the IEEE
Engineering in Medicine and Biology Society (EMBC) (pp. 718-721). IEEE.
[34] Rere, L.M., Fanany, M.I. and Arymurthy, A.M., 2016. Metaheuristic algorithms for
convolution neural network. Computational intelligence and neuroscience, 2016.
[35] Parmaksızoğlu, S. and Alçı, M., 2011. A novel cloning template designing method by using
an artificial bee colony algorithm for edge detection of cnn based imaging
sensors. Sensors, 11(5), pp.5337-5359.
[36] Khan, S., Khan, A., Maqsood, M., Aadil, F. and Ghazanfar, M.A., 2019. Optimized gabor
feature extraction for mass classification using cuckoo search for big data e-
healthcare. Journal of Grid Computing, 17(2), pp.239-254.
[37] Islam, M.T., Aowal, M.A., Minhaz, A.T. and Ashraf, K., 2017. Abnormality detection and
localization in chest x-rays using deep convolutional neural networks. arXiv preprint
arXiv:1705.09850.
[38] Kwaśniewska, A., Rumiński, J. and Rad, P., 2017, July. Deep features class activation map
for thermal face detection and tracking. In 2017 10th International Conference on Human
System Interactions (HSI) (pp. 41-47). IEEE.
[39] Jaeger, S., Candemir, S., Antani, S., Wáng, Y.X.J., Lu, P.X. and Thoma, G., 2014. Two
public chest X-ray datasets for computer-aided screening of pulmonary
diseases. Quantitative imaging in medicine and surgery, 4(6), p.475.
[40] Jaeger, S., Karargyris, A., Candemir, S., Folio, L., Siegelman, J., Callaghan, F., Xue, Z.,
Palaniappan, K., Singh, R.K., Antani, S. and Thoma, G., 2013. Automatic tuberculosis
screening using chest radiographs. IEEE transactions on medical imaging, 33(2), pp.233-
245.
[41] Candemir, S., Jaeger, S., Palaniappan, K., Musco, J.P., Singh, R.K., Xue, Z., Karargyris,
A., Antani, S., Thoma, G. and McDonald, C.J., 2013. Lung segmentation in chest
radiographs using anatomical atlases with nonrigid registration. IEEE transactions on
medical imaging, 33(2), pp.577-590.
[42] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M. and Summers, R.M., 2017. ChestX-Ray8:
Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification
and localization of common thorax diseases. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 2097-2106).
[43] Shin, H.C., Roberts, K., Lu, L., Demner-Fushman, D., Yao, J. and Summers, R.M., 2016.
Learning to read chest x-rays: Recurrent neural cascade model for automated image
annotation. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 2497-2506).
[44] Pizer, S.M., Amburn, E.P., Austin, J.D., Cromartie, R., Geselowitz, A., Greer, T., ter Haar
Romeny, B., Zimmerman, J.B. and Zuiderveld, K., 1987. Adaptive histogram equalization
and its variations. Computer vision, graphics, and image processing, 39(3), pp.355-368.
[45] Pisano, E.D., Zong, S., Hemminger, B.M., DeLuca, M., Johnston, R.E., Muller, K.,
Braeuning, M.P. and Pizer, S.M., 1998. Contrast limited adaptive histogram equalization
image processing to improve the detection of simulated spiculations in dense
mammograms. Journal of Digital imaging, 11(4), p.193.
[46] Hinton, G.E. and Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with
neural networks. science, 313(5786), pp.504-507.
[47] Bengio, Y., 2009. Learning deep architectures for AI. Foundations and trends® in Machine
Learning, 2(1), pp.1-127.
[48] Zhang, W., Tanida, J., Itoh, K. and Ichioka, Y., 1988. Shift-invariant pattern recognition
neural network and its optical architecture. In Proceedings of annual conference of the
Japan Society of Applied Physics.
[49] Fukushima, K., 1980. Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position. Biological
Cybernetics, 36(4), pp.193-202.
[50] Simonyan, K., Vedaldi, A. and Zisserman, A., 2013. Deep inside convolutional networks:
Visualising image classification models and saliency maps. arXiv preprint
arXiv:1312.6034.
[51] Zeiler, M.D. and Fergus, R., 2014, September. Visualizing and understanding
convolutional networks. In European conference on computer vision (pp. 818-833).
Springer, Cham.
[52] Zeiler, M.D. and Fergus, R., 2013. Stochastic pooling for regularization of deep
convolutional neural networks. arXiv preprint arXiv:1301.3557.
[53] Boureau, Y.L., Le Roux, N., Bach, F., Ponce, J. and LeCun, Y., 2011, November. Ask the
locals: multi-way local pooling for image recognition. In ICCV'11-The 13th International
Conference on Computer Vision.
[54] Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556.
[55] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V. and Rabinovich, A., 2015. Going deeper with convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition (pp. 1-9).
[56] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z., 2016. Rethinking the
inception architecture for computer vision. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 2818-2826).
[57] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp.
770-778).
[58] Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D. and
Summers, R.M., 2016. Deep convolutional neural networks for computer-aided detection:
CNN architectures, dataset characteristics and transfer learning. IEEE transactions on
medical imaging, 35(5), pp.1285-1298.
[59] Pan, S.J. and Yang, Q., 2009. A survey on transfer learning. IEEE Transactions on
knowledge and data engineering, 22(10), pp.1345-1359.
[60] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K. and Fei-Fei, L., 2009, June. Imagenet: A
large-scale hierarchical image database. In 2009 IEEE conference on computer vision and
pattern recognition (pp. 248-255).
[61] Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[62] Karaboga, D., 2005. An idea based on honey bee swarm for numerical optimization (Vol.
200). Technical report-tr06, Erciyes University, engineering faculty, computer engineering
department.
[63] Karaboga, D. and Basturk, B., 2007. A powerful and efficient algorithm for numerical
function optimization: artificial bee colony (ABC) algorithm. Journal of global
optimization, 39(3), pp.459-471.
[64] Bullinaria, J.A. and AlYahya, K., 2014. Artificial bee colony training of neural networks.
In Nature Inspired Cooperative Strategies for Optimization (NICSO 2013) (pp. 191-201).
Springer, Cham.
[65] Oza, N.C. and Tumer, K., 2008. Classifier ensembles: Select real-world
applications. Information Fusion, 9(1), pp.4-20.
[66] Breiman, L., 1996. Bagging predictors. Machine learning, 24(2), pp.123-140.
[67] Freund, Y. and Schapire, R.E., 1997. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of computer and system sciences, 55(1),
pp.119-139.
[68] Wolpert, D.H., 1992. Stacked generalization. Neural networks, 5(2), pp.241-259.
[69] Kuncheva, L.I. and Whitaker, C.J., 2003. Measures of diversity in classifier ensembles and
their relationship with the ensemble accuracy. Machine learning, 51(2), pp.181-207.
[70] Zhou, Z.H., Wu, J. and Tang, W., 2002. Ensembling neural networks: many could be better
than all. Artificial intelligence, 137(1-2), pp.239-263.
[71] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A., 2016. Learning deep
features for discriminative localization. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 2921-2929).
[72] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., 2016,
October. Ssd: Single shot multibox detector. In European conference on computer
vision (pp. 21-37). Springer, Cham.
[73] Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster r-cnn: Towards real-time object
detection with region proposal networks. In Advances in neural information processing
systems (pp. 91-99).
APPENDIX
Appendix A: Public Chest X-Ray Datasets
The Montgomery County Chest X-Ray dataset and the Shenzhen Hospital Chest X-Ray dataset are
available at: https://ceb.nlm.nih.gov/repositories/tuberculosis-chest-x-ray-image-data-sets/
The NIH ChestX-Ray8 dataset and its detailed annotations are available at:
https://www.kaggle.com/nih-chest-xrays/datasets
Appendix B: Thesis Source Code
The thesis source code and result displays are available at:
https://drive.google.com/open?id=1jrMz7nHhWlxZdsz4sWhWlbyIz3_s-ybF