MULTIPLE FACTOR ANALYSIS WITH ESTIMATED DATA
AN APPLICATION IN TUBERCULOSIS PREVALENCE
Dec 4, 2014
ISyE 6405 Fall 2014 Project
Chaoyi Wu, Farida Jariwala
OVERVIEW
Problem Statement
• Background and motivation
• Challenges
Methodology
• Extend PCA to MFA (multiple factor analysis)
• Consider interval data in MFA
Analysis
• Data
• MFA modeling in R: package “FactoMineR”
• Output analysis
Conclusion and next steps
Reference
Problem Statement
Background and motivation
Tuberculosis (TB) is an infectious bacterial disease that most commonly
affects the lungs. It is second only to HIV/AIDS as the greatest killer
worldwide due to a single infectious agent. In 2013, 9 million people fell ill
with TB and 1.5 million died from the disease. People living with HIV are 26-31
times more likely to develop TB than people without HIV. (WHO 2014)
Identifying patterns among TB prevalence, HIV cases, healthcare resources and
other factors can help curb TB prevalence.
Challenges
• The number of potential factors that affect TB is large
o Healthcare input: TB immunization, TB testing, general health resource expenditure
o HIV
o Tobacco use
o Other
→ group the factors (variables) into categories and reduce dimensions with MFA
• The data (e.g. population, number of cases) are estimates reported with uncertainty intervals
→ include the intervals in the pattern analysis using the vertices method of symbolic PCA (V-SPCA)
Snapshot from the WHO dataset: http://apps.who.int/gho/data
Methodology
• Multiple factor analysis (MFA)
MFA analyzes observations described by several groups (sets) of variables in two steps:
(1) A PCA is performed on each group, and the group is then normalized: every variable in the group receives the same weight, the inverse of the largest eigenvalue of that group's PCA.
(2) The normalized data sets are merged into a single matrix and a global PCA is performed on this matrix.
Variables can be continuous or categorical, but all variables within one group must have the same type. (Abdi and Valentin, 2007)
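The group normalization in step (1) can be sketched as follows. This is a minimal illustration in plain Python, not the project's R code: it assumes standardized continuous variables, takes the group weight to be the inverse of the largest eigenvalue of the group's correlation matrix (the normalization described by Abdi and Valentin, 2007), and finds that eigenvalue by power iteration. The function names and the toy data are hypothetical.

```python
import math
import statistics

def standardize(column):
    """Center a column and scale it to unit (population) variance."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [(x - mean) / sd for x in column]

def correlation_matrix(columns):
    """Correlation matrix of the variables (columns) of one group."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def largest_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric positive semi-definite matrix.

    Power iteration; the uneven start vector is an arbitrary choice made
    to avoid starting orthogonal to the leading eigenvector.
    """
    size = len(matrix)
    v = [1.0 / (i + 1) for i in range(size)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
    return sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient

def mfa_weights(groups):
    """Map each group name to 1 / (largest eigenvalue of its PCA)."""
    return {name: 1.0 / largest_eigenvalue(correlation_matrix(cols))
            for name, cols in groups.items()}

# Hypothetical toy data: group A holds two perfectly correlated variables,
# group B holds a single variable.
toy_groups = {
    "A": [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]],
    "B": [[3.0, 1.0, 4.0, 1.5]],
}
weights = mfa_weights(toy_groups)
print(weights)
```

In this toy case group A gets half the weight of group B, so two redundant variables count no more in the global PCA than one variable does. That balancing is the purpose of the normalization.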
• Use V-SPCA to deal with interval data in MFA
Vertices method symbolic PCA (V-SPCA) performs a classical PCA on interval data. Given a dataset 𝑿 that contains 𝑵 observations described by 𝒑 variables of interval type, a new dataset 𝑿^𝑽 is derived by replacing each observation with the 2^𝒑 vertices of its hyper-rectangle, and the classical PCA is run on 𝑿^𝑽. (Zuccolotto, 2006)
The method is still valid for MFA.
Interval data:
Country  Prevalent TB cases  Number of adults aged 15 and over living with HIV
Chile    3500 [1500-6400]    39 000 [25 000-61 000]
Derived vertex data:
Country  Prevalent TB cases  Number of adults aged 15 and over living with HIV
Chile1   1500                25 000
Chile2   1500                61 000
Chile3   6400                25 000
Chile4   6400                61 000
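The vertex expansion for the Chile example can be sketched in a few lines. This is an illustrative Python snippet, not the project's code; `vertex_rows` is a hypothetical helper that pairs every bound of one interval with every bound of the others, giving 2^p rows for p interval variables.

```python
from itertools import product

def vertex_rows(name, intervals):
    """Return the 2**p vertex rows of one observation.

    `intervals` is a list of (lower, upper) pairs, one per interval variable.
    Rows are labelled name1, name2, ... in the order the vertices are generated.
    """
    rows = []
    for k, corner in enumerate(product(*intervals), start=1):
        rows.append((f"{name}{k}", *corner))
    return rows

# Chile: prevalent TB cases [1500-6400], adults living with HIV [25 000-61 000]
chile_rows = vertex_rows("Chile", [(1500, 6400), (25000, 61000)])
for row in chile_rows:
    print(row)
```

With two interval variables each country contributes four rows, so the derived dataset has four times as many rows as the original.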
Analysis
Data structure
• 7 countries: Sweden, Malaysia, Hungary, Sri Lanka, Chile, Mexico
• 19 variables in 8 groups; the data for TB and HIV are estimates with confidence intervals.
Group  Variable name       Note
TB     TB                  Prevalent TB cases in 2012 / total population (%)
HIV    HIV                 15+ living with HIV / 15+ population (%)
PM     PM10                2012 PM10 (annual mean, µg/m³)
TST    DST 2012            2012 Laboratories providing DST (drug susceptibility testing) (per 5 million population)
TST    DST 2011            2011 Laboratories providing DST (drug susceptibility testing) (per 5 million population)
TST    DST 2010            2010 Laboratories providing DST (drug susceptibility testing) (per 5 million population)
TST    TB dgns clt 2012    2012 Laboratories providing TB diagnostic services using culture (per 5 million population)
TST    TB dgns clt 2011    2011 Laboratories providing TB diagnostic services using culture (per 5 million population)
TST    TB dgns clt 2010    2010 Laboratories providing TB diagnostic services using culture (per 5 million population)
TST    TB dgns micro 2012  2012 Laboratories providing TB diagnostic services using sputum smear microscopy (per 100 000 population)
TST    TB dgns micro 2011  2011 Laboratories providing TB diagnostic services using sputum smear microscopy (per 100 000 population)
TST    TB dgns micro 2010  2010 Laboratories providing TB diagnostic services using sputum smear microscopy (per 100 000 population)
GNI    GNI                 2012 Gross national income per capita (current USD)
HC     Health exps 2010    2010 Health expenditure (public) per capita
HC     Health exps 2011    2011 Health expenditure (public) per capita
HC     Health exps 2012    2012 Health expenditure (public) per capita
BCG    BCG 1 y Imz 1992    1992 Immunization, BCG (% of one-year-old children)
BCG    BCG 1 y Imz 2012    2012 Immunization, BCG (% of one-year-old children)
SMK    SMK 2011            2011 Smoking prevalence (% of adults)
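The group layout in the table can be written down as a list of group sizes in column order, which is the form FactoMineR's `MFA` function expects for its `group` argument. The Python sketch below only illustrates that bookkeeping and is not the authors' R script; `column_ranges` is a hypothetical helper, and the sizes are counted from the table above.

```python
# Number of variables per group, in the column order of the table
group_sizes = {
    "TB": 1, "HIV": 1, "PM": 1, "TST": 9,
    "GNI": 1, "HC": 3, "BCG": 2, "SMK": 1,
}

def column_ranges(sizes):
    """Map each group to its half-open (start, stop) column range."""
    ranges = {}
    start = 0
    for name, size in sizes.items():
        ranges[name] = (start, start + size)
        start += size
    return ranges

ranges = column_ranges(group_sizes)
total_variables = sum(group_sizes.values())
print(total_variables, ranges["TST"])
```

The sizes add up to the 19 variables and 8 groups stated on the slide, which is a quick consistency check before fitting the model.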
Output analysis
• MFA with mean estimation
Eigenvalues          Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance             3.792   2.496   1.115   0.667   0.567   0.036
% of var             43.719  28.781  12.857  7.695   6.538   0.41
Cumulative % of var  43.719  72.5    85.357  93.052  99.59   100
Groups  Dim.1  cos2   Dim.2  cos2   Dim.3  cos2
TB      0.058  0.003  0.783  0.614  0.137  0.019
HIV     0.000  0.000  0.798  0.637  0.189  0.036
PM      0.560  0.314  0.002  0.000  0.007  0.000
TST     0.399  0.125  0.340  0.091  0.272  0.058
GNI     0.948  0.899  0.004  0.000  0.003  0.000
HC      0.969  0.939  0.000  0.000  0.007  0.000
BCG     0.829  0.688  0.105  0.011  0.007  0.000
SMK     0.027  0.001  0.463  0.215  0.495  0.245
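The "% of var" and "Cumulative % of var" rows follow directly from the eigenvalues. The Python sketch below recomputes them for the mean-estimation case, assuming the six listed eigenvalues make up the whole spectrum (the cumulative row reaches 100 at Dim.6). Because the slide rounds the eigenvalues to three decimals, the recomputed percentages can differ from the slide in the last digit.

```python
from itertools import accumulate

# Eigenvalues of the MFA with mean estimation, as listed on the slide
eigenvalues = [3.792, 2.496, 1.115, 0.667, 0.567, 0.036]

def explained_variance(eigs):
    """Return (percent, cumulative percent) of variance per dimension."""
    total = sum(eigs)
    percent = [100.0 * e / total for e in eigs]
    cumulative = list(accumulate(percent))
    return percent, cumulative

percent, cumulative = explained_variance(eigenvalues)
for dim, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"Dim.{dim}: {p:.3f} % (cumulative {c:.3f} %)")
```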
• MFA with interval estimation
Eigenvalues          Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance             3.794   2.38    1.087   0.658   0.57    0.149
% of var             43.749  27.445  12.532  7.59    6.572   1.72
Cumulative % of var  43.749  71.194  83.726  91.316  97.888  99.608
Groups  Dim.1  cos2   Dim.2  cos2   Dim.3  cos2
TB      0.059  0.003  0.688  0.474  0.148  0.022
HIV     0.000  0.000  0.719  0.517  0.207  0.043
PM      0.564  0.318  0.005  0.000  0.002  0.000
TST     0.402  0.127  0.358  0.100  0.259  0.053
GNI     0.944  0.890  0.004  0.000  0.002  0.000
HC      0.966  0.934  0.000  0.000  0.006  0.000
BCG     0.834  0.695  0.102  0.010  0.005  0.000
SMK     0.026  0.001  0.505  0.255  0.457  0.208
• Comparison
Cumulative % of var  Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Mean                 43.719  72.5    85.357  93.052  99.59   100
Interval             43.749  71.194  83.726  91.316  97.888  99.608
        Dim.1            Dim.2
Groups  Mean   Interval  Mean   Interval
TB      0.058  0.059     0.783  0.688
HIV     0.000  0.000     0.798  0.719
PM      0.560  0.564     0.002  0.005
TST     0.399  0.402     0.340  0.358
GNI     0.948  0.944     0.004  0.004
HC      0.969  0.966     0.000  0.000
BCG     0.829  0.834     0.105  0.102
SMK     0.027  0.026     0.463  0.505
The comparison shows that the two MFAs are almost the same. Why?
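To put a number on "almost the same", the group coordinates of the two fits can be compared dimension by dimension. The Python sketch below uses the Dim.1 and Dim.2 values from the comparison table; `largest_shift` is a hypothetical helper, and this check is an illustration rather than part of the original analysis.

```python
# (Dim.1, Dim.2) group coordinates from the comparison table
mean_coords = {
    "TB": (0.058, 0.783), "HIV": (0.000, 0.798), "PM": (0.560, 0.002),
    "TST": (0.399, 0.340), "GNI": (0.948, 0.004), "HC": (0.969, 0.000),
    "BCG": (0.829, 0.105), "SMK": (0.027, 0.463),
}
interval_coords = {
    "TB": (0.059, 0.688), "HIV": (0.000, 0.719), "PM": (0.564, 0.005),
    "TST": (0.402, 0.358), "GNI": (0.944, 0.004), "HC": (0.966, 0.000),
    "BCG": (0.834, 0.102), "SMK": (0.026, 0.505),
}

def largest_shift(a, b, dim):
    """Return (group, absolute difference) for the group that moves most."""
    diffs = {g: abs(a[g][dim] - b[g][dim]) for g in a}
    group = max(diffs, key=diffs.get)
    return group, diffs[group]

dim1_group, dim1_shift = largest_shift(mean_coords, interval_coords, 0)
dim2_group, dim2_shift = largest_shift(mean_coords, interval_coords, 1)
print(f"Dim.1: {dim1_group} moves by {dim1_shift:.3f}")
print(f"Dim.2: {dim2_group} moves by {dim2_shift:.3f}")
```

On Dim.1 no group moves by more than 0.005. On Dim.2 the largest shifts belong to TB (0.095) and HIV (0.079), the two groups that carry interval data, which is where a difference between the two fits would be expected to show up.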
• Pattern analysis with interval data
o The first two dimensions account for more than 70% of the variance
o {HIV, TB, SMK} and {PM, BCG, HC, GNI} are almost orthogonal
1. TB is almost absent from the 1st dimension
2. The variables in the latter set do not have a strong correlation with TB
o The individual contributions to the 2nd dimension call for further investigation:
the dimension accounts for the difference between TB, HIV and tobacco use
Guess: smoking prevalence in 2011 is not a reasonable factor
It confirms that HIV and TB are highly correlated
o The first dimension can be considered a score for economics: the higher, the better
(see the individual factor map for the MFA with interval estimation)
Variable  Contribution to Dim.2
TB        -0.948
HIV       -0.794
SMK.2011  0.71
Variable            Contribution to Dim.1
PM10                -0.751
DST.2012            0.656
DST.2011            0.656
DST.2010            0.67
TB.dgns.clt.2012    -0.1
TB.dgns.clt.2011    -0.093
TB.dgns.clt.2010    -0.117
TB.dgns.micro.2012  -0.532
TB.dgns.micro.2011  -0.499
TB.dgns.micro.2010  -0.556
GNI                 0.971
Health.exps.2010    0.985
Health.exps.2011    0.984
Health.exps.2012    0.98
BCG.1.y.Imz.1992    -0.87
BCG.1.y.Imz.2012    -0.94
Conclusion and next steps
• HIV and TB are correlated
• V-SPCA doesn’t make a difference in this project
• Factor (variable) selection is to be improved
• Model validation needs to be conducted
• Analysis of the dimensions can be more detailed
Reference
Abdi, H. and Valentin, D. (2007). Multiple Factor Analysis. In Neil Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks: Sage.
Zuccolotto, P. (2006). Principal Components of Sample Estimates: an Approach through Symbolic Data Analysis. Stat Meth & Appl. 16. 173-192. Springer (Verlag). DOI: 10.1007/s10260-006-0024-6.
Le, S., Josse, J. and Husson, F. (2008). FactoMineR: an R package for multivariate analysis. Journal of Statistical Software. 25(1). American Statistical Association.
WHO. (2014). World Health Organization/Media centre/Tuberculosis: http://www.who.int/mediacentre/factsheets/fs104/en/
The World Bank. (2014). The World Bank/Data: http://www.worldbank.org/
Thank you
Q & A