Predictive statistical modelling approach to estimating TB burden
Sandra Alba, Ente Rood, Masja Straetemans and Mirjam Bakker
Overall aim, interim results
Overall aim of predictive models:
1. To enable predictions of TB incidence, prevalence and mortality
- from 1990 to 2015
- for a selection of countries for which the Task Force is mandated to produce estimates
2. To identify a set of conditions which warrant or do not warrant the use of these models
Focus of this presentation:
- Development of database and model structures
- to enable predictions for 2013
Approach to predictive modelling
Explanatory models → correctly capture causal pathways
Predictive models → maximize predictive power (training set)
Source: Shmueli, "To explain or to predict?"
Task 1: Incidence
Training set: use data from robust surveillance systems (mostly from high-income countries)
Predictions: incidence in middle/high income countries relying on indirect incidence estimates
Incidence: Training set vs. predictions
Task 2: Prevalence
Training set: use data from recent national TB prevalence surveys
Predictions: prevalence for countries where national surveys have not been implemented
- low/middle income countries
- with predicted prevalence of over 0.1%
Prevalence: Training set vs. predictions
Task 3: Mortality
Training set: use data from countries using vital registration systems to estimate mortality (mostly from middle and high-income countries)
Predictions: mortality in countries without vital registration data (mostly low-income countries)
Mortality: Training set vs. predictions
Conceptual framework
TB outcome data
- TB case notification
- MDR TB
TB programmatic determinants
- Weak health system, case finding ability (suspects, diagnostics)
- Poor access to TB services, treatment outcomes
- Inappropriate health seeking behavior
- BCG vaccination coverage in children
Co-morbidities
- HIV
- Poor nutritional status
- Diabetes
- Lung diseases
Socio environmental factors
- Weather: humidity and temperature
- High risk groups: prisoners, homeless people, migrants, drug addicts, refugees, IDP
- Urbanization: population density, poor water source and sanitation, crowded living conditions, poor ventilation, indoor air pollution
- Smoking, alcoholism
- Aging populations
Database compilation
Predictor variables
National level:
- Tuberculosis Monitoring and Evaluation (TME)
- Global Health Repository (GHR)
- World Bank
- UNICEF reports (BCG prevalence)
- International Diabetes Federation
Subnational level:
- National statistical agencies (e.g. census)
- Multiple Indicator Cluster Surveys (MICS)
- Demographic and Health Surveys (DHS)
- District level surveys (India)
- NTPs (TB notifications)
Outcome variables
- National estimates: WHO TB database (Global TB data collection system)
- Subnational estimates (prevalence): survey reports/collaborators
Database completeness
WHO TB estimates were produced in 2013 for 217 countries
- countries with complete set of predictors in 2013: 166
Missing data imputation
1. Predictor data only available at set intervals (e.g. every 5 years)
- missing values imputed using a linear imputation
- start and end observations used as anchor points
2. Predictor data missing for the most recent years only
- linear trend imputation to extrapolate the existing series
- only if fit R-sq > 90%
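The two imputation rules above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of `None` for missing years, and the ordinary least squares trend fit are all assumptions made for the example.

```python
def interpolate_gaps(series):
    """Rule 1 (sketch): fill None values lying between two observed
    anchor points by drawing a straight line between the anchors."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def fit_trend(xs, ys):
    """Ordinary least squares line; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = 1.0 if syy == 0 else (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


def extrapolate_tail(series, min_r_squared=0.9):
    """Rule 2 (sketch): extend the series over trailing None values,
    but only when the linear fit to the observed points has R-sq > 90%."""
    known = [(i, v) for i, v in enumerate(series) if v is not None]
    if len(known) < 2:
        return list(series)
    xs = [i for i, _ in known]
    ys = [v for _, v in known]
    slope, intercept, r_squared = fit_trend(xs, ys)
    if r_squared <= min_r_squared:
        return list(series)  # fit too poor: leave the recent years missing
    last_observed = xs[-1]
    return [v if i <= last_observed else intercept + slope * i
            for i, v in enumerate(series)]
```

For example, `interpolate_gaps([10.0, None, None, None, None, 20.0])` fills a five-year gap with evenly spaced values, while `extrapolate_tail` leaves a noisy series untouched because its trend does not pass the R-sq threshold.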
Before imputation
[Figure: data availability by source (World Bank, GHR, TME; climate, BCG, diabetes). Darker shades = data missing; lighter shades = data available.]
After imputation
[Figure: data availability by source (World Bank, GHR, TME; climate, BCG, diabetes) after imputation.]
- Missing value imputation "successful" for World Bank and GHR data
- Still many missing covariates for 2013
Incidence: Model inputs and outputs
Outcome variable:
- Num: Estimated number of incident cases (all forms)
- Den: Estimated total population
Training set:
- First instance 1,688 datapoints over 24 years (1990-2013)
- Final model: 213 datapoints with complete data
Predictions:
- 2013 estimates for 100 middle income + 6 high income countries
- Not all have complete data for predictions
[Figure: percentage of countries with complete covariate data out of all eligible countries for training set, per year (73 countries, 213 datapoints)]
Prevalence: Model inputs and outputs
Outcome variable:
- Bacteriologically confirmed (BC) TB prevalence
Training set:
- Country estimates from prevalence surveys conducted from 2007 onwards (standardised analysis methodology): 13 countries
- Subnational estimates from TB prevalence surveys: 5 countries
- India (only district level prevalence survey estimates): 2 districts
- Total: 30 datapoints
Predictions:
- 2013 estimates for 25 low and 49 middle income countries
- without prevalence survey
- with expected prevalence >0.1% according to WHO estimates
Mortality: model inputs and outputs
Outcome variable:
- Num: Estimated number of deaths from TB (all forms, excl. HIV) from vital registration systems
- Den: Estimated total population
Training set:
- First instance 3,022 datapoints over 24 years (1990-2013)
- Final model: 307 datapoints with complete data
Predictions:
- 2013 estimates for 11 high, 42 middle and 32 low income countries, and 6 countries with missing income status
- Not all have complete data for predictions
[Figure: percentage of countries with complete covariate data out of all eligible countries for training set, per year (128 countries, 307 datapoints)]
Selection criteria for predictors in model
1. Predictors selected based on completeness in training dataset
- <40% complete excluded for mortality and incidence models
- <100% complete excluded for prevalence models
2. Univariate relationships: complete predictors vs. outcome
3. Pairwise correlations
- Identify highly correlated predictors (> 0.8)
- drop based on the lowest relative fit vs. outcome
→ Incidence: Reduction of variables from 166 to 37
→ Prevalence: Reduction of variables from 166 to 30
→ Mortality: Reduction of variables from 166 to 54
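The pairwise correlation step (criterion 3) might look something like the sketch below. It is illustrative only: the slides do not say how "relative fit vs. outcome" was measured, so the absolute Pearson correlation with the outcome is used here as a stand-in, and the function and variable names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def drop_correlated(predictors, outcome, threshold=0.8):
    """predictors: dict mapping a predictor name to its list of values.
    For every pair of predictors correlated above the threshold, drop the
    one with the weaker relationship to the outcome (assumed measure:
    absolute correlation with the outcome). Returns the names kept."""
    fit = {name: abs(pearson(values, outcome))
           for name, values in predictors.items()}
    kept = set(predictors)
    for a, b in combinations(predictors, 2):
        if a not in kept or b not in kept:
            continue
        if abs(pearson(predictors[a], predictors[b])) > threshold:
            kept.discard(a if fit[a] < fit[b] else b)
    return [name for name in predictors if name in kept]
```

With three toy predictors where two are strongly correlated with each other, only the one that tracks the outcome more closely survives, alongside the unrelated third.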
Selected predictors (all models combined)
TB outcome data
• All new cases and relapse cases with unknown treatment history *
• New all forms cases/rate (+ lag 1/lag 2y) *
• New laboratory confirmed cases/rate (+ lags) *
• All notified cases*
• % MDR among new cases
TB programmatic determinants
• Total expenditure on health/expressed as % GDP
• TB patients with HIV test result in TB register
• All previously treated TB cases
• Percentage retreated out of all cases
• Treatment success rate in new and retreatment cases
• Number of laboratories providing tuberculosis diagnostic services using sputum smear microscopy
• BCG vaccination rates in children
Co-morbidities
• TB patients recorded as HIV positive
• % HIV positive among all patients notified
• Prevalence of diabetes
* Not included in incidence model
Socio environmental factors
• Gross National Income per capita
• Life expectancy at birth (overall, M, F)
• Life expectancy at 60 (overall, M, F)
• Total population (overall, M, F), sex ratio
• % population 15 or younger/60 or older
• Maternal mortality ratio, under-five mortality
• Percentage urban population, population density
• % population with access to improved water/sanitation
• Average temperature in the coldest/warmest month
• Average temperature
• Average precipitation
Model selection approach
Mortality and incidence:
- GLM with Poisson, negative binomial and zero-inflated distributions
Prevalence:
- GLM with binomial (logit link) and negative binomial distributions
Final multivariate model selected based on the Akaike Information Criterion (AIC; likelihood-based)
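As a reminder of how AIC-based selection works: AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximised likelihood, and the candidate with the lowest AIC is preferred. A minimal sketch follows; the log-likelihoods and parameter counts in the usage note are made-up placeholders, not values from these models.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def select_by_aic(candidates):
    """candidates: dict mapping a model name to (log_likelihood, n_params).
    Returns the name of the model with the lowest AIC and all AIC scores."""
    scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

For instance, `select_by_aic({"poisson": (-520.0, 5), "negative_binomial": (-480.0, 6)})` prefers the negative binomial candidate: its extra dispersion parameter costs 2 AIC points but the hypothetical gain in log-likelihood is worth 80.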
Final model: incidence (negative binomial)
Model predictors                                   Coefficient  Strength
                                                   (log scale)
(Intercept)                                          -8.02000
Percentage retreatment (out of all cases)             0.02368      23.7
% population 60 yrs +                                 0.02053      20.5
% MDR out of all new cases                            0.00466       4.7
Average temperature in warmest month                  0.00311       3.1
TB patients recorded as HIV positive                  0.00141       1.4
All previously treated cases                          0.00054       0.5
Average precipitation                                 0.00027       0.3
GNI                                                   0.00001       0.0
TB patients with HIV test result in TB register      -0.00005      -0.1
Total expenditure on health                          -0.00008      -0.1
Life expectancy at 60 in males                       -0.08271     -82.7
Government expenditure on health (% GDP)             -0.10510    -105.1
Diabetes prevalence                                  -2.31000   -2310.0
Table 1. Estimated coefficients (log scale) of final negative binomial multivariate model for incidence (n=213)
Variable      Estimate   Std. Error   Pr(>|z|)
(Intercept)   -8.02700     4.40E-01    0.00000
totexphaer    -0.00008     3.35E-05    0.01352
govexphgdp    -0.10510     2.77E-02    0.00015
lifeexp60m    -0.08271     1.84E-02    0.00001
pop60          0.02053     7.16E-03    0.00416
tmax           0.00311     6.25E-04    0.00000
ret_af         0.00054     8.47E-05    0.00000
prec           0.00027     4.91E-05    0.00000
hivtest_pos    0.00141     2.22E-04    0.00000
perc_mdr_new   0.00466     2.13E-03    0.02835
gnipp          0.00001     2.39E-06    0.00002
hivtest       -0.00005     1.81E-05    0.00447
DM_prev       -2.31000     1.25E+00    0.06536
perc_ret       0.02368     8.49E-03    0.00531
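To show how log-scale coefficients like those in Table 1 translate into a predicted rate, here is a small sketch. It assumes a log link with total population as the offset (consistent with the Num/Den outcome definition, but not stated explicitly in the slides), and it assumes each covariate is supplied in the same units used when the model was fitted, which the slides do not specify. The function names are hypothetical.

```python
from math import exp

# Estimates copied from Table 1 (incidence model, log scale).
INTERCEPT = -8.027
COEFFICIENTS = {
    "totexphaer": -0.00008,
    "govexphgdp": -0.10510,
    "lifeexp60m": -0.08271,
    "pop60": 0.02053,
    "tmax": 0.00311,
    "ret_af": 0.00054,
    "prec": 0.00027,
    "hivtest_pos": 0.00141,
    "perc_mdr_new": 0.00466,
    "gnipp": 0.00001,
    "hivtest": -0.00005,
    "DM_prev": -2.31000,
    "perc_ret": 0.02368,
}


def predicted_rate(covariates):
    """Predicted cases per person: exp(intercept + sum of coefficient * value).
    Covariates not supplied are treated as zero (an assumption of this sketch)."""
    eta = INTERCEPT + sum(COEFFICIENTS[name] * value
                          for name, value in covariates.items())
    return exp(eta)


def rate_ratio(name, change=1.0):
    """Multiplicative change in the predicted rate when one covariate
    increases by `change` units and all others are held fixed."""
    return exp(COEFFICIENTS[name] * change)
```

Under these assumptions, a positive coefficient such as `pop60` gives a rate ratio above 1 and a negative one such as `lifeexp60m` gives a ratio below 1; multiplying `predicted_rate(...)` by the total population would give an expected number of incident cases.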
Final model: incidence
Predicted vs. observed rate (log scale)
Predicted vs. observed rate
Final model: prevalence (binomial logistic)
Model predictors                 Coefficient  Strength
                                 (log scale)
(Intercept)                        -3.03588
Climate score                       0.16039       160
New laboratory confirmed rate       0.00812         8
BCG coverage                       -0.03610       -36
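For a binomial model with a logit link, the linear predictor is turned into a prevalence with the inverse logit, p = 1 / (1 + exp(-eta)). The sketch below plugs in the three coefficients from the table above; the covariate key names are hypothetical, and the units of each covariate are an assumption since the slides do not give them.

```python
from math import exp

# Coefficients copied from the prevalence model table above.
INTERCEPT = -3.03588
COEFFICIENTS = {
    "climate_score": 0.16039,
    "new_lab_confirmed_rate": 0.00812,
    "bcg_coverage": -0.03610,
}


def inverse_logit(eta):
    """Map a value on the logit scale back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + exp(-eta))


def predicted_prevalence(covariates):
    """Predicted prevalence as a proportion; covariates not supplied are
    treated as zero (an assumption of this sketch)."""
    eta = INTERCEPT + sum(COEFFICIENTS[name] * value
                          for name, value in covariates.items())
    return inverse_logit(eta)
```

Because the BCG coverage coefficient is negative, raising `bcg_coverage` while holding the other inputs fixed lowers the predicted prevalence in this sketch.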
Final model: prevalence
[Figure, two panels: predicted vs. observed prevalence; predicted vs. observed prevalence (logistic scale)]
Final model: mortality (negative binomial)
Model predictors                                   Coefficient  Strength
                                                   (log scale)
(Intercept)                                         -10.57740
New all forms rate                                    0.71356       714
BCG coverage                                          0.17597       176
TB patients recorded as HIV positive                  0.16718       167
Percentage retreatment (out of all cases)             0.15697       157
Precipitation                                         0.11985       120
% MDR out of all new cases                            0.05991        60
Diabetes prevalence                                  -0.09344       -93
Government expenditure on health                     -0.13399      -134
Urban Population                                     -0.14584      -146
Treatment success rate for re-treatment cases        -0.21212      -212
Final model: mortality
Predicted vs. observed rate (log scale)
Predicted vs. observed rate
Model fit: deviance residuals
[Figure, three panels (Incidence, Prevalence, Mortality): predicted value (p_hat) vs. deviance residual]
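For reference, a deviance residual measures how much one observation contributes to the model deviance, signed by whether the observation lies above or below its fitted value. The sketch below covers only the binomial case (as in the prevalence model); the negative binomial residuals for incidence and mortality follow the same idea with a different deviance formula. Function and argument names are hypothetical.

```python
from math import log, sqrt


def _x_log_ratio(x, expected):
    """x * log(x / expected), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(x / expected)


def binomial_deviance_residual(cases, sample_size, p_hat):
    """Deviance residual for one binomial observation.
    cases: observed number of cases; sample_size: number of people surveyed;
    p_hat: fitted prevalence, strictly between 0 and 1."""
    expected_cases = sample_size * p_hat
    expected_non_cases = sample_size - expected_cases
    deviance = 2.0 * (_x_log_ratio(cases, expected_cases)
                      + _x_log_ratio(sample_size - cases, expected_non_cases))
    magnitude = sqrt(max(deviance, 0.0))  # guard against tiny negative rounding
    return magnitude if cases >= expected_cases else -magnitude
```

A residual near zero means the survey result matches the fitted prevalence; large positive or negative values flag observations the model under- or over-predicts.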
Model validation: cross validation
1. Split data randomly into k partitions
2. For each partition, fit the specified model using the other k-1 groups
3. Pseudo-R-sq: square of correlation between predicted and observed
Incidence: k=5, R-sq=0.94
Prevalence: k=2, x5, R-sq=0.76
Mortality: k=5, R-sq=0.89
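The three cross-validation steps can be sketched as below. To keep the example self-contained, a one-predictor least squares line stands in for the actual GLMs, so this illustrates the procedure only; the function names and the fixed random seed are assumptions.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Least squares line (stand-in for the real model); returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(a, b):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def cross_validated_r_squared(xs, ys, k=5, seed=0):
    """k-fold cross validation: each point is predicted by a model fitted
    without its own fold; returns the squared correlation between the
    out-of-fold predictions and the observations (pseudo-R-sq)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)          # step 1: random partitions
    folds = [indices[i::k] for i in range(k)]
    observed, predicted = [], []
    for fold in folds:                            # step 2: fit on the other k-1 folds
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in fold:
            observed.append(ys[i])
            predicted.append(intercept + slope * xs[i])
    return pearson(observed, predicted) ** 2      # step 3: pseudo-R-sq
```

Repeating the call with different seeds and averaging the results would correspond to a repeated scheme such as the "k=2, x5" setting, if that notation means two folds repeated five times.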
Training set, K=2
- Develop model in one subset
- Check predicted vs. observed in the other subset
Discussion
Predictive models could be fitted for all three tasks
- Goodness of fit satisfactory
- Further refinement of models and database necessary before predictions can be made
Incidence and mortality:
- Include random effects for countries or income status
- Build one model just on high income countries and one just on middle income countries, and compare variables/coefficients
Include time lagged variables:
- so far only included lagged TB notification rates (1 and 2 years)
New predictor data recently compiled:
- Number of large cities >500,000 and >1 million inhabitants
- Prevalence of prisoners (UNDP), migrants (World Bank), drug use (UNODC), refugees and displaced populations (UNHCR)
Thank you