Predictive statistical modelling approach to estimating TB ... · years (1990-2013) - Final model: 307 datapoints with complete data Predictions - 2013 estimates for 11 high, 42 middle

Predictive statistical modelling approach to estimating TB burden

Sandra Alba, Ente Rood, Masja Straetemans and Mirjam Bakker

Overall aim, interim results

Overall aim of predictive models:

1. To enable predictions of TB incidence, prevalence and mortality
- from 1990 to 2015
- for a selection of countries for which the Task Force is mandated to produce estimates

2. To identify a set of conditions which warrant or do not warrant the use of these models

Focus of this presentation:

- Development of database and model structures
- to enable predictions for 2013

Approach to predictive modelling

Explanatory models → correctly capture causal pathways
Predictive models → maximize predictive power (training set)

Source: Shmueli, "To explain or to predict?"

Task 1: Incidence

Incidence: Training set vs. predictions

- Training set: use data from robust surveillance systems, mostly from high-income countries
- Predictions: incidence in middle/high income countries relying on indirect incidence estimates

Task 2: Prevalence

Prevalence: Training set vs. predictions

- Training set: use data from recent national TB prevalence surveys
- Predictions: prevalence for countries where national surveys have not been implemented
  - low/middle income countries
  - with predicted prevalence of over 0.1%

Task 3: Mortality

Mortality: Training set vs. predictions

- Training set: use data from countries using vital registration systems to estimate mortality, mostly from middle and high-income countries
- Predictions: mortality in countries without vital registration data, mostly low-income countries

Conceptual framework

TB outcome data
- TB case notification
- MDR TB

TB programmatic determinants
- Weak health system, case finding ability (suspects, diagnostics)
- Poor access to TB services, treatment outcomes
- Inappropriate health seeking behavior
- BCG vaccination coverage in children

Co-morbidities
- HIV
- Poor nutritional status
- Diabetes
- Lung diseases

Socio environmental factors
- Weather: humidity and temperature
- High risk groups: prisoners, homeless people, migrants, drug addicts, refugees, IDP
- Urbanization: population density, poor water source and sanitation, crowded living conditions, poor ventilation, indoor air pollution
- Smoking, alcoholism
- Aging populations

Database compilation

Predictor variables

National level:
- Tuberculosis Monitoring and Evaluation (TME)
- Global Health Repository (GHR)
- World Bank
- UNICEF reports (BCG prevalence)
- International Diabetes Federation

Subnational level:
- National statistical agencies (e.g. census)
- Multiple Indicator Cluster Surveys (MICS)
- Demographic and Health Surveys (DHS)
- District level surveys (India)
- NTPs (TB notifications)

Outcome variables

- National estimates: WHO TB database (Global TB data collection system)
- Subnational estimates (prevalence): survey reports/collaborators

Database completeness

WHO TB estimates were produced in 2013 for 217 countries
- countries with complete set of predictors in 2013: 166

Missing data imputation

1. Predictor data only available at set intervals (e.g. every 5 years)
- missing values imputed using a linear imputation
- start and end observations used as anchor points

2. Predictor data missing for the most recent years only
- linear trend imputation to extrapolate the existing series
- only if fit R-sq >90%
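The two imputation rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names, the use of `None` for missing values, the requirement of at least three observations before extrapolating, and the decision to leave flat series untouched (where R-sq is undefined) are all hypothetical choices.

```python
from typing import List, Optional, Sequence

Series = Sequence[Optional[float]]


def interpolate_gaps(years: Sequence[int], values: Series) -> List[Optional[float]]:
    """Rule 1: fill interior gaps on the straight line between the
    neighbouring observed values (the anchor points)."""
    filled = list(values)
    observed = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(observed, observed[1:]):
        span = years[right] - years[left]
        for i in range(left + 1, right):
            weight = (years[i] - years[left]) / span
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def extrapolate_trend(years: Sequence[int], values: Series,
                      min_r2: float = 0.90) -> List[Optional[float]]:
    """Rule 2: extend the series to trailing missing years with a
    least-squares line, but only if the fit has R-squared above min_r2."""
    points = [(x, y) for x, y in zip(years, values) if y is not None]
    n = len(points)
    if n < 3:  # assumption: two points always give R-squared = 1, so require more
        return list(values)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0 or syy == 0:  # R-squared undefined; assumption: leave as is
        return list(values)
    r2 = sxy ** 2 / (sxx * syy)
    if r2 <= min_r2:
        return list(values)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    last_year = max(x for x, _ in points)
    return [
        intercept + slope * x if y is None and x > last_year else y
        for x, y in zip(years, values)
    ]


if __name__ == "__main__":
    # A predictor measured every 5 years, then a series missing its last 2 years.
    print(interpolate_gaps([2000, 2005, 2010], [40.0, None, 60.0]))
    print(extrapolate_trend([2009, 2010, 2011, 2012, 2013],
                            [1.0, 2.0, 3.0, None, None]))
```

Series that fail the R-sq threshold are returned unchanged, so they stay missing.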


Before imputation

[Figure: data completeness by source (World Bank; GHR; TME; climate, BCG, diabetes) before imputation. Darker shades = data missing; lighter shades = data available.]

After imputation

[Figure: data completeness by source (World Bank; GHR; TME; climate, BCG, diabetes) after imputation.]

- Missing value imputation "successful" for World Bank and GHR data
- Still many missing covariates for 2013

Incidence: Model inputs and outputs

Outcome variable:
- Num: Estimated number of incident cases (all forms)
- Den: Estimated total population

Training set:
- First instance 1,688 datapoints over 24 years (1990-2013)
- Final model: 213 datapoints with complete data

Predictions:
- 2013 estimates for 100 middle income + 6 high income countries
- Not all have complete data for predictions

[Figure: Percentage of countries with complete covariate data out of all eligible countries for training set, per year (73 countries, 213 datapoints)]

Prevalence: Model inputs and outputs

Outcome variable:
- Bacteriologically confirmed (BC) TB prevalence

Training set:
- Country estimates from prevalence surveys conducted from 2007 onwards (standardised analysis methodology): 13 countries
- Subnational estimates from TB prevalence surveys: 5 countries
- India – only district level prevalence survey estimates: 2 districts
- Total: 30 datapoints

Predictions:
- 2013 estimates for 25 low and 49 middle income countries
  - without prevalence survey
  - with expected prevalence >0.1% according to WHO estimates

Mortality: model inputs and outputs

Outcome:
- Num: Estimated number of deaths from TB (all forms, excl. HIV) from vital registration systems
- Den: Estimated total population

Training set:
- First instance 3,022 datapoints over 24 years (1990-2013)
- Final model: 307 datapoints with complete data

Predictions:
- 2013 estimates for 11 high, 42 middle and 32 low income countries, and 6 countries with missing income status
- Not all have complete data for predictions

[Figure: Percentage of countries with complete covariate data out of all eligible countries for training set, per year (128 countries, 307 datapoints)]

Selection criteria for predictors in model

1. Predictors selected based on completeness in training dataset
- <40% complete excluded for mortality and incidence models
- <100% complete excluded for prevalence models

2. Univariate relationships: complete predictors vs. outcome

3. Pairwise correlations

- Identify highly correlated predictors (> 0.8)
- drop based on the lowest relative fit vs. outcome

→ Incidence: Reduction of variables from 166 to 37
→ Prevalence: Reduction of variables from 166 to 30
→ Mortality: Reduction of variables from 166 to 54
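The pairwise correlation step can be sketched as below. The slides do not say how "relative fit vs. outcome" was measured, so this example assumes the absolute Pearson correlation with the outcome as a stand-in. The function names and the greedy best-first order are likewise hypothetical, and every predictor is assumed to be non-constant.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(predictors: Dict[str, Sequence[float]],
                    outcome: Sequence[float],
                    threshold: float = 0.8) -> List[str]:
    """Keep predictors so that no kept pair has |r| > threshold; within a
    highly correlated pair the one with the weaker univariate fit is dropped."""
    # Assumption: univariate fit = |correlation with the outcome|.
    fit = {name: abs(pearson(vals, outcome)) for name, vals in predictors.items()}
    kept: List[str] = []
    # Visit predictors from best to worst fit, so a predictor is only ever
    # rejected because a better-fitting, highly correlated one was kept.
    for name in sorted(predictors, key=lambda key: fit[key], reverse=True):
        if all(abs(pearson(predictors[name], predictors[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    outcome = [1.0, 2.0, 3.0, 4.0, 5.0]
    predictors = {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],  # best fit to the outcome
        "b": [1.0, 2.0, 3.0, 4.0, 6.0],  # nearly a copy of "a"
        "c": [2.0, 1.0, 2.0, 1.0, 2.0],  # unrelated to "a" and "b"
    }
    print(drop_correlated(predictors, outcome))
```

In the toy data, "b" is dropped because it is highly correlated with the better-fitting "a", while "c" is retained.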


Selected predictors (all models combined)

TB outcome data
• All new cases and relapse cases with unknown treatment history *
• New all forms cases/rate (+ lag 1/lag 2y) *
• New laboratory confirmed cases/rate (+ lags) *
• All notified cases *
• % MDR among new cases

TB programmatic determinants
• Total expenditure on health/expressed as % GDP
• TB patients with HIV test result in TB register
• All previously treated TB cases
• Percentage retreated out of all cases
• Treatment success rate in new and retreatment cases
• Number of laboratories providing tuberculosis diagnostic services using sputum smear microscopy
• BCG vaccination rates in children

Co-morbidities
• TB patients recorded as HIV positive
• % HIV positive among all patients notified
• Prevalence of diabetes

Socio environmental factors
• Gross National Income per capita
• Life expectancy at birth (overall, M, F)
• Life expectancy at 60 (overall, M, F)
• Total population (overall, M, F), sex ratio
• % population 15 or younger/60 or older
• Maternal mortality ratio, under-five mortality
• Percentage urban population, population density
• % population with access to improved water/sanitation
• Average temperature in the coldest/warmest month
• Average temperature
• Average precipitation

* Not included in incidence model

Model selection approach

Mortality and incidence:
- GLM, Poisson, negative binomial and zero inflated distribution

Prevalence:
- GLM, binomial (logistic link) and negative binomial distributions

Final multivariate model selected based on the Akaike Information Criterion, AIC (likelihood based)
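For reference, the count models and the selection criterion named above can be written out as follows. This is a sketch rather than the authors' exact specification. The log link matches the "log scale" coefficients reported for the incidence and mortality models, but treating the estimated total population N_i as an offset is an assumption, and the symbols are introduced here for illustration only.

```latex
\begin{align}
  \log \mathbb{E}[Y_i] &= \log N_i + \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \\
  \mathrm{AIC} &= 2k - 2 \ln \hat{L}
\end{align}
```

Here Y_i is the number of cases or deaths in country-year i, x_{ij} are the predictors, k is the number of estimated parameters and \hat{L} is the maximized likelihood. Among candidate models fitted to the same data, the one with the lowest AIC is preferred.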


Final model: incidence (negative binomial)


Model predictors                                   Coefficient (log scale)   Strength
(Intercept)                                        -8.02700
Percentage retreatment (out of all cases)           0.02368                     23.7
% population 60 yrs +                               0.02053                     20.5
% MDR out of all new cases                          0.00466                      4.7
Average temperature in warmest month                0.00311                      3.1
TB patients recorded as HIV positive                0.00141                      1.4
All previously treated cases                        0.00054                      0.5
Average precipitation                               0.00027                      0.3
GNI                                                 0.00001                      0.0
TB patients with HIV test result in TB register    -0.00005                     -0.1
Total expenditure on health                        -0.00008                     -0.1
Life expectancy at 60 in males                     -0.08271                    -82.7
Government expenditure on health (% GDP)           -0.10510                   -105.1
Diabetes prevalence                                -2.31000                  -2310.0

Table 1. Estimated coefficients (log scale) of final negative binomial multivariate model for incidence (n=213)

Predictor       Estimate    Std. Error   Pr(>|z|)
(Intercept)     -8.02700    4.40E-01     0.00000
totexphaer      -0.00008    3.35E-05     0.01352
govexphgdp      -0.10510    2.77E-02     0.00015
lifeexp60m      -0.08271    1.84E-02     0.00001
pop60            0.02053    7.16E-03     0.00416
tmax             0.00311    6.25E-04     0.00000
ret_af           0.00054    8.47E-05     0.00000
prec             0.00027    4.91E-05     0.00000
hivtest_pos      0.00141    2.22E-04     0.00000
perc_mdr_new     0.00466    2.13E-03     0.02835
gnipp            0.00001    2.39E-06     0.00002
hivtest         -0.00005    1.81E-05     0.00447
DM_prev         -2.31000    1.25E+00     0.06536
perc_ret         0.02368    8.49E-03     0.00531
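Because the coefficients are on the log scale, exponentiating one gives the multiplicative change in the expected rate for a one-unit increase in that predictor. The sketch below applies this to a few estimates copied from Table 1. The helper names are hypothetical, and the units in which each predictor was entered are not stated on the slides, so the "one unit" is whatever unit the authors used. The "strength" column on the slides appears to equal the coefficient multiplied by 1,000, which makes it depend on those units as well.

```python
import math

# Estimates (log scale) copied from Table 1 for a few predictors.
COEFFICIENTS = {
    "perc_ret": 0.02368,
    "pop60": 0.02053,
    "lifeexp60m": -0.08271,
    "govexphgdp": -0.10510,
}


def rate_ratio(coefficient: float, delta: float = 1.0) -> float:
    """Multiplicative change in the expected rate for a delta-unit increase."""
    return math.exp(coefficient * delta)


def percent_change(coefficient: float, delta: float = 1.0) -> float:
    """The same effect expressed as a percentage change in the expected rate."""
    return 100.0 * (rate_ratio(coefficient, delta) - 1.0)


if __name__ == "__main__":
    for name, beta in COEFFICIENTS.items():
        print(f"{name}: rate ratio {rate_ratio(beta):.4f}, "
              f"change {percent_change(beta):+.2f}%")
```

A positive coefficient gives a rate ratio above 1 and a negative coefficient gives a ratio below 1, all other predictors held fixed.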

Final model: incidence

[Figures: Predicted vs. observed rate (log scale); Predicted vs. observed rate]

Final model: prevalence (binomial logistic)

Model predictors                   Coefficient (log scale)   Strength
Climate score                       0.16039                   160
New laboratory confirmed rate       0.00812                     8
BCG coverage                       -0.03610                   -36

Final model: prevalence

[Figures: Predicted vs. observed prevalence; Predicted vs. observed prevalence (logistic scale)]
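The two prevalence plots differ only in scale. A minimal sketch of the conversion follows, assuming that "logistic scale" means the logit, i.e. the log-odds of prevalence; the function names are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(z: float) -> float:
    """Proportion corresponding to a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Translate a few logit-scale values into cases per 100,000 population.
    for z in (-7.5, -6.5, -5.5, -5.0):
        print(f"logit {z:+.1f} -> {inv_logit(z) * 100_000:.0f} per 100,000")
```

At prevalences this low, the logit is almost identical to the natural log of the prevalence.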

Final model: mortality (negative binomial)

Model predictors                                  Coefficient (log scale)   Strength
New all forms rate                                 0.71356                   714
BCG coverage                                       0.17597                   176
TB patients recorded as HIV positive               0.16718                   167
Percentage retreatment (out of all cases)          0.15697                   157
Precipitation                                      0.11985                   120
% MDR out of all new cases                         0.05991                    60
Diabetes prevalence                               -0.09344                   -93
Government expenditure on health                  -0.13399                  -134
Urban Population                                  -0.14584                  -146
Treatment success rate for re-treatment cases     -0.21212                  -212

Final model: mortality

[Figures: Predicted vs. observed rate (log scale); Predicted vs. observed rate]

Model fit: deviance residuals

[Figures: deviance residuals for the incidence, prevalence and mortality models]

Model validation: cross validation

1. Split data randomly into k partitions
2. For each partition, fit the specified model using the other k-1 groups
3. Pseudo-R-sq: square of correlation between predicted and observed values

Results:
- Incidence: k=5, R-sq=0.94
- Prevalence: k=2, x5, R-sq=0.76
- Mortality: k=5, R-sq=0.89
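The three steps above can be sketched as follows. To keep the example self-contained, a simple least-squares line stands in for the GLMs used in the presentation. Pooling the held-out predictions across folds before computing the correlation is an assumption, as are the function names. If "k=2, x5" means two-fold cross validation repeated five times, that would correspond to calling the function with five different seeds and averaging the results.

```python
import math
import random
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope (stand-in for the real model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_r2(x: Sequence[float], y: Sequence[float],
                       k: int = 5, seed: int = 0) -> float:
    """k-fold cross validation with pseudo-R-sq = squared correlation
    between held-out predictions and observed values."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)          # step 1: random partitions
    folds = [indices[i::k] for i in range(k)]
    predicted: List[float] = [0.0] * len(x)
    for held_out in folds:                        # step 2: fit on the other k-1
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predicted[i] = intercept + slope * x[i]
    return pearson(predicted, y) ** 2             # step 3: pseudo-R-sq


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [float(i) for i in range(40)]
    ys = [2.0 * v + 1.0 + rng.gauss(0.0, 3.0) for v in xs]
    print(round(cross_validated_r2(xs, ys, k=5), 3))
```

Every observation is predicted exactly once, by a model that did not see it during fitting.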

[Figure: cross validation with K=2. The training set is split in two: the model is developed in one subset, and predicted vs. observed values are checked in the other subset.]

Discussion

Predictive models could be fitted for all three tasks
- Goodness of fit satisfactory
- Further refinement of models and database necessary before predictions can be made

Incidence and mortality:
- Include random effects for countries or income status
- build one model just on high income countries and one just on middle income countries and compare variables/coefficients

Include time lagged variables:
- so far only included lagged TB notification rates (1 and 2 years)

New predictor data recently compiled:
- Number of large cities >500,000 and >1 million inhabitants
- Prevalence of prisoners (UNDP), migrants (World Bank), drug use (UNODC), refugees and displaced populations (UNHCR)
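The lagged notification rates mentioned above could be derived from a country-year series as in the sketch below. The data layout (a dictionary keyed by country and year) and the function name are hypothetical, and the numbers in the usage example are made up for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple

Key = Tuple[str, int]  # (country, year)


def add_lags(rates: Dict[Key, float],
             lags: Sequence[int] = (1, 2)) -> Dict[Key, Dict[str, Optional[float]]]:
    """For every country-year, look up the value 1, 2, ... years earlier.
    A lag is None when the earlier year is not in the data."""
    return {
        (country, year): {f"lag{k}": rates.get((country, year - k)) for k in lags}
        for (country, year) in rates
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from any database.
    notification_rate = {
        ("A", 2011): 10.0,
        ("A", 2012): 12.0,
        ("A", 2013): 11.0,
        ("B", 2013): 5.0,
    }
    print(add_lags(notification_rate)[("A", 2013)])
```

The same helper would apply unchanged to any other predictor for which lags are added later.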


Thank you
