Predictive statistical modelling approach to estimating TB burden Sandra Alba, Ente Rood, Masja Straetemans and Mirjam Bakker


Overall aim, interim results

Overall aim of predictive models:

1. To enable predictions of TB incidence, prevalence and mortality
   - from 1990 to 2015
   - for a selection of countries for which the Task Force is mandated to produce estimates

2. To identify a set of conditions which warrant or do not warrant the use of these models

Focus of this presentation:

- Development of database and model structures
- to enable predictions for 2013

Approach to predictive modelling

Explanatory models → correctly capture causal pathways

Predictive models → maximize predictive power (training set)

Source: Shmueli, "To explain or to predict?"

Task 1: Incidence

- Target: incidence in middle/high income countries relying on indirect incidence estimates
- Training data: use data from robust surveillance systems (mostly from high-income countries)

Incidence: Training set vs. predictions


Task 2: Prevalence

- Target: prevalence for countries where national surveys have not been implemented
  - low/middle income countries
  - with predicted prevalence of over 0.1%
- Training data: use data from recent national TB prevalence surveys

Prevalence: Training set vs. predictions


Task 3: Mortality

- Target: mortality in countries without vital registration data (mostly low-income countries)
- Training data: use data from countries using vital registration systems to estimate mortality (mostly from middle and high-income countries)

Mortality: Training set vs. predictions


Conceptual framework

TB outcome data

• TB case notification
• MDR TB

TB programmatic determinants

• Weak health system, case finding ability (suspects, diagnostics)
• Poor access to TB services, treatment outcomes
• Inappropriate health seeking behavior
• BCG vaccination coverage in children

Co-morbidities

• HIV
• Poor nutritional status
• Diabetes
• Lung diseases

Socio environmental factors

• Weather: humidity and temperature
• High risk groups: prisoners, homeless people, migrants, drug addicts, refugees, IDP
• Urbanization: population density, poor water source and sanitation, crowded living conditions, poor ventilation, indoor air pollution
• Smoking, alcoholism
• Aging populations

Database compilation

Predictor variables

National level:
- Tuberculosis Monitoring and Evaluation (TME)
- Global Health Repository (GHR)
- World Bank
- UNICEF reports (BCG prevalence)
- International Diabetes Federation

Subnational level:
- National statistical agencies (e.g. census)
- Multiple Indicator Cluster Surveys (MICS)
- Demographic and Health Surveys (DHS)
- District level surveys (India)
- NTPs (TB notifications)

Outcome variables

- National estimates: WHO TB database (Global TB data collection system)
- Subnational estimates (prevalence): Survey reports/collaborators

Database completeness

WHO TB estimates were produced in 2013 for 217 countries

- countries with complete set of predictors in 2013: 166

Missing data imputation

1. Predictor data only available at set intervals (e.g. every 5 years)
   - missing values imputed using a linear imputation
   - start and end observations used as anchor points

2. Predictor data missing for the most recent years only
   - linear trend imputation to extrapolate the existing series
   - only if fit R-sq >90%
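The two imputation rules can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the use of `None` for missing values, the ordinary least squares fit and the minimum of three observations are all assumptions made for the example.

```python
def interpolate_between_anchors(years, values):
    """Rule 1: fill gaps that lie between two observed values (the anchor
    points) by straight-line interpolation. Leading/trailing gaps stay None."""
    known = [(y, v) for y, v in zip(years, values) if v is not None]
    filled = list(values)
    for (y0, v0), (y1, v1) in zip(known, known[1:]):
        for i, y in enumerate(years):
            if y0 < y < y1 and filled[i] is None:
                filled[i] = v0 + (v1 - v0) * (y - y0) / (y1 - y0)
    return filled


def extrapolate_linear_trend(years, values, min_r2=0.90):
    """Rule 2: extend a series beyond its last observation with a fitted
    line, but only when the linear fit has R-squared above min_r2."""
    known = [(y, v) for y, v in zip(years, values) if v is not None]
    n = len(known)
    if n < 3:  # assumption: too few points to judge a trend
        return list(values)
    mean_y = sum(y for y, _ in known) / n
    mean_v = sum(v for _, v in known) / n
    sxx = sum((y - mean_y) ** 2 for y, _ in known)
    svv = sum((v - mean_v) ** 2 for _, v in known)
    sxy = sum((y - mean_y) * (v - mean_v) for y, v in known)
    if sxx == 0 or svv == 0:
        return list(values)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_y
    r2 = sxy ** 2 / (sxx * svv)  # R-squared of a simple linear fit
    if r2 <= min_r2:
        return list(values)
    last_year = known[-1][0]
    return [
        intercept + slope * y if v is None and y > last_year else v
        for y, v in zip(years, values)
    ]
```

A series that fails the R-sq check is returned unchanged, so its most recent years stay missing rather than being filled with an unreliable trend.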

Before imputation

[Figure: data availability by predictor source (World Bank, GHR, TME; climate, BCG, diabetes). Darker shades = data missing; lighter shades = data available.]

After imputation

[Figure: data availability by predictor source (World Bank, GHR, TME; climate, BCG, diabetes) after imputation.]

- Missing value imputation "successful" for World Bank and GHR data
- Still many missing covariates for 2013

Incidence: Model inputs and outputs

Outcome variable:
- Num: Estimated number of incident cases (all forms)
- Den: Estimated total population

Training set:
- First instance 1,688 datapoints over 24 years (1990-2013)
- Final model: 213 datapoints with complete data

Predictions:
- 2013 estimates for 100 middle income + 6 high income countries
- Not all have complete data for predictions

[Figure: Percentage of countries with complete covariate data out of all eligible countries for training set, per year (73 countries, 213 datapoints)]
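A count numerator with a population denominator is commonly modelled as a log-linear count model with the population entering as an offset. Whether the models here were specified exactly this way is an assumption. The symbols are introduced only for illustration: $Y_{ct}$ is the estimated number of incident cases in country $c$ and year $t$, $N_{ct}$ the estimated total population, and $x_{j,ct}$ the predictors.

```latex
\log \mathbb{E}[Y_{ct}] = \log N_{ct} + \beta_0 + \sum_{j} \beta_j \, x_{j,ct}
\quad\Longleftrightarrow\quad
\frac{\mathbb{E}[Y_{ct}]}{N_{ct}} = \exp\Bigl(\beta_0 + \sum_{j} \beta_j \, x_{j,ct}\Bigr)
```

Under this form the coefficients act on the log of the rate, which is consistent with coefficients being reported on the log scale.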


Prevalence: Model inputs and outputs

Outcome variable:
- Bacteriologically confirmed (BC) TB prevalence

Training set:
- Country estimates from prevalence surveys conducted from 2007 onwards (standardised analysis methodology): 13 countries
- Subnational estimates from TB prevalence surveys: 5 countries
- India: only district level prevalence survey estimates: 2 districts
- Total: 30 datapoints

Predictions:
- 2013 estimates for 25 low and 49 middle income countries
  - without prevalence survey
  - with expected prevalence >0.1% according to WHO estimates

Mortality: model inputs and outputs

Outcome:
- Num: Estimated number of deaths from TB (all forms, excl. HIV) from vital registration systems
- Den: Estimated total population

Training set:
- First instance 3,022 datapoints over 24 years (1990-2013)
- Final model: 307 datapoints with complete data

Predictions:
- 2013 estimates for 11 high, 42 middle and 32 low income countries, and 6 countries with missing income status
- Not all have complete data for predictions

[Figure: Percentage of countries with complete covariate data out of all eligible countries for training set, per year (128 countries, 307 datapoints)]

Selection criteria for predictors in model

1. Predictors selected based on completeness in training dataset
   - <40% complete excluded for mortality and incidence models
   - <100% complete excluded for prevalence models

2. Univariate relationships: complete predictors vs. outcome

3. Pairwise correlations
   - Identify highly correlated predictors (> 0.8)
   - drop based on the lowest relative fit vs. outcome

→ Incidence: Reduction of variables from 166 to 37
→ Prevalence: Reduction of variables from 166 to 30
→ Mortality: Reduction of variables from 166 to 54
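The three selection steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: predictors are held as hypothetical name-to-column dictionaries with `None` for missing values, and the "relative fit vs. outcome" is approximated by the absolute Pearson correlation with the outcome.

```python
import math


def completeness(column):
    """Share of non-missing values in a predictor column."""
    return sum(v is not None for v in column) / len(column)


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_predictors(predictors, outcome, min_complete=0.40, max_corr=0.8):
    # Step 1: drop predictors below the completeness threshold.
    kept = {name: col for name, col in predictors.items()
            if completeness(col) >= min_complete}
    # Step 2: univariate relationship with the outcome
    # (assumption: |correlation| stands in for "fit vs. outcome").
    fit = {name: abs(pearson(col, outcome)) for name, col in kept.items()}
    # Step 3: walk predictors from best to worst fit and keep one only if it
    # is not highly correlated with a better-fitting predictor already kept.
    selected = []
    for name in sorted(kept, key=lambda n: fit[n], reverse=True):
        if all(abs(pearson(kept[name], kept[s])) <= max_corr for s in selected):
            selected.append(name)
    return selected
```

For the prevalence models the same function would be called with `min_complete=1.0`, matching the stricter completeness rule.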


Selected predictors (all models combined)

TB outcome data

• All new cases and relapse cases with unknown treatment history *
• New all forms cases/rate (+ lag 1/lag 2y) *
• New laboratory confirmed cases/rate (+ lags) *
• All notified cases *
• % MDR among new cases

TB programmatic determinants

• Total expenditure on health/expressed as % GDP
• TB patients with HIV test result in TB register
• All previously treated TB cases
• Percentage retreated out of all cases
• Treatment success rate in new and retreatment cases
• Number of laboratories providing tuberculosis diagnostic services using sputum smear microscopy
• BCG vaccination rates in children

Co-morbidities

• TB patients recorded as HIV positive
• % HIV positive among all patients notified
• Prevalence of diabetes

Socio environmental factors

• Gross National Income per capita
• Life expectancy at birth (overall, M, F)
• Life expectancy at 60 (overall, M, F)
• Total population (overall, M, F), sex ratio
• % population 15 or younger/60 or older
• Maternal mortality ratio, under-five mortality
• Percentage urban population, population density
• % population with access to improved water/sanitation
• Average temperature in the coldest/warmest month
• Average temperature
• Average precipitation

* Not included in incidence model

Model selection approach

Mortality and incidence:
- GLM with Poisson, negative binomial and zero-inflated distributions

Prevalence:
- GLM with binomial (logit link) and negative binomial distributions

Final multivariate model selected based on the Akaike Information Criterion (AIC, likelihood based)
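As a reminder of how an AIC comparison works, AIC = 2k - 2 log L, where k is the number of estimated parameters and L the maximised likelihood; the candidate with the lowest AIC is preferred. The sketch below compares a Poisson and a negative binomial likelihood on made-up overdispersed counts. The counts, the intercept-only mean and the fixed dispersion `theta` are hypothetical choices for illustration; in practice the dispersion would be estimated along with the other parameters.

```python
import math


def poisson_loglik(counts, means):
    """Poisson log-likelihood for observed counts and fitted means."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1)
               for y, mu in zip(counts, means))


def negbin_loglik(counts, means, theta):
    """Negative binomial (NB2) log-likelihood with mean mu and
    dispersion theta, i.e. variance = mu + mu**2 / theta."""
    total = 0.0
    for y, mu in zip(counts, means):
        total += (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
                  + theta * math.log(theta / (theta + mu))
                  + y * math.log(mu / (theta + mu)))
    return total


def aic(loglik, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * loglik


# Hypothetical, strongly overdispersed counts with an intercept-only mean.
counts = [0, 1, 2, 30, 3, 60]
mean = sum(counts) / len(counts)
means = [mean] * len(counts)

aic_poisson = aic(poisson_loglik(counts, means), n_params=1)
aic_negbin = aic(negbin_loglik(counts, means, theta=0.5), n_params=2)
print("Poisson AIC:", round(aic_poisson, 1))
print("Negative binomial AIC:", round(aic_negbin, 1))
```

On data like these the negative binomial is preferred despite its extra parameter, because the Poisson cannot accommodate a variance far larger than the mean.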


Final model: incidence (negative binomial)

Model predictors                                   Coefficient (log scale)   Strength
(Intercept)                                        -8.02000
Percentage retreatment (out of all cases)           0.02368                     23.7
% population 60 yrs +                               0.02053                     20.5
% MDR out of all new cases                          0.00466                      4.7
Average temperature in warmest month                0.00311                      3.1
TB patients recorded as HIV positive                0.00141                      1.4
All previously treated cases                        0.00054                      0.5
Average precipitation                               0.00027                      0.3
GNI                                                 0.00001                      0.0
TB patients with HIV test result in TB register    -0.00005                     -0.1
Total expenditure on health                        -0.00008                     -0.1
Life expectancy at 60 in males                     -0.08271                    -82.7
Government expenditure on health (% GDP)           -0.10510                   -105.1
Diabetes prevalence                                -2.31000                  -2310.0

Table 1. Estimated coefficients (log scale) of final negative binomial multivariate model for incidence (n=213)

Variable        Estimate    Std. Error   Pr(>|z|)
(Intercept)     -8.02700    4.40E-01     0.00000
totexphaer      -0.00008    3.35E-05     0.01352
govexphgdp      -0.10510    2.77E-02     0.00015
lifeexp60m      -0.08271    1.84E-02     0.00001
pop60            0.02053    7.16E-03     0.00416
tmax             0.00311    6.25E-04     0.00000
ret_af           0.00054    8.47E-05     0.00000
prec             0.00027    4.91E-05     0.00000
hivtest_pos      0.00141    2.22E-04     0.00000
perc_mdr_new     0.00466    2.13E-03     0.02835
gnipp            0.00001    2.39E-06     0.00002
hivtest         -0.00005    1.81E-05     0.00447
DM_prev         -2.31000    1.25E+00     0.06536
perc_ret         0.02368    8.49E-03     0.00531
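Coefficients on the log scale are easier to read as rate ratios: a one-unit increase in a predictor multiplies the expected rate by exp(coefficient). The sketch below applies this to a few of the reported estimates. The helper names are hypothetical, and `predicted_rate` assumes the linear predictor is simply the intercept plus the sum of coefficient times covariate value. The "strength" column in the slide appears to be the coefficient multiplied by 1000.

```python
import math

# Estimates copied from the incidence table (log scale).
coefficients = {
    "perc_ret": 0.02368,     # percentage retreatment (out of all cases)
    "pop60": 0.02053,        # % population 60 yrs +
    "lifeexp60m": -0.08271,  # life expectancy at 60 in males
    "govexphgdp": -0.10510,  # government expenditure on health (% GDP)
}


def rate_ratio(beta, delta=1.0):
    """Multiplicative change in the expected rate for a change of
    `delta` units in the predictor."""
    return math.exp(beta * delta)


def predicted_rate(intercept, betas, covariates):
    """Expected rate per person from a log-linear model (assumed form)."""
    eta = intercept + sum(betas[name] * covariates[name] for name in betas)
    return math.exp(eta)


for name, beta in coefficients.items():
    print(f"{name}: rate ratio per unit = {rate_ratio(beta):.4f}")
```

A rate ratio above 1 means the predicted incidence rate rises with the predictor, and a value below 1 means it falls. The size of each ratio depends on the units of the predictor, which is why predictors with large numeric ranges such as GNI have very small coefficients.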


Final model: incidence

[Figures: Predicted vs. observed rate (log scale); Predicted vs. observed rate]

Final model: prevalence (binomial logistic)

Model predictors                 Coefficient (log-odds scale)   Strength
(Intercept)                      -3.03588
Climate score                     0.16039                        160
New laboratory confirmed rate     0.00812                          8
BCG coverage                     -0.03610                        -36
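In a binomial logistic model the linear predictor is on the log-odds scale, so a predicted prevalence is obtained with the inverse logit. The sketch below plugs the reported coefficients into that transformation. The function names are hypothetical, and the units of the three covariates are not stated in the slides, so any input values are placeholders rather than real country data.

```python
import math


def inv_logit(eta):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))


def predicted_prevalence(climate_score, lab_confirmed_rate, bcg_coverage):
    """Predicted prevalence from the reported prevalence model
    coefficients (covariate units are an assumption)."""
    eta = (-3.03588
           + 0.16039 * climate_score
           + 0.00812 * lab_confirmed_rate
           - 0.03610 * bcg_coverage)
    return inv_logit(eta)
```

Because the BCG coefficient is negative, raising `bcg_coverage` with the other inputs held fixed lowers the predicted prevalence, while higher values of the other two predictors raise it.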

Final model: prevalence

[Figures: Predicted vs. observed prevalence; Predicted vs. observed prevalence (logistic scale)]

Final model: mortality (negative binomial)

Model predictors                                   Coefficient (log scale)   Strength
(Intercept)                                       -10.57740
New all forms rate                                  0.71356                     714
BCG coverage                                        0.17597                     176
TB patients recorded as HIV positive                0.16718                     167
Percentage retreatment (out of all cases)           0.15697                     157
Precipitation                                       0.11985                     120
% MDR out of all new cases                          0.05991                      60
Diabetes prevalence                                -0.09344                     -93
Government expenditure on health                   -0.13399                    -134
Urban population                                   -0.14584                    -146
Treatment success rate for re-treatment cases      -0.21212                    -212

Final model: mortality

[Figures: Predicted vs. observed rate (log scale); Predicted vs. observed rate]

Model fit: deviance residuals

[Figures: deviance residual plots for the incidence, prevalence and mortality models (axes: p_hat vs. deviance residual)]

Model validation: cross validation

1. Split data randomly into k partitions
2. For each partition fit specified model using the other k-1 groups
3. Pseudo-R-sq: square of correlation predicted vs. observed

Results:
- Incidence: k=5, R-sq=0.94
- Prevalence: k=2, x5, R-sq=0.76
- Mortality: k=5, R-sq=0.89
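The cross-validation steps can be sketched as below. To keep the example self-contained, a simple least-squares line stands in for the GLMs used in the presentation, and the out-of-fold predictions are pooled before computing the squared correlation; both choices, along with the function names, are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept (stand-in for the real model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def squared_correlation(a, b):
    """Pseudo-R-sq: squared Pearson correlation of predicted vs. observed."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sab * sab / (saa * sbb)


def cross_validated_r2(xs, ys, k=5, seed=0):
    # Step 1: split the data randomly into k partitions.
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [None] * len(xs)
    # Step 2: for each partition, fit on the other k-1 and predict it.
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in fold:
            preds[i] = intercept + slope * xs[i]
    # Step 3: pseudo-R-sq from the pooled out-of-fold predictions.
    return squared_correlation(preds, ys)
```

Repeating the procedure with different seeds and averaging the results gives a repeated k-fold estimate, which is one reading of the "k=2, x5" setting used for the small prevalence dataset.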

[Figure: training set split with K=2. Develop model in one subset; check predicted vs. observed in the other subset.]

Discussion

Predictive models could be fitted for all three tasks
- Goodness of fit satisfactory
- Further refinement of models and database necessary before predictions can be made

Incidence and mortality:
- Include random effects for countries or income status
- Build one model just on high income countries and one just on middle income countries and compare variables/coefficients

Include time lagged variables:
- so far only included lagged TB notification rates (1 and 2 years)

New predictor data recently compiled:
- Number of large cities >500,000 and >1 million inhabitants
- Prevalence of prisoners (UNDP), migrants (World Bank), drug use (UNODC), refugees and displaced populations (UNHCR)

Thank you
