Policy Analysis

Determinants of National Diarrheal Disease Burden

SEAN T. GREEN,*,† MITCHELL J. SMALL,† AND ELIZABETH A. CASMAN†

Engineering and Public Policy, Carnegie Mellon University, Baker Hall 129, Pittsburgh, PA 15213

Received August 19, 2008. Revised manuscript received November 24, 2008. Accepted December 09, 2008.

Diarrheal illness is a leading cause of child mortality in developing nations. Previous longitudinal studies have attempted to identify the factors that contribute to child mortality, but few have examined the determinants of diarrheal illness at a country level. Here we demonstrate the use of Classification and Regression Trees (CART) to predict diarrheal illness from a 192-country data set of country-level attributes and compare the performance of CART with a linear regression model. The CART model identifies improvements in rural sanitation as the most important spending priority for reducing diarrheal illness. We estimate that reducing unmet rural sanitation need worldwide by 65% would save the equivalent of 1.2 million lives annually.

Introduction

Diarrheal illness is estimated to cause 2.2 million deaths per year and is the third leading cause of child mortality worldwide after neonatal disorders and respiratory infections (1, 2). In general, the poorer the country, the larger the proportion of deaths caused by diarrheal illness. One study estimates that 1.8 million children die each year from diarrheal illness but that fewer than 1000 of these deaths occur in the developed world (3). According to the World Health Organization, 88-94% of all cases of diarrheal illness worldwide are attributable to modifiable environmental factors (4), a figure that corresponds to 1.5 million preventable fatalities a year. These deaths, which occur mainly in children, are due to inadequate water and sanitation services, and represent nearly 15% of the nearly ten million childhood deaths that occur each year (5).

Studies have attempted to predict diarrheal illness morbidity and mortality based on demographic, anthropometric, household, or other factors (6-9), and others have used the occurrence of diarrhea to predict other variables related to health (10-12); however, few examples in the literature use national-level indicators to predict diarrhea. Because diarrheal illness is a major cause of child mortality, it is plausible that studies predicting child mortality at the national level may uncover variables useful in predicting the national determinants of diarrhea. In previous regression analysis studies, a small number of country-level attributes have repeatedly been associated with child mortality. The proportion of the population living in urban areas explained 40% of the variation in child mortality in a study of 185 World Bank member countries (13). Another study reported a synergistic interaction between parental education and economic status in reducing child mortality in a group of eight countries in Latin America (9). For a group of 98 countries, 95% of the variation in child mortality was explained by income per capita, income inequality, female literacy, the level of ethno-linguistic fragmentation, and whether or not a country was predominantly Muslim (14).

In these studies, as is often the case, data selection played an important role and the results depended upon the countries selected for analysis. The number of countries included in some studies is determined by which are present in a data set, but in other studies, outlying observations may be omitted to meet the assumption of independent and identically distributed observations. For example, Filmer and Pritchett excluded the two most influential countries in their original data set for clarity and consistency in results (14). They acknowledge that outlying observations are interesting because they underperform or overperform relative to a regression trend due to factors not explained by the analysis. Although selection bias is perhaps inevitable in multicountry studies, some reductions in sample size can be avoided by choosing a method of analysis with no distributional assumptions. Another problem associated with linear regressions is that they cannot (without the use of special techniques such as 2-Stage Least Squares with instrumental variables) detect the presence of reverse causation between dependent and independent variables.

The present work employs the machine learning technique, Classification and Regression Trees (CART) (15-17), to predict the burden of diarrheal disease (2) and compares the CART model results to a linear regression on the same data. The more nuanced results that CART produces may be useful in setting development spending priorities.

Data. Although the previous studies did not explicitly involve diarrheal illness, we presume that the factors important to child mortality are important to diarrheal illness as well because of the fraction of child deaths caused by diarrhea. Consequently, the data for this study included all of the variables found to be of significance in past studies of child mortality, except for whether or not a country is predominantly Muslim. Caldwell believed that “predominantly Muslim” is a surrogate variable for practices regarding education and for other factors (18). It may also be indicative of behavior regarding hygiene (19). We excluded “predominantly Muslim” in favor of the percentage of literate adults, the percentage of literate females, and measures of access to water and sanitation in a country, because the literacy variables are likely to be better indicators of national education practices. In addition, access to water and sanitation may influence the feasibility and effectiveness of hygiene behavior (20-22). Nevertheless, we acknowledge that a better variable for hygiene behavior is perhaps needed.

Several variables were added because of their relevance to preventing diarrheal illness (23-30): measures of internal and external spending on water and health, of recent involvement in wars, of water and sanitation infrastructure, and of the renewable water resources of a country. We have also included Official Developmental Aid (ODA) for Water, a measure of government aid that is part grant and can be used for water supply and sanitation projects, education and training programs, or water resources management and protection (31). The dependent variable for this study is diarrheal illness estimated in disability-adjusted life years (DALYs), and reported for most of the analysis as diarrheal DALYs per million people (dDpm) to facilitate comparison between countries with different-sized populations. A complete list of the variables along with their sources is given in Table 1. The data are from the period 2000-2007 except for the Gini Coefficient (1950-1995) and Ethno-linguistic Fractionalization (1960), which did not have data available for the same period. Although Ethno-linguistic Fractionalization data have no doubt changed somewhat since 1960, Easterly and Levine vouch for the data, which come from ETHNIC, a Soviet data set from the 1960s, and verify them by comparing them with alternative measures of ethno-linguistic diversity. They add that the ETHNIC data are “highly correlated with non-linguistic measures of social polarization” and thus useful for indicating possible tensions within a country (25).

* Corresponding author e-mail: [email protected].
† Carnegie Mellon University.

10.1021/es8023226 CCC: $40.75 2009 American Chemical Society. VOL. 43, NO. 4, 2009 / ENVIRONMENTAL SCIENCE & TECHNOLOGY 993

Published on Web 01/22/2009

The assembled data covered 192 WHO member countries, resulting in a 192 × 15 data matrix.

Some have questioned the validity of cross-national indicator studies because of the suspect quality of some data and have disputed the comparability of national data sets (32, 33); however, these data are the best available. The incorporation of several sources of data in this analysis limits the influence of any one source while the method of analysis, Classification and Regression Trees (CART), was selected because of its ability to address issues of nonlinearity and differential influences of variables for different portions of the input sample space and for different nations. Systematic analysis with the current data provides a snapshot of the current state of knowledge that can be compared with analyses using updated data to track changes in understanding or to highlight errors in previous estimates.

Missing values in the original data set (Table 1) comprised 9.4% of the data. In order to include the largest possible number of countries, maximum likelihood estimates for missing values were imputed using the Expectation-Maximization (EM) algorithm (34-37) for all variables except for Official Developmental Assistance for Water, which was imputed using a regression tree (because these data could not be transformed to have a Gaussian distribution, a prerequisite for using the EM algorithm) (38).

The use of the EM algorithm did not affect the evaluation of the comparative accuracy of linear regression against CART since both methods used data imputed using the EM algorithm (or CART in the case of one variable), but the representativeness of both the CART and linear regression results is likely improved by the use of imputed values.

TABLE 1. Explanation of Variables for Country-Level Data for 192 Countries

| variable name | abbreviation used in Figure 1 (units) | description and source |
| --- | --- | --- |
| dependent variable | | |
| annual diarrheal DALYs per million people | dDpm (life-years per million people per calendar year) | DALYs, or disability-adjusted life years (55), divided by the total country population in 2002 in millions (55) |
| explanatory variables | | |
| peace fraction | (unitless) | number of years from 1996-2005 that the country was at peace divided by 10 years (26) |
| annual official developmental assistance (ODA) for water per capita | per capita external water aid (2003 constant US$ per capita) | five-year average (2000-2004) of external private financing for water projects within a country per population in 2003 (31) |
| renewable water resources per capita | (m³ per person-year) | the maximum possible water available to a country based on a water balance of surface water, extracted groundwater, and precipitation in 2000 (27) |
| Gini coefficient | (unitless) | an index of distribution of income ranging from 0 to 1, where a score of 1 corresponds to the most unequal income distribution (24, 28) |
| rural water access | rural water (percent) | percentage of people living in a rural area within a country who have sustainable access to an improved water source as of 2006 (30) |
| urban water access | urban water (percent) | percentage of people living in an urban area within a country who have sustainable access to an improved water source as of 2006 (30) |
| rural sanitation access | rural sanitation (percent) | percentage of people living in rural areas with sustainable access to an improved sanitation source as of 2006 (30) |
| urban sanitation access | urban sanitation (percent) | percentage of people living in urban areas with sustainable access to an improved sanitation source as of 2006 (30) |
| total health as pct of GDP | percent | the sum of government and private expenditure on health as a percentage of a country’s Gross Domestic Product (GDP) as of 2006 (29) |
| government contribution as pct of total health spending | percent | the sum of payments for health coming from taxes, social security, and external resources divided by the total amount spent on health care as of 2006 (29) |
| external contribution as pct of total health spending | percent | grants for medical care and goods that come from other governments or organizations and are channeled through ministries of health or public agencies in a country divided by the total amount spent on health care as of 2006 (29) |
| per capita government health spending | international $/capita | annual government expenditure for health on a per capita basis as of 2006 (29) |
| adult literacy | percent | estimate of percent of adults in a country who are literate as of 2006 (29) |
| female literacy | percent | estimate of percent of females in a country who are literate as of 2007 (23) |
| ethno-linguistic fractionalization | none | the probability in 1960 that two people meeting at random in a country do not speak the same language (25) |

Features of CART. CART has been used to analyze large data sets in medical, ecological, industrial, and geological research (39-44) and the method is especially good for discerning variable interactions in complex data sets in which the relationships between variables are different depending upon the region of the variable space (45). CART is well suited for the problem at hand because it accommodates noisy and nonlinear data; its results are invariant to monotonic transformations of predictor variables; and it does not require that its predictor variables be normally distributed. Because CART’s performance is not degraded by the inclusion of correlated or superfluous variables, large numbers of potentially significant factors can be screened together. Thus, clarity in the results of a CART analysis does not solely depend on the judicious selection of candidate independent variables or the validity of their mathematical transformations to meet normality assumptions.

The CART algorithm generates a tree diagram whose nodes represent self-similar groupings of data (in our case, groups of countries). CART iteratively chooses the explanatory variable value that splits the data into the two most dissimilar groups of countries with respect to the dependent variable (diarrheal DALYs/million people). This process is continued sequentially along each branch until further bifurcations no longer yield sufficient differentiation to justify their inclusion in the model.
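The split-selection step described above can be sketched for a single explanatory variable. This is a minimal illustration, assuming that "most dissimilar" is measured by the conventional regression-tree criterion of minimizing the summed squared error of the two groups; the function names and the toy numbers are hypothetical and are not taken from the study's data or from the MATLAB routines the authors used.

```python
from typing import List, Optional, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the group mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x: List[float], y: List[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, total_sse) for the rule `x < threshold` that
    minimizes the summed SSE of the two resulting groups.
    Returns None if x takes a single value and cannot be split."""
    pairs = sorted(zip(x, y))
    best: Optional[Tuple[float, float]] = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical x values
        threshold = (lo + hi) / 2.0  # midpoint between neighboring values
        left = [yy for _, yy in pairs[:i]]
        right = [yy for _, yy in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Toy values, made up for illustration (not from the paper's data set).
sanitation = [20.0, 35.0, 50.0, 70.0, 85.0, 95.0]   # percent coverage
burden = [10.2, 10.0, 9.8, 7.1, 6.9, 7.0]           # log-scale burden
threshold, total = best_split(sanitation, burden)
```

A full tree would repeat this search over every explanatory variable at every node and recurse into each branch; pruning then removes splits that do not justify their added complexity.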

Methods

Stepwise Linear Regression and Data Preparation. Variables that were scaled from 1 to 100 were converted to fractions and transformed to logistic (logit) variables, and variables with no logical upper bound, including the dependent variable, diarrheal DALYs per million people, were log transformed.
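The two transformations can be sketched as follows. This is an illustrative sketch only: the natural logarithm and the clipping constant `eps` (used here to keep 0% and 100% finite) are assumptions, since the text does not state the log base or how boundary values were handled.

```python
import math


def logit_from_percent(pct: float, eps: float = 1e-6) -> float:
    """Convert a percentage to a fraction, then apply the logit transform.
    Clipping to [eps, 1 - eps] is an assumption made for this sketch."""
    p = pct / 100.0
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def log_transform(value: float) -> float:
    """Natural-log transform for variables with no logical upper bound
    (the choice of natural log is an assumption)."""
    return math.log(value)


# Example: 63% rural sanitation coverage on the logit scale.
logit_63 = logit_from_percent(63.0)
```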

Stepwise Linear Regression analysis on the 192 × 15 variable matrix with imputed values was performed using MATLAB. Coefficients for variables with p values less than 0.05 are reported.

CART Calculation. MATLAB’s regression tree functions were used to compute the CART model, which was pruned to its final size using the Akaike Information Criterion (AIC) (46, 47).

Comparison between Models. The CART and linear regression models were compared using a version of the AIC, the AICc, which incorporates a small-sample bias correction.
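For reference, the standard small-sample correction adds 2k(k + 1)/(n − k − 1) to the AIC, where n is the number of observations and k the number of fitted parameters. The sketch below uses the common least-squares form of the AIC; treating the reported MSE as RSS/n and the parameter counts chosen for each model are assumptions made for illustration, not details given in the text.

```python
import math


def aic_least_squares(mse: float, n: int, k: int) -> float:
    """AIC for a least-squares fit, up to an additive constant:
    n * ln(RSS / n) + 2k, with RSS / n passed in as `mse`."""
    return n * math.log(mse) + 2 * k


def aicc(aic: float, n: int, k: int) -> float:
    """AIC with the small-sample correction 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n - k - 1 <= 0")
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)


n = 192
# Assumed parameter counts: intercept plus seven coefficients for the
# regression; one fitted mean per leaf (15 leaves) for the tree.
k_regression, k_tree = 8, 15
aicc_regression = aicc(aic_least_squares(0.367, n, k_regression), n, k_regression)
aicc_tree = aicc(aic_least_squares(0.225, n, k_tree), n, k_tree)
```

Lower values indicate a better trade-off between fit and complexity. How parameters are counted for a tree (leaves only, or leaves plus split points) changes the penalty, so the counts above should be read as one possible convention.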

Use of CART in the Predictive Mode. CART predictions were made using the minimum AIC regression tree by iterating through each country of the data set, increasing the value of rural sanitation such that the unmet rural sanitation need was decreased by five percent, then finding the regression tree prediction for the country based on the new rural sanitation value. The iteration was repeated for successive improvements in rural sanitation until the unmet rural sanitation need was reduced to zero.
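The iteration can be sketched as a loop over reduction levels. The sketch assumes each step removes a further five percent of the baseline unmet need; the field names, the population weighting, and the stand-in predictor `toy_predict` (a one-split stump with made-up leaf values) are hypothetical and take the place of the fitted regression tree.

```python
from typing import Callable, Dict, List, Tuple

Country = Dict[str, float]


def sanitation_after_reduction(rural_sanitation_pct: float, reduction: float) -> float:
    """Rural sanitation coverage (percent) after removing a fraction
    `reduction` of the baseline unmet need (100 - coverage)."""
    unmet = 100.0 - rural_sanitation_pct
    return 100.0 - unmet * (1.0 - reduction)


def scenario_totals(
    countries: List[Country],
    predict: Callable[[Country], float],
    n_steps: int = 20,
) -> List[Tuple[float, float]]:
    """For reductions of 0%, 5%, ..., 100% in unmet need, re-predict each
    country's burden (DALYs per million) and sum population-weighted DALYs."""
    totals = []
    for i in range(n_steps + 1):
        reduction = i / n_steps
        total = 0.0
        for country in countries:
            modified = dict(country)  # leave the baseline record untouched
            modified["rural_sanitation"] = sanitation_after_reduction(
                country["rural_sanitation"], reduction
            )
            total += predict(modified) * country["population_millions"]
        totals.append((reduction, total))
    return totals


def toy_predict(country: Country) -> float:
    """Hypothetical stand-in for the fitted tree: a single split at 63%."""
    return 40000.0 if country["rural_sanitation"] < 63.0 else 2000.0


toy_countries = [
    {"rural_sanitation": 30.0, "population_millions": 10.0},
    {"rural_sanitation": 80.0, "population_millions": 5.0},
]
totals = scenario_totals(toy_countries, toy_predict)
```

Because a tree predicts a constant within each leaf, the predicted burden changes only when a country's new sanitation value crosses a split threshold, which produces step-like response curves.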

Results and Discussion

For comparison, the results of a stepwise linear regression on the data, transformed as required to meet normality requirements, are shown in Table 2. Seven of the fifteen variables had statistically significant coefficients. Inequality of income distribution (Gini coefficient) does the most to increase diarrheal DALYs, while adult literacy and per capita GDP do the most to decrease them. The Gini coefficient was the most influential variable in Filmer and Pritchett’s model of child mortality (14) and was also important in other studies (48, 49). Although Filmer and Pritchett used a different dependent variable and a different set of countries, the Gini coefficient is clearly very influential in both regression analyses.

TABLE 2. Stepwise Linear Regression of Annual Diarrheal DALYs per Million People

n = 192, R² = 0.89, MSE = 0.367

| variable and transformation used to achieve normality | coefficient | P value |
| --- | --- | --- |
| intercept | 19.3 | <0.0001 |
| logit (Gini coefficient) | 0.456 | 0.001 |
| logit (adult literacy rate) | -0.365 | <0.0001 |
| log (per capita gross domestic product) | -0.347 | 0.0002 |
| logit (total health spending as percent of GDP) | -0.342 | 0.005 |
| logit (urban water) | -0.193 | 0.001 |
| logit (urban sanitation) | -0.175 | 0.002 |
| logit (external resources for health as percent of total health spending) | 0.0828 | 0.01 |

FIGURE 1. CART model determinants of diarrheal disease. The model is read from the top node down. Nodes are annotated with a variable name and a splitting inequality. Countries for which the inequality is true follow the left branch. For countries following the right branch, the inequality is false. The numbers next to arrows indicate the number of countries that fall on each side of the inequality. Terminal nodes are rectangular in shape and are labeled with the average number of diarrheal DALYs per million people (dDpm) over all countries at the node.


The same data analyzed with CART produced a fitted model with a mean squared prediction error (MSE) of 0.225 (compared to 0.367 for the linear regression model). The resulting regression tree is shown in Figure 1. In Figure 1, each node (branch point) in the tree is annotated with the variable name and the value on which the data are split. For any country, if the inequality written on the node is true, it follows the left arrow. If the inequality is false, it follows the right arrow. (The number of countries beneath each branch is shown on the arrows.) For example, the top node separates countries reporting less than 63% of the rural population covered by improved sanitation services from those exceeding this level of coverage. The conditions that result in a country being classified in a given terminal leaf are read like a flowchart from the top variable to the terminal leaf. Terminal nodes (leaves) are annotated with the diarrheal DALYs per million people averaged over the countries assigned to those leaves.

As with the linear regression model, the results of the CART analysis (Figure 1) also include the Gini coefficient, but its influence is far less prominent. The Gini coefficient only discriminates between two leaves on the far bottom right of the figure. For countries where rural population access to improved sanitation exceeds 63%, with greater than 98% of the urban population having access to water, and receiving more than six cents per capita in development aid for water projects, the subset with a Gini coefficient less than 0.35 averages 470 diarrheal DALYs per million population, while those with a Gini coefficient ≥ 0.35 average 1600 dDpm. For the countries with the highest diarrheal disease burden (bottom left of Figure 1), the Gini coefficient does not appear in the corresponding branches of the regression tree.

The highest average diarrheal disease burden (46 000 dDpm) falls on countries with improved sanitation available to under 63% of the rural population and GDP below $1094 per person. For the three groups of countries (with a total of 54 countries) with the highest diarrheal disease burden (average ≥ 18 000 dDpm), rural sanitation and GDP per capita are the sole determinants. In addition, a fourth group of six countries with a high average burden of 17 000 dDpm is characterized by low rural sanitation, somewhat higher GDP, female literacy below 97%, some external support for health, and low rural water access. These are the countries where reducing the diarrheal disease burden can do the most good. Relative to these 60 countries, the diarrheal disease burden of countries that fall in the rest of the diagram is between a factor of 2 and a factor of 60 less.

The most important variable for reducing diarrheal illness in the worst afflicted countries is rural sanitation, a message that can be obscured with a “one size fits all” linear regression model. Studies have shown that improving sanitation decreases levels of diarrheal illness (21, 50) and that reductions in diarrhea from improved water quality only occur when improved sanitation is present (6). Esrey found that excreta disposal is more important in determining child health in developing areas than water supply and that the provision of water was limited in its effectiveness without accompanying programs to teach proper use (20). The same study also found that improvements in sanitation produced larger reductions in diarrhea among low-income groups than among high-income groups, an argument for provision of sanitation in rural areas when one considers that 76% of the poor in the developing world live in rural areas (51).

Figure 2 shows the relationship between rural sanitation and diarrheal disease burden. Rural sanitation does not explain all of the variation in diarrheal disease burden. In fact, for Europe, South America, and Oceania, diarrheal disease burden varies little over the range of levels of access to rural sanitation. Even though there appears to be a negative correlation between the variables for African nations (more rural sanitation coverage corresponds to fewer dDpm), the scatter is wide. It is not surprising that a linear regression model on these data would not include rural sanitation.

FIGURE 2. Diarrheal DALYs as a function of rural sanitation coverage (2, 30). The vertical line demarcates the level of rural sanitation, 63%, that corresponds to the split at the root node of the CART model.

FIGURE 3. Reduction in diarrheal disease burden as a function of meeting unmet needs for improved rural sanitation. (a) Disease burden measured in per capita annual diarrheal DALYs, and (b) disease burden measured as total diarrheal DALYs.

Figure 3 shows the CART model’s predictions for how annual world diarrheal DALYs would change with reductions in unmet rural sanitation need, in terms of the per capita (Figure 3a) and the total (Figure 3b) disease burden. Unmet rural sanitation need is calculated as 1 minus the fraction of population with access to improved sanitation in rural areas. The CART model predicts the steepest decreases in diarrheal DALYs in the range of about 40 to 65% reduction in unmet rural sanitation need in each country and no decreases beyond 65% reduction in unmet need. Within this range of improvement, countries located at leaves on the left side of the CART model in Figure 1 move to leaves on the right side of the tree. The improvements in diarrheal illness cease when all countries lie on the right side, a region of the CART model in which the rural sanitation variable does not appear, and improvements in rural sanitation have no further effect.

The two continents with the largest annual diarrheal burden, Asia and Africa, would benefit the most from a 65% reduction in unmet rural sanitation need. Africa’s burden would decrease from 26 million to 5 million annual diarrheal DALYs, Asia’s burden would decrease from 38 million diarrheal DALYs to 18 million diarrheal DALYs, and the total world burden would decrease from 66 million to 25 million annual DALYs. The combined effect of reducing Africa’s and Asia’s unmet rural sanitation need is the equivalent of roughly 1.2 million lives saved each year. According to the World Health Organization, 33 DALYs correspond to the death of a person in infancy, and 36 diarrheal DALYs correspond to the death of a person aged 5 to 20 (2). Reducing the unmet rural sanitation need in the rest of the world would save 40 000 lives in addition to the 1.2 million in Africa and Asia.
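The 1.2 million figure can be checked with simple arithmetic from the rounded values quoted above. The sketch assumes that lives saved are obtained by dividing averted DALYs by the DALYs attributed to one death; the text does not spell out the exact conversion, and the variable names are illustrative.

```python
# Averted annual diarrheal DALYs, from the rounded figures in the text.
africa_averted = 26e6 - 5e6      # 21 million
asia_averted = 38e6 - 18e6       # 20 million
combined_averted = africa_averted + asia_averted

# DALYs corresponding to one death, as quoted from the WHO (2).
DALYS_PER_INFANT_DEATH = 33.0
DALYS_PER_DEATH_AGE_5_TO_20 = 36.0

# Dividing by the larger DALY weight gives the lower bound on lives saved.
lives_lower = combined_averted / DALYS_PER_DEATH_AGE_5_TO_20
lives_upper = combined_averted / DALYS_PER_INFANT_DEATH

print(f"{lives_lower / 1e6:.2f} to {lives_upper / 1e6:.2f} million lives per year")
```

Both bounds round to roughly 1.2 million, consistent with the estimate in the text. Because the inputs are rounded to the nearest million DALYs, the smaller rest-of-world figure cannot be reproduced from them.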

These results are shown on a world map in Figure 4. If the need for rural sanitation in each country could be reduced by 65%, only four countries are predicted to remain in the category with the highest diarrheal disease burden (46 000 dDpm).

FIGURE 4. Observed and predicted national rates of diarrheal DALYs. (a) Baseline conditions (2002) and (b) changes in diarrheal disease burden if there were a 60% reduction in rural sanitation needs. Note: the 15 colors correspond to the 15 leaves of the regression tree, so the hue is not scaled to the severity of diarrhea (DALYs per million people).

Of course, access to rural sanitation could not be improved without affecting the other independent variables, and (cross-sectional) empirical statistical models of this type are used for longitudinal prediction only at the analyst’s peril. In addition, the stated importance of rural sanitation in this analysis does not deny the importance of variables not included in this analysis or variables in this analysis whose importance was obscured because of correlation with another variable. Isolating the importance of other prominent variables in the tree would require an exhaustive search of trees built using every possible combination of the variables we included. Still, on the basis of the relationship between rural sanitation and diarrheal DALYs derived from the current input data, and discounting abrupt baseline changes such as large-scale warfare, these predictions are likely the best that can be achieved with current knowledge and data.

We concede that the importance of rural sanitation will decrease as migrants from rural areas settle in urban areas and stretch the capacity of urban water and sanitation infrastructure, or settle in peri-urban slums that have poor access to water and sanitation. Half of the world’s population now lives in cities (52), and as the population in peri-urban and urban areas increases relative to rural areas, rural sanitation’s importance will likely be surpassed by other factors. Data collection efforts that differentiate between urban and peri-urban populations are important because they make it possible to distinguish the unequal levels of access to water and sanitation infrastructure in the two areas. Nonetheless, with nearly three-quarters of the developing world’s poor still living in rural areas, the proportion of the poor living in urban areas will not surpass that in rural areas very soon (51). Efforts to improve rural sanitation will likely be effective at targeting the world’s poorest for some time; however, improvements in sanitation that are accompanied by improvements in water quantity and quality will have an even greater effect (6, 53). This analysis attempts to consider the influence of water availability on diarrheal illness (by including a variable measuring water resources), but it has been shown that water availability may also constrain a country’s ability to support improved sanitation infrastructure that creates a water burden (54). Further work is needed to identify how migration from rural to urban areas will affect the future tradeoffs between the cost of providing infrastructure services and the number of people covered. Rural sanitation options that do not require water, such as ventilated improved pit latrines and composting toilets, may be more cost-effective than urban and peri-urban sanitation options that require more water but reach more people. Nevertheless, the importance in previous studies of improvements in sanitation in promoting health at all levels of water availability (6) suggests that at least some level of improved sanitation in rural areas will produce health benefits in afflicted areas.

Acknowledgments

We thank Daniel Neill and Rahul Tongia for their helpful suggestions. This work was supported by a National Science Foundation Graduate Fellowship and by the Steinbrenner Institute for Environmental Education and Research. M.J.S. was funded by the H. John Heinz III Chair in Environmental Engineering at Carnegie Mellon University.

Supporting Information Available

Details relating to model selection and comparison between linear regression and CART models using information theory criteria. This material is available free of charge via the Internet at http://pubs.acs.org.

Literature Cited

(1) WHO. World Health Report: Making a Difference; World Health Organization: Geneva, 1999.

(2) Mathers, C. D.; Bernard, C.; Ilburg, K. M.; Inoue, M.; Fat, D. M.; Shibuya, K.; Stein, C.; Tomijima, N.; Xu, H. “Global burden of disease in 2002: data sources, methods and results (Global Program on Evidence for Health Policy discussion paper no. 54),” WHO, 2004.

(3) Kaler, S. G. Diseases of poverty with high mortality in infants and children: malaria, measles, lower respiratory infections, and diarrheal illnesses. Ann. N.Y. Acad. Sci. 2008, 1136, 28–31.

(4) Pruss-Ustun, A.; Corvalan, C. “Preventing disease through healthy environments: towards an estimate of the environmental burden of disease,” WHO, 2006.

(5) UNICEF. Progress for Children: A Child Survival Report Card; UNICEF: New York, 2004.

(6) Esrey, S. A. Water, waste, and well-being: a multicountry study. Am. J. Epidemiol. 1996, 143, 608–623.

(7) Victora, C. G.; Vaughan, J. P.; Barros, F. C. The seasonality of infant deaths due to diarrheal and respiratory diseases in southern Brazil, 1974–1978. Bull. Pan. Am. Health Organ. 1985, 19, 29–39.

(8) Mock, N. B.; Sellers, T. A.; Abdoh, A. A.; Franklin, R. R. Socioeconomic, environmental, demographic and behavioral factors associated with occurrence of diarrhea in young children in the Republic of Congo. Soc. Sci. Med. 1993, 36, 807–816.

(9) Hatt, L. E.; Waters, H. R. Determinants of child morbidity in Latin America: a pooled analysis of interactions between parental education and economic status. Soc. Sci. Med. 2006, 62, 375–386.

(10) McGarvey, S. T.; Buszin, J.; Reed, H.; Smith, D. C.; Rahman, Z.; Andrzejewski, C.; Awusabo-Asare, K.; White, M. J. Community and household determinants of water quality in coastal Ghana. J. Water Health 2008, 6, 339–349.

(11) Ricci, J. A.; Jerome, N. W.; Sirageldin, I.; Aly, H.; Moussa, W.; Galal, O.; Harrison, G. G.; Kirksey, A. The significance of children’s age in estimating the effect of maternal time use on children’s well-being. Soc. Sci. Med. 1996, 42, 651–659.

(12) Schroeder, D. G.; Martorell, R.; Rivera, J. A.; Ruel, M. T.; Habicht, J. P. Age differences in the impact of nutritional supplementation on growth. J. Nutr. 1995, 125, S1051–S1059.

(13) Dye, C. Health and Urban Living. Science 2008, 319, 766–769.

(14) Filmer, D.; Pritchett, L. The impact of public spending on health: does money matter? Soc. Sci. Med. 1999, 49, 1309–1323.

(15) Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and Regression Trees; Wadsworth: Belmont, CA, 1984.

(16) Quinlan, J. R. Simplifying decision trees. Int. J. Human-Comput. Studies 1999, 51, 497–510.

(17) Quinlan, J. R.; Rivest, R. L. Inferring decision trees using the Minimum Description Length principle. Inform. Comput. 1989, 80, 227–248.

(18) Caldwell, J. C. Routes to low mortality in poor countries. Popul. Develop. Rev. 1986, 12, 171–220.

(19) Nielsen, M. Childhood diarrhea and hygiene: mothers’ perceptions and practices in the Punjab, Pakistan; IWMI: Colombo, 2001.

(20) Esrey, S. A. Epidemiological evidence for health benefits from improved water and sanitation in developing countries. Epidemiol. Rev. 1986, 8, 117–128.

(21) Esrey, S. A.; Feachem, R. G.; Hughes, J. M. Intervention for the control of diarrheal diseases among young children: improving water supplies and excreta disposal facilities. Bull. World Health Organ. 1985, 63, 757–772.

(22) Fewtrell, L.; Colford, J. M., Jr. “Water, sanitation, and hygiene: interventions and diarrhea, a systematic review and meta-analysis,” World Bank, 2004.

(23) CIA World Factbook. Central Intelligence Agency, https://www.cia.gov/library/publications/the-world-factbook/index.html (accessed Aug 2007).

(24) Deininger, K.; Squire, L. A new data set measuring income inequality. World Bank Econ. Rev. 1996, 10, 565–591.

(25) Easterly, W.; Levine, R. Africa’s growth tragedy: policies and ethnic divisions. Quarterly J. Econ. 1997, 112, 1203–1250.

(26) Gleditsch, N. P.; Wallensteen, P.; Eriksson, M.; Sollenberg, M.; Strand, H. Armed conflict 1946–2001: a new dataset. J. Peace Res. 2002, 39, 615–637.

(27) UN-FAO. Review of World Water Resources By Country; Food and Agriculture Organization of the United Nations: Rome, 2003.

(28) World Income Inequality Database v 2.0b. United Nations University World Institute for Development Economics Research, http://www.wider.unu.edu/research/Database/en_GB/database/ (accessed May 2007).

(29) WHO. World Health Statistics; World Health Organization: Geneva, 2006.

(30) Millennium Development Goals MDGInfo Database. World Health Organization, http://mdgs.un.org/unsd/mdg/default.aspx (accessed Aug 2006).

(31) WWC. Official Development Assistance for Water from 1990 to 2004; World Water Council-World Water Forum, 2006.

(32) Srinivasan, T. N. Data base for development analysis: an overview. J. Develop. Econ. 1994, 44, 3–27.


(33) Atkinson, A. B.; Brandolini, A. Promise and pitfalls in the use of “secondary” data-sets: income inequality in OECD countries as a case study. J. Econ. Lit. 2001, 39, 771–799.

(34) Schmee, J.; Hahn, G. J. A simple method for regression analysis with censored data. Technometrics 1979, 21, 417–432.

(35) Smolinski, A.; Walczak, B.; Einax, J. W. Exploratory analysis of data sets with missing elements and outliers. Chemosphere 2002, 49, 233–245.

(36) Stanimirova, I.; Serneels, S.; Van Espen, P. J.; Walczak, B. How to construct a multiple regression model for data with missing elements and outlying objects. Anal. Chim. Acta 2007, 581, 324–332.

(37) Meng, X.-L.; van Dyk, D. The EM algorithm: an old folk-song sung to a fast new tune. J. Royal Stat. Soc. Ser. B (Methodol.) 1997, 59, 511–567.

(38) Lakshminarayan, K.; Harp, S. A.; Samad, T. Imputation of missing data in industrial databases. Appl. Intell. 1999, 11, 259–275.

(39) Abu-Hanna, A.; de Keizer, N. Integrating classification trees with local logistic regression in Intensive Care prognosis. Artif. Intell. Med. 2003, 29, 5–23.

(40) De’ath, G.; Fabricius, K. E. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 2000, 81, 3178–3192.

(41) Firth, L.; Hazelton, M. L.; Campbell, E. P. Predicting the onset of Australian winter rainfall by nonlinear classification. J. Climate 2005, 18, 772–781.

(42) Lemon, S. C.; Roy, J.; Clark, M. A.; Friedmann, P. D.; Rakowski, W. Classification and regression tree analysis in public health: methodological review and comparison with logistic regression. Ann. Behav. Med. 2003, 26, 172–181.

(43) Tittonell, P.; Shepherd, K. D.; Vanlauwe, B.; Giller, K. E. Unravelling the effects of soil and crop management on maize productivity in smallholder agricultural systems of western Kenya: an application of classification and regression tree analysis. Agric. Ecosyst. Environ. 2008, 123, 137–150.

(44) Valera, V. A.; Walter, B. A.; Yokoyama, N.; Koyama, Y.; Iiai, T.; Okamoto, H.; Hatakeyama, K. Prognostic groups in colorectal carcinoma patients based on tumor cell proliferation and classification and regression tree (CART) survival analysis. Ann. Surg. Oncol. 2007, 14, 34–40.

(45) Zanakis, S. H.; Becerra-Fernandez, I. Competitiveness of nations: a knowledge discovery examination. Eur. J. Oper. Res. 2005, 166, 185–211.

(46) Akaike, H. A new look at the statistical model identification. IEEE Trans. Auto. Control 1974, AC-19, 716–723.

(47) Burnham, K. P.; Anderson, D. R. Multimodel inference: understanding AIC and BIC in model selection. Sociol. Methods Res. 2004, 33, 261–304.

(48) Bidani, B.; Ravallion, M. Decomposing social indicators using distributional data. J. Economet. 1997, 77, 125–139.

(49) Heerink, N.; Folmer, H. Income distribution and the fulfillment of basic needs: theory and empirical evidence. J. Policy Model. 1994, 16, 625–652.

(50) Esrey, S. A. Effects of improved water supply and sanitation on ascariasis, diarrhoea, dracunculiasis, hookworm infection, schistosomiasis, and trachoma. Bull. World Health Organ. 1991, 69, 609–621.

(51) Ravallion, M.; Chen, S.; Sangraula, P. New evidence on the urbanization of global poverty. Popul. Develop. Rev. 2007, 33, 667–701.

(52) Bloom, D. E. Urbanization and the Wealth of Nations. Science 2008, 319, 772–775.

(53) Esrey, S. A.; Habicht, J.-P. Maternal literacy modifies the effects of toilets and piped water on infant survival in Malaysia. Am. J. Epidemiol. 1988, 127, 1079–1087.

(54) Fry, L. M.; Mihelcic, J. R.; Watkins, D. W. Water and nonwater-related challenges of achieving global sanitation coverage. Environ. Sci. Technol. 2008, 42, 4298–4304.

(55) WHO. World Health Report: Changing History; World Health Organization: Geneva, 2004.

ES8023226

VOL. 43, NO. 4, 2009 / ENVIRONMENTAL SCIENCE & TECHNOLOGY 999