
Risk Analysis DOI: 10.1111/risa.12455

Statistical Modeling of Fire Occurrence Using Data from the Tohoku, Japan Earthquake and Tsunami

Dana Anderson,1 Rachel A. Davidson,1,∗ Keisuke Himoto,2 and Charles Scawthorn3

In this article, we develop statistical models to predict the number and geographic distribution of fires caused by earthquake ground motion and tsunami inundation in Japan. Using new, uniquely large, and consistent data sets from the 2011 Tohoku earthquake and tsunami, we fitted three types of models: generalized linear models (GLMs), generalized additive models (GAMs), and boosted regression trees (BRTs). This is the first time the latter two have been used in this application. A simple conceptual framework guided identification of candidate covariates. Models were then compared based on their out-of-sample predictive power, goodness of fit to the data, ease of implementation, and relative importance of the framework concepts. For the ground motion data set, we recommend a Poisson GAM; for the tsunami data set, a negative binomial (NB) GLM or NB GAM. The best models generate out-of-sample predictions of the total number of ignitions in the region within one or two. Prefecture-level prediction errors average approximately three. All models demonstrate predictive power far superior to four from the literature that were also tested. A nonlinear relationship is apparent between ignitions and ground motion, so for GLMs, which assume a linear response-covariate relationship, instrumental intensity was the preferred ground motion covariate because it captures part of that nonlinearity. Measures of commercial exposure were preferred over measures of residential exposure for both ground motion and tsunami ignition models. This may vary in other regions, but nevertheless highlights the value of testing alternative measures for each concept. Models with the best predictive power included two or three covariates.

KEY WORDS: Boosted regression tree; earthquake; fire; generalized additive model; generalized linear model

1. INTRODUCTION

The 2011 Tohoku, Japan earthquake and tsunami caused at least 348 reported fires, more than any other earthquake in history. By comparison, approximately 285 were documented in Kobe, Japan (1995), 12 in Niigata, Japan (1964),(1) 110 in Northridge, California (1994), and 26 in Loma Prieta, California (1989).(2) The Tohoku fires occurred in a variety of land area types from urban to rural, and were caused by two distinct hazards: ground motion and tsunami inundation. As they were all part of a single event, the data set describing the fires could be collected at one time, ensuring a consistency not possible when compiling data from multiple events over many years. Because of the size and features of the fire data set it generated, this event offers a unique opportunity to improve the statistical models of postearthquake ignitions that rely on such data and that are critical for planning for the emergency response needs and total losses that can result from such fires.

1 Department of Civil and Environmental Engineering, University of Delaware, Newark, DE, USA.

2 Building Research Institute, Tachihara 1, Tsukuba, Ibaraki 305-0802, Japan.

3 SPA Risk LLC, San Francisco, CA, USA and Kyoto University (retired), Kyoto, Japan.

∗ Address correspondence to Rachel A. Davidson, Department of Civil and Environmental Engineering, University of Delaware, Newark, DE 19716, USA; tel: +1-302-831-4952; [email protected].

1 0272-4332/15/0100-0001$22.00/1 © 2015 Society for Risk Analysis

2 Anderson et al.

Previous efforts to model postearthquake ignitions fall into two main categories.(3) In the first,(4–7) the probabilities of different mechanisms of ignition (e.g., utility damage, overturning of objects) are estimated separately and combined using fault or event trees. In the second category, which dates back to Kawasumi,(8) statistical models are developed using data from past earthquakes. Although they differ in the data and the specific response variables and covariates they use, almost all previous statistical models have regressed some measure of ignition rate on a single measure of earthquake intensity (e.g., ignitions per sq ft of building area vs. peak ground acceleration, PGA), apparently using least squares regression.(9–11) Ren and Xie(12) estimate the number of ignitions in each geographic area unit as the product of ignition rate from the regression and total building area of the unit. Others(2,13,14) then simulate ignitions for each area unit assuming they follow a Poisson process with that product as the Poisson parameter. Cousins and Smith(15) assume ignitions are normally distributed with mean ignition rate from the regression and a standard deviation of 1. Davidson(16) introduced the use of generalized linear models (GLMs) and generalized linear mixed models (GLMMs) for this application for the first time. Unlike previous models, the approach uses discrete, nonnegative ignition counts as the response variable, examines many possible covariates, and uses a small unit of study to ensure homogeneity in variable values for each area unit. Nevertheless, the data in that analysis were only available for selected jurisdictions for each of six earthquakes, which made it impossible to fully capture the zero counts, i.e., the places where ground shaking was strong enough to cause ignitions but did not. The analysis also did not fully characterize the model's predictive ability with out-of-sample validation.

Using a new data set compiled for the Tohoku earthquake and tsunami, this article offers two main contributions. First, we introduce for this application two additional model types, generalized additive models (GAMs) and boosted regression trees (BRTs), and compare their performance with the GLMs found to be best in Davidson.(16) In the process, we also improve estimation of each model's predictive power, i.e., how well it will predict the number and locations of ignitions in a future earthquake and the magnitude and type of errors to expect. Second, we introduce and compare new models for ground-motion- and tsunami-generated postearthquake ignitions in Japan. The best ground motion models demonstrate predictive power far superior to those available in the literature. To our knowledge, no such model for tsunami-generated ignitions previously existed for any region. In Sections 2 and 3, we describe the data and the three types of statistical models, respectively. The model selection process is discussed in Section 4, followed by the analysis results for the ground motion and tsunami ignition models, respectively, in Sections 5 and 6.

2. DATA DESCRIPTION

2.1. Data Compilation

The study area was defined to be the region in which ignitions were physically possible, which we assumed to be composed of the 17 prefectures that experienced at least one damaged building or fatality (not including Hokkaido) (Fig. 1). This area, which includes 126,768 km² and 54.7 million people (34% of the total area and 43% of the total population of Japan, respectively), corresponds approximately to the region that experienced PGA ≥ 0.035 g, based on data used in the analysis (Section 2.3). Municipalities were taken to be the area units for the analysis, but for the nine largest cities in the area that are administratively divided into ku's (Chiba, Hamamatsu, Kawasaki, Sagamihara, Saitama, Sendai, Shizuoka, Tokyo, and Yokohama), we used the ku's instead. These area units were chosen because many variables are available for municipality/ku, and they are of similar size for analysis and small enough so that covariates are approximately homogeneous within them. Together, 701 municipalities and 56 ku's cover the entire study region. Because 72 municipalities and 6 ku's were missing ignition data (Section 2.2), the final data set included 629 municipalities and 50 ku's (Fig. 1).

Overlaying data from several sources in a geographic information system (GIS), we compiled a data set that includes a value for each variable in Table I for each area unit. For ground motion and inundation covariates, which varied spatially, we used the average over each area unit and over the inundated portion of each area unit, respectively. For analysis, because of the differences in ignition mechanisms, the data set was divided into the ground motion data set, which included the 615 area units with no tsunami inundation (xarea = 0) and only ignitions identified as ground motion generated; and the tsunami data set, composed of the 64 area units with some inundation (xarea > 0) and only ignitions identified as tsunami generated. The two were analyzed separately. Anderson(17) includes the complete data sets.

Statistical Modeling of Postearthquake Fire Occurrence 3

Fig. 1. Map of study area and its location in Japan (inset).

2.2. Ignition Data

Ignition data were collected by the Committee for Postearthquake Fire Research of the Japan Association for Fire Science and Engineering (JAFSE), which included Himoto, co-author of this article. They sent hardcopy mail surveys to all 297 fire services in the 17 affected prefectures.(11) Surveys were mailed in April–May 2012, and reminders were sent from the end of 2012 to the beginning of 2013. In all, 258 surveys (87%) were returned complete. The survey included 12 open-ended questions asking the number of ignitions, and for each ignition, its location (street address), whether it was within the tsunami-inundated area or not, estimated occurrence time, reported time, apparent cause, fire type, consequent losses, and how firefighting activity was conducted.(18) Many of the questions were extracted from the official "Kasai Hokoku" fire report that fire services are required to submit to the larger regional agencies.

The 78 (10%) area units for which ignition data are missing appear to be randomly distributed geographically (Fig. 1), and have similar distributions for the covariates as the area units for which data are available.(17) With no reason to believe otherwise, we thus assume that the missing data are what is known as missing completely at random (MCAR), i.e., unrelated to the number of ignitions or to the covariate values.(19) The observed data can then be thought of as a random subsample of the hypothetically complete data, and omitting those area units from the analysis should not introduce bias in the results.

For consistency, we included only ignitions that (1) were identified as ground motion or tsunami generated (as determined by the JAFSE team based on the specified cause), and (2) occurred within 10 days of the earthquake (i.e., by 11:59 pm March 21).(11) The 10-day cutoff includes most reported earthquake- or tsunami-generated ignitions (84%), and could be considered the ignitions that would be of most interest to emergency responders and risk managers. The data set includes ignitions that self-extinguished or were extinguished by citizens, but not those that did not require firefighting from the beginning. The final ground motion data set included 528 (86%) area units with zero ignitions, 54 with one ignition, 23 with two, four with three, three with four, and one each with five, six, and 11 ignitions. The tsunami data set included 33 (52%) zero counts, 11 area units with one ignition, six with two, five with three, and nine with four or more (with a maximum of 22 ignitions in one area unit).

Table I. Definition, Mean, and Coefficient of Variation (COV) of Variables Used in Each Data Set

| Concept | Variable | Definition | Ground Motion Data Set: Mean | Ground Motion Data Set: COV | Tsunami Data Set: Mean | Tsunami Data Set: COV |
|---|---|---|---|---|---|---|
| Ignitions | yg | Ground motion ignitions in 10 days | 0.2 | 3.3 | – | – |
| | yt | Tsunami ignitions in 10 days | – | – | 1.9 | 2.0 |
| Ground motion | xpsa03 | Average PSA(a), 0.3 s (g) | 0.4 | 0.9 | 0.8 | 0.5 |
| | xpsa10 | Average PSA, 1 s (g) | 0.2 | 0.6 | 0.3 | 0.5 |
| | xpsa30 | Average PSA, 3 s (g) | 0.1 | 0.5 | 0.1 | 0.4 |
| | xpgv | Peak ground velocity, PGV (cm/s) | 18.8 | 0.5 | 31.5 | 0.3 |
| | xpga | Peak ground acceleration, PGA (g) | 0.2 | 0.8 | 0.4 | 0.6 |
| | xii | Instrumental intensity | 6.1 | 0.2 | 7.4 | 0.1 |
| Inundation | xarea | Area that experienced inundation(b) (m²) | – | – | 7,935,734 | 1.4 |
| | xdepth | Average inundation depth(c) (m) | – | – | 2.8 | 0.7 |
| Exposure | xpop | Population(b) | 83,033 | 1.5 | 57,284 | 1.2 |
| | xres | Area of residential zoning(b) (1,000s m²) | 7,054 | 1.5 | 7,884 | 1.4 |
| | xestab | Number of business establishments(b) | 3,846 | 1.5 | 2,698 | 1.3 |
| | xcom | Area of commercial zoning(b) (1,000s m²) | 805 | 1.6 | 819 | 1.5 |
| | xindus | Area of industrial zoning(b) (1,000s m²) | 1,948 | 1.7 | 3,638 | 1.8 |
| Vulnerability | xpwood | % houses that are wooden | 20.4% | 1.0 | 28.6% | 0.9 |
| | xpdam3 | % houses collapsed | 0.1% | 6.9 | 4.2% | 2.3 |
| | xpdam2 | % houses with moderate damage | 0.4% | 6.4 | 2.0% | 1.9 |
| | xpdam1 | % houses with minor damage | 2.0% | 3.5 | 4.6% | 1.8 |
| | xpdam123 | % houses with at least minor damage | 2.4% | 3.5 | 10.8% | 1.5 |
| Damaged buildings | xdam3 | Num. collapsed houses(b) | 10.2 | 11.1 | 572 | 2.5 |
| | xdam2 | Num. houses with moderate damage(b) | 82.1 | 11.2 | 883 | 4.6 |
| | xdam1 | Num. houses with minor damage(b) | 349 | 5.0 | 2,149 | 3.4 |
| | xdam123 | Num. houses with at least minor damage(b) | 441 | 6.1 | 3,605 | 3.3 |
| | xh | Num. houses(d) | 35,534 | 1.5 | 24,847 | 1.3 |

(a) PSA = pseudo-spectral acceleration.
(b) We took the natural log of all exposure variables, damage variables, and xarea before using them in models.
(c) Averaged only over the portion of the area unit with nonzero inundation depth.
(d) Number of houses was used to compute xpwood and to apply the models from the literature (Section 4.2).

2.3. Covariate Data

The choice of candidate covariates was guided by a simple conceptual framework describing the causes of ignitions (Fig. 2), together with data availability. The main reported mechanisms of postearthquake fires in the absence of a tsunami include damage to electric power and gas lines, and overturning or falling of appliances, heating equipment, candles, or other contents.(2,7) Ignitions can be caused directly by building damage (e.g., damage to a wall may cause a water heater to overturn and ignite), but they can also occur in the absence of building damage if, for example, an appliance overturns due to acceleration in an otherwise undamaged building. In the case of tsunami inundation, additional ignition mechanisms include damage to liquefied propane gas cylinders for home heating, automobiles, and oil tanks.(20,21) All these types of ignitions ultimately are expected to be directly related to hazard, exposure (buildings, their contents, and the utilities and cars associated with them), and building vulnerability. We identified multiple possible covariates to represent each of these concepts (Fig. 2; Table I). Within a concept (e.g., ground motion), the candidate covariates listed are alternative measures of the same idea, and as such are highly correlated (i.e., $\rho \geq 0.75$).(17) The variance inflation factors (VIFs) are also quite high. For a negative binomial (NB) GLM with all covariates, the average VIF across covariates is 474, and for all but two covariates, they are over 2.5. By contrast, in the models considered in the cross-validation study, in which no more than one covariate is chosen for each concept, the VIFs are all approximately 1.0, indicating no multicollinearity problems. Given that the covariates within a concept are alternative indicators of the same idea and are highly correlated, the intention is not to include all of them in a single model, but for each concept to choose the most promising covariate of the alternatives. Thus, we defined the following rules for inclusion of the covariates in a model: include at most one of the six ground motion covariates; at most one of the residential or commercial exposure covariates; at most one of the building damage covariates; and at most one of the percentage of building damage covariates. The framework also suggested investigation of interactions between hazard, exposure, and building vulnerability. This approach was developed to avoid blind data mining and multicollinearity problems, and to end up with a model that could be easily interpreted and that is not more complex than necessary to facilitate future application.

Fig. 2. Conceptual framework of ignitions and candidate covariates (variable definitions in Table I).
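To make the inclusion rules concrete, the following Python sketch enumerates every covariate set that uses at most one covariate from each of the four rule groups. It only illustrates the rule itself; the article narrows the choice concept by concept rather than fitting every combination. The group memberships are read from Table I and the rule wording, and the function name `candidate_covariate_sets` is hypothetical. Covariates that the rules do not mention (xindus, xpwood, xh, and the inundation covariates) are left out as a simplifying assumption.

```python
from itertools import product

# Rule groups, read from Table I and the inclusion rules (an assumption of this sketch).
GROUND_MOTION = ["xpsa03", "xpsa10", "xpsa30", "xpgv", "xpga", "xii"]
EXPOSURE = ["xpop", "xres", "xestab", "xcom"]           # residential or commercial exposure
DAMAGED = ["xdam3", "xdam2", "xdam1", "xdam123"]        # number of damaged houses
PCT_DAMAGED = ["xpdam3", "xpdam2", "xpdam1", "xpdam123"]  # percentage of damaged houses


def candidate_covariate_sets(groups):
    """Yield every covariate set with at most one covariate per group."""
    # None stands for "this concept is not represented in the model".
    options = [[None] + group for group in groups]
    for choice in product(*options):
        yield tuple(c for c in choice if c is not None)


candidates = list(
    candidate_covariate_sets([GROUND_MOTION, EXPOSURE, DAMAGED, PCT_DAMAGED])
)
print(len(candidates))      # number of admissible covariate sets, including the empty set
print(candidates[-1])       # one example set with a covariate from every group
```

Because each group contributes its own members plus the option of being omitted, the number of admissible sets is the product of (group size + 1) over the groups.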

Data for the ground motion and tsunami inundation hazard covariates were found from the U.S. Geological Survey Shakemap archives(22) and Geospatial Information Authority of Japan,(23) respectively. Population and establishment data were collected from the Japanese Statistics Bureau.(24) Data on number of wooden buildings, number of houses (to normalize wooden buildings), and zoning were all collected from the Ministry of Land and Infrastructure.(25,26) The vulnerability to damage and damaged buildings information was found through the Fire and Disaster Management Agency.(27) To avoid numerical problems associated with highly skewed distributions, before fitting any models, we took the log of all exposure and damage covariates, as well as area of tsunami inundation xarea (e.g., using lxres = ln(xres) instead of xres).

3. STATISTICAL MODEL TYPES

The goal of this analysis was to develop a model to predict the expected number of ignitions in each area unit $i$ as a function of the attributes of the area unit captured in the covariate vector $\vec{x}_i$. Three model types were considered: GLMs, GAMs, and BRTs. These were selected to represent examples of parametric, nonlinear, and nonparametric model types, respectively. The GLMs allow comparison with a previous similar analysis for California.(16) The GAMs extend those to allow the possibility of nonlinearity in the covariates, which was considered a potential issue. The BRTs provide an example of a different, nonparametric approach. Although other types of methods are certainly possible (e.g., random forest, multivariate adaptive regression splines, or support vector machines), these three provide a reasonable diversity of approaches given the nature of the problem. All models were fitted using R software version 3.0.2 using default settings except where noted.(28) Poisson GLMs, NB GLMs, GAMs, and BRTs were fitted using the glm {stats}, glm.nb {MASS v7.3-29}, gam {mgcv v1.7-27}, and gbm {gbm v2.1-0.3} functions (in the noted package), respectively.(29–32)

3.1. Generalized Linear Models

GLMs are a generalization of ordinary linear regression that allow for response variables that are not normally distributed.(33) Poisson and NB GLMs are types of GLMs that are useful when the response variable represents nonnegative, integer count data, as in this analysis. In a Poisson GLM, the observations of the response variable $y_i$ are assumed to be generated from a Poisson distribution, and the mean of the distribution, $\mu_i \equiv E[y_i \mid \vec{x}_i]$, is related to covariates $\vec{x}_i$ through Equation (1), where $\vec{\beta}$ is a vector of unknown parameters.(34) Thus, the parameter $\mu_i$ of the distribution (the estimated mean) varies by area unit $i$ depending on the values of the covariates $\vec{x}_i$ for that area unit.

$$\ln(\mu_i) = \vec{x}_i^{T}\vec{\beta} = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ji} \tag{1}$$

An NB GLM is similar except that the counts $y_i$ are assumed to be generated from an NB distribution. A Poisson GLM includes the implicit assumption that, conditional on the covariates, the mean and variance are equal. In reality, data are often overdispersed (i.e., the conditional variance is greater than the conditional mean), and thus the Poisson assumption is not appropriate. In that case, an NB GLM may be preferred because in the NB model, the conditional mean is still $\mu_i$, but the conditional variance is as defined in Equation (2), where $\alpha \geq 0$ is the overdispersion parameter. The NB also has a larger expected proportion of zero counts and a thicker right tail than the Poisson.(34)

$$\mathrm{Var}[y_i \mid \vec{x}_i] = \mu_i + \alpha\mu_i^2 \tag{2}$$

When $\alpha = 0$, the NB model reduces to the Poisson, so a larger estimated $\alpha$ value indicates NB is a more appropriate model than a Poisson. In this analysis, preliminary modeling suggested that the Poisson GLM was not appropriate, and thus we focused on NB GLM models.
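As a small numerical illustration of Equations (1) and (2), the sketch below computes the expected count under the log link and the corresponding NB variance, and shows that the variance equals the mean when the overdispersion parameter is zero. The coefficient and covariate values are hypothetical and chosen only for illustration; they are not fitted values from this study, and the function names are not from any package.

```python
import math


def expected_count(beta0, betas, x):
    """Mean count under the log link of Equation (1): ln(mu) = beta0 + sum(beta_j * x_j)."""
    linear_predictor = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return math.exp(linear_predictor)


def nb_variance(mu, alpha):
    """Conditional variance of Equation (2); alpha = 0 recovers the Poisson case."""
    return mu + alpha * mu ** 2


# Hypothetical coefficients and covariate values (not estimates from the article).
beta0, betas = -9.0, [1.0, 0.5]
x = [6.1, math.log(805.0)]  # e.g., an intensity measure and a logged exposure measure

mu = expected_count(beta0, betas, x)
print("mean:", round(mu, 3))
print("Poisson variance (alpha = 0):", round(nb_variance(mu, 0.0), 3))
print("NB variance (alpha = 2):", round(nb_variance(mu, 2.0), 3))
```

The gap between the last two printed values grows with the square of the mean, which is why overdispersion matters most for the area units with the largest expected counts.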

3.2. Generalized Additive Models

GAMs are an extension of GLMs in which the only change is that the linear terms in Equation (1), $\beta_j x_{ji}$, are replaced by nonparametric smooth functions, $s_j(x_{ji})$.(30) Although different smooth functions can be selected, we used thin plate regression splines, which are the default in the {mgcv} package. The benefit of a GAM over a GLM is that it allows more flexibility in the dependence of the response on the covariates without requiring specification of detailed parametric relationships. The challenge is that GAMs require determining how to represent the smooth functions and how smooth they should be. If any smooth functions were allowed, maximum likelihood estimation would tend to estimate ones that overfit the data. To avoid this, {mgcv} uses penalized likelihood maximization, in which a penalty for each smooth function is added to the model (negative log) likelihood to penalize its "wiggliness" (or lack of smoothness). For each term, a smoothing parameter controls the tradeoff between smoothness and goodness of fit. In {mgcv}, smoothing parameters are estimated automatically so as to minimize prediction error, as measured with an unbiased risk estimator (UBRE) criterion when the scale parameter is known (or generalized cross-validation, GCV, if the scale parameter is not known, as in the case of the quasi-Poisson or NB model). UBRE is a linear transformation of the Akaike information criterion (AIC).(30) Based on early results, we also set $\gamma = 2$ to avoid overfitting the smooths. The parameter $\gamma \geq 1$ is a factor that inflates the degrees of freedom in the UBRE or GCV score, thus encouraging a smoother model.(30)

3.3. Boosted Regression Trees

BRTs are a newer method that differs fundamentally from GLMs and GAMs because instead of aiming to fit a single best model to relate a response variable to a set of covariates, BRTs work by fitting and combining a large number of relatively simple models. We chose to try BRTs for this application because they are able to handle interactions and nonlinear relationships well, and at least in some cases, they have been shown to provide better predictions than GLMs and GAMs.(35) BRTs combine statistical and machine learning techniques through the use of two algorithms: classification and regression trees (CART) and boosting.(36,37) The former is used to develop individual models; the latter builds and combines collections of those individual models to optimize predictive performance.

CART is a nonparametric, rule-based classification technique that groups observations with similar values of the response variable using a binary recursive partitioning algorithm.(37,38) Starting with the entire data set, the algorithm selects a covariate to be the basis of the first split and a value of that covariate to be the split point that defines the two groups. Similar binary partitions are applied recursively until a stopping criterion is reached, with splitting variables and split points determined each time so as to minimize prediction errors. To avoid overfitting, a single tree is often fitted by first growing a large tree, then pruning it by collapsing branches that have the least power to classify instances. The result is a tree that specifies how to group observations based on their values on each of the splitting variables. For regression trees (which are used in this analysis), the response for each observation is assumed to be the mean response of all observations in the same terminal node of the tree. Regression trees are intuitive, unaffected by changing covariate measurement scales, and insensitive to outliers or inclusion of extra covariates, but they exhibit relatively poor predictive performance. Combining CART with boosting aims to overcome that limitation. Boosting is an iterative, forward, stage-wise procedure. In the first step, a regression tree is fitted to the data to minimize a selected loss function (deviance is used in the {gbm} package). In each subsequent step, a tree is fitted to the residuals of the previous trees and the new tree is added to all the previous trees (which are left unchanged). The final BRT model is a linear combination of many regression trees (usually thousands).

Because our response variable, number of ignitions in area unit $i$, is count data, we specified the loss function to be the Poisson deviance.(32) Although the variance of the response is greater than the mean, rather than equal to it as a Poisson assumes (2.7 times greater for the ground motion data set), Poisson deviance is the most appropriate loss function for count data in BRTs and thus is typically used.(31,32) We set the bag fraction to the default value of 0.5, meaning that at each step, a randomly selected 50% of the training set data is used. This typically improves speed and accuracy.(39) Fitting BRTs requires setting three main parameters: (1) learning rate or shrinkage lr, which determines the contribution of each tree to the model; (2) tree complexity or interaction depth tc, which indicates the number of splits used for each tree; and (3) number of trees or iterations nt, which indicates the maximum number of trees used.
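The stage-wise idea behind boosting can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the {gbm} algorithm used in the analysis: it assumes a squared-error loss instead of the Poisson deviance, a single covariate, trees with one split (tc = 1), and no bag fraction. The data values and the function names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(x, r):
    """Find the single split of a one-dimensional covariate that minimizes squared error."""
    best = None
    for s in sorted(set(x))[:-1]:  # candidate split points (largest value excluded)
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda v: lm if v <= s else rm  # prediction = mean of the terminal node


def boost(x, y, n_trees=200, learning_rate=0.1):
    """Stage-wise boosting: each new stump is fitted to the residuals of the current model."""
    base = sum(y) / len(y)
    trees, pred = [], [base] * len(y)
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, resid)
        trees.append(tree)  # earlier trees are left unchanged
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda v: base + learning_rate * sum(t(v) for t in trees)


def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical toy data: counts that rise nonlinearly with an intensity measure.
x = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
y = [0, 0, 0, 1, 1, 2, 2, 3]

mean_only = lambda v: sum(y) / len(y)
model = boost(x, y)
print("training MSE, mean only:", round(mse(mean_only, x, y), 3))
print("training MSE, boosted stumps:", round(mse(model, x, y), 3))
```

A smaller learning rate shrinks the contribution of each tree, so more trees are needed to reach the same fit. That tradeoff is why the learning rate and the number of trees are tuned together, and why the number of trees should be chosen on held-out data rather than on the training error shown here.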

4. MODEL SELECTION PROCESS

The aim of the selection process was to identify and compare the best models based on predictive power, fit to the sample data, and ease of use. There were two main steps: (1) preliminary analyses for each model type, and (2) cross-validation analysis to compare the best models of each type, plus a few others of interest.

4.1. Preliminary Analyses for Each Model Type

For the GLMs, we first selected a covariate to represent each concept in the framework (Fig. 2). Because we wanted to include at most one of the six candidate ground motion covariates (see rules in Section 2.3), for example, we first compared models that were the same except for the covariate used to represent ground motion. We then chose the most promising models considering only that smaller set of covariates and including the possibility of interactions suggested by the framework (hazard covariate*exposure covariate, or hazard*exposure*building vulnerability covariates). In all cases, GLMs were compared based on the following goodness-of-fit metrics: (1) deviance pseudo-$R^2$, $R^2_{dev}$; (2) pseudo-$R^2$ based on $\alpha$, $R^2_{\alpha}$; and (3) Akaike information criterion, AIC. In linear regression models, $R^2$ represents goodness-of-fit as the percentage of variability in the observations $y_i$ that a model explains. Alternative pseudo-$R^2$ statistics have been developed for nonlinear and discrete count models. The deviance pseudo-$R^2$ is defined as $R^2_{dev} = 1 - [D(y, \mu)/D(y, \bar{y})]$, where $D(y, \mu)$ is the deviance for the fit model and $D(y, \bar{y})$ is the deviance for the intercept-only model (a model with $\mu_i = \bar{y}$ for all $i$). It measures reduction in deviance due to the inclusion of covariates, and generally increases with the addition of new covariates, but cannot be used to compare Poisson and NB models.(40) For NB models, $R^2_{\alpha} = 1 - (\alpha/\alpha_0)$ is an appropriate measure as well, where $\alpha$ and $\alpha_0$ are the overdispersion parameters (Equation (2)) from the fit model and the intercept-only model, respectively. Although both $R^2_{dev}$ and $R^2_{\alpha}$ always lie between zero and one, the former measures the fraction of total variation in the raw counts $y_i$ explained by the model, and the latter measures the fraction of the variation in the true Poisson means explained by the model. When the mean count $\mu_i$ is small (as in this case), most of the variability in the counts $y_i$ can be due to the Poisson variability, and thus $R^2_{dev}$ is much lower than $R^2_{\alpha}$. The AIC is defined as $\mathrm{AIC} = -2\log L + 2q$, where $\log L$ is the log-likelihood, $q$ is the number of independent parameters, and a smaller AIC indicates the preferred model.(41) Unlike the pseudo-$R^2$ metrics, AIC penalizes more complicated models. With an aim to fit the data well without overfitting them, we also selected the promising GLMs based on a preference for simpler models when two fit the data equally well. In some cases, models were included for purposes of comparing with other model types using the same terms. A similar process was followed with the GAMs, but using the UBRE/GCV as an additional goodness-of-fit metric (Section 3.2).
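The three goodness-of-fit metrics can be written out as short Python functions. This is a minimal sketch under one stated assumption: the deviance is taken to be the Poisson deviance, $D(y, \mu) = 2\sum_i [y_i \ln(y_i/\mu_i) - (y_i - \mu_i)]$, whereas an NB model would use the NB deviance in its place. The function names and the example values are hypothetical and are not results from the article.

```python
import math


def poisson_deviance(y, mu):
    """Poisson deviance; the term y*ln(y/mu) is taken as 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total


def r2_dev(y, mu):
    """Deviance pseudo-R^2: 1 - D(y, mu) / D(y, ybar)."""
    ybar = sum(y) / len(y)
    return 1.0 - poisson_deviance(y, mu) / poisson_deviance(y, [ybar] * len(y))


def r2_alpha(alpha, alpha0):
    """Pseudo-R^2 based on the overdispersion parameters of the fit and intercept-only models."""
    return 1.0 - alpha / alpha0


def aic(log_lik, q):
    """AIC = -2 log L + 2 q; smaller values indicate the preferred model."""
    return -2.0 * log_lik + 2.0 * q


# Hypothetical observed counts and fitted means for six area units.
y = [0, 0, 1, 0, 2, 3]
mu = [0.2, 0.3, 0.8, 0.4, 1.9, 2.4]

print("R2_dev:", round(r2_dev(y, mu), 3))
print("R2_alpha:", r2_alpha(alpha=0.5, alpha0=2.0))
print("AIC:", aic(log_lik=-10.0, q=3))
```

Adding a covariate generally lowers the deviance and so raises the deviance pseudo-$R^2$, while the $2q$ term in the AIC offsets part of that gain, which is the sense in which only the AIC penalizes complexity.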

For the BRTs, the preliminary analysis began with the smaller sets of covariates (one for each concept) determined for the GLMs and GAMs, and then focused on determining appropriate values of the three main modeling parameters: learning rate lr, tree complexity tc, and maximum number of trees nt. As suggested in Elith et al.,(36) we used cross-validation with deviance reduction as the measure of success to compare all combinations of values of learning rate (0.01, 0.005, 0.001), tree complexity (1, 2, 3, 5), and maximum number of trees (15,000), and thus choose the optimal ones for our data sets.(17) Based on that analysis, we used lr = 0.001, tc = 2 or 3, and nt = 15,000 for all ground motion runs; and lr = 0.001 or 0.0005, tc = 3, and nt = 15,000 for all tsunami runs. However, because the BRT can overfit the data when too many trees are included, within each run, the 10-fold cross-validation option of gbm was used to identify the optimal number of trees. Note that this cross-validation exercise to determine lr, tc, and nt is separate from that described in Section 4.2.

4.2. Cross-Validation Analysis

The best models of each of the three types were then compared, plus a few additional models of interest. To compare predictive power of the models, we conducted a 10-fold cross-validation (CV). The observations were partitioned into 10 randomly sampled folds. For each fold, the 90% of observations not in the fold comprised the training set. They were used to fit the models, which were then applied to predict values for each of the 10% of observations in the fold (the validation set). In this way, we developed out-of-sample predictions of ignition rate $\mu_i$ for each observation, and estimates of mean absolute error (MAE), square root of the mean squared error (RMSE), mean error (ME), and error in the expected total number of ignitions for the region (TE) and for each prefecture $P$ (TEP) (Equations (3)–(7)), where $y_i$ is the observed number of ignitions in area unit $i$ and $n_P$ is the set of observations $i$ in prefecture $P$. To minimize the effect of the particular fold sample, we repeated the cross-validation 140 times, each with a different set of randomly generated folds, and averaged the resulting 140 estimates of each error metric (Equations (3)–(7)) for each model.

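$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \mu_i| \tag{3}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \mu_i)^2} \tag{4}$$

$$\mathrm{ME} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \mu_i) \tag{5}$$

$$\mathrm{TE} = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \mu_i \tag{6}$$

$$\mathrm{TE}_P = \sum_{i \in n_P} y_i - \sum_{i \in n_P} \mu_i \quad \forall P \tag{7}$$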
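Equations (3)–(7) and the random fold assignment translate directly into code. The Python sketch below is a minimal illustration with hypothetical observed counts, predictions, and prefecture labels; the function names `error_metrics` and `make_folds` are not from any package, and the model-fitting step inside each fold is omitted.

```python
import math
import random
from collections import defaultdict


def error_metrics(y, mu, prefecture):
    """Out-of-sample error metrics of Equations (3)-(7) for observed y and predicted mu."""
    n = len(y)
    diffs = [yi - mi for yi, mi in zip(y, mu)]
    tep = defaultdict(float)
    for d, p in zip(diffs, prefecture):
        tep[p] += d  # error in the total count for each prefecture
    return {
        "MAE": sum(abs(d) for d in diffs) / n,
        "RMSE": math.sqrt(sum(d * d for d in diffs) / n),
        "ME": sum(diffs) / n,
        "TE": sum(diffs),  # positive values mean the model underestimates the total
        "TEP": dict(tep),
    }


def make_folds(n, k=10, seed=0):
    """Randomly partition observation indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


# Hypothetical observed counts, out-of-sample predictions, and prefecture labels.
y = [2, 0, 1, 0]
mu = [1.0, 0.5, 1.0, 0.5]
prefecture = ["A", "A", "B", "B"]

print(error_metrics(y, mu, prefecture))
print(make_folds(n=25, k=10, seed=1))
```

In this toy example the regional total error is zero even though the two prefecture-level errors are not, which shows why TE alone cannot describe how well a model reproduces the geographic distribution of ignitions. Repeating the analysis with different `seed` values and averaging the metrics mirrors the repeated cross-validation described above.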

For the ground motion data set, we also compared our models to four from the literature: Kawasumi, Mizuno, HAZUS, and Zhao. Using the form of each model presented in Zhao et al.,(14) we refit the models with our ground motion data set and computed the predictive errors. For Kawasumi, ln(yg/xh) is a linear function of ln(xpdam3). For Mizuno, ln(−ln(1 − (yg/xh))) is a linear function of ln(1 − xpdam3). For HAZUS and Zhao, yg/xh is a second-order polynomial function and a linear function of xpga, respectively. These analyses had to be done using the larger prefecture as the area unit rather than municipality/ku, or there would be too many zeros. As a result, there were only 17 observations, and we used leave-one-out rather than 10-fold cross-validation.

5. GROUND MOTION RESULTS

Each model allows prediction of the expected number of ignitions $\mu_i$ for each area unit $i$ (and $\alpha$ for NB models) as a function of the attributes of the hazard intensity and built environment in the area unit. The $\mu_i$ (and $\alpha$) define the Poisson (or NB) distribution that describes occurrence of ignitions in that area unit. Thus, each model can be used as a predictive tool, and to understand the relative importance of variables contributing to ignitions. With these uses in mind, we first discuss the results of the preliminary analyses, and then compare the models according to: (1) predictive power (i.e., ability to predict both total number of ignitions for the whole region and geographic distribution of ignitions); (2) goodness-of-fit to the data; and (3) relative importance of covariates. Finally, we discuss recommended models and their application as predictive tools.

5.1. Preliminary Analysis Results

For the GLMs, in the first step, instrumental intensity (xii) and area of commercial zoning (xcom) were clearly the preferred measures of ground motion intensity and residential/commercial exposure, respectively. The best NB GLM with xpga underestimated the total count by TE = 18.2, compared with TE = 4.1 for the best NB GLM with xii (Table II). For GAMs, instrumental intensity (xii) and PGA (xpga) were both promising measures of ground motion intensity, and area of commercial zoning (xcom) and number of business establishments (xestab) were both promising measures of residential/commercial exposure. A nonlinear relationship is apparent between ignition rate and ground motion, with the marginal increase in ignition rate declining as ground motion increases (e.g., Fig. 6). Because the GLM cannot capture that nonlinearity, instrumental intensity xii, which is defined to be more directly related to damage and has a nonlinear relation with PGA, provides a better fit to the data.(42) Because GAMs can capture the nonlinearity, the choice between xii and xpga is not as pronounced. The preference for an indicator of commercial exposure (xcom or xestab) over residential exposure (population, xpop, or area of residential zoning, xres) is also notable because one might expect the latter to be preferred, as most postearthquake ignitions occur in residential buildings,(2) and xpop would be easier data to collect for application of the model. Although xcom and xestab are highly correlated with xpop (ρ = 0.8), they appear to be consistently preferred at least in part because of three observations in Tokyo (Chiyoda, Minato, and Shinjuku), which have multiple ignitions (2, 7, and 1, respectively) and higher values of xcom and xestab than expected given the population. Thus, xcom and xestab are better able to predict those counts than xpop. In other regions, xpop may be equally appropriate.

In both GLMs and GAMs, the choice of covariates to represent the concepts of damaged buildings and vulnerability of buildings to damage did not matter greatly to the model fit. Based on comparison, the number of houses with minor damage (xdam1) and the percentage of houses that collapsed (xpdam3), respectively, were selected. In the second step, we determined that the area of industrial zoning (xindus), building damage (xdam1), and building vulnerability to damage (xpdam3) covariates did not contribute substantially to the model goodness of fit, and thus they were not considered in the final models.

5.2. Predictive Power

Table II compares the most promising of eachmodel type according to five measures of predictivepower (defined in Section 4.2): RMSE, MAE, ME,TE, and E[|TEP |]. Comparing the RMSE, MAE, andME across models suggests that the average errors(RMSE and MAE) are slightly smaller for BRTsthan GLMs and GAMs, but the BRT predictions arebiased a bit low (with a mean error of 0.03 for themodels in Table II).

To gain more insight into the practical significance of the errors, it is useful to examine the models' ability to predict the total number of ignitions, both for the entire study region and, to gauge how well the models capture the geographic distribution, for each prefecture. Table II shows that although the BRTs had the best RMSE and MAE values, they underestimate the total number of ignitions by almost 20. The Poisson GAMs, on the other hand, are within one ignition of the observed total of 147. Remembering that these are out-of-sample predictions, the GLMs and GAMs provide results that should be accurate enough to be of practical use. All four models from the literature provide substantially worse predictions, with errors that are two to six times larger (Table II).

Fig. 3 summarizes each model's predictive errors by prefecture, TEP. Table II includes the average of absolute ignition count errors across prefectures (almost all are between 2 and 3). These results suggest that although the number of ignitions for the region is predicted quite well, there is some error in the geographic distribution. In particular, the GLMs have somewhat higher prefecture errors than the other model types, and the BRTs underestimate prefecture counts more often than they overestimate them. The largest errors are underestimation in Miyagi and Iwate prefectures, and overestimation in Tochigi.

5.3. Goodness of Fit

Because the models are intended for use in prediction, predictive power is arguably more important than goodness of fit, and although we seek models that fit the sample data well, we do not want to


Table II. Summary Comparison of Selected Most Promising Models of Each Model Type, for Ground Motion Data Set

Columns: root mean squared error (RMSE); mean absolute error (MAE); mean error (ME); total error (TE); total error as a percentage (TE %); average of total absolute prefecture errors (E[|TEP|], shown as "Avg. abs. TEP"); α; R²α; R²dev; UBRE/GCV; AIC; Moran's I p-value; model formula.

| Model | RMSE | MAE | ME | TE | TE % | Avg. abs. TEP | α | R²α | R²dev | UBRE/GCV | AIC | Moran's I p-value | Formula |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 NB GLM | 0.689 | 0.301 | 0.01 | 4.1 | 3% | 2.8 | 0.99 | 0.82 | 0.43 | – | 557 | 0.36 | xii + lxcom:xii |
| 2 NB GLM | 0.692 | 0.302 | 0.01 | 4.6 | 3% | 2.8 | 1.00 | 0.82 | 0.43 | 0 | 557 | 0.24 | xii + lxcom |
| 3 NB GLM | 0.696 | 0.303 | 0.01 | 4.8 | 3% | 2.9 | 0.99 | 0.82 | 0.43 | – | 559 | 0.28 | xii + lxcom + xpwood |
| 4 P GAM | 0.688 | 0.298 | 0.00 | 0.3 | 0% | 2.5 | – | – | 0.46 | −0.39 | 564 | 0.97 | s(xii) + s(lxcom) |
| 5 P GAM | 0.691 | 0.298 | 0.00 | 0.3 | 0% | 2.4 | – | – | 0.44 | −0.35 | 579 | 0.57 | s(xii, lxcom, xpwood) |
| 6 P GAM | 0.693 | 0.299 | 0.00 | 0.6 | 0% | 2.1 | – | – | 0.49 | −0.38 | 553 | 0.82 | s(xii) + s(lxcom) + s(xpwood) |
| 7 NB GAM | 0.670 | 0.289 | 0.00 | 0.3 | 0% | 2.5 | 0.51 | 0.90 | 0.49 | −0.50 | 542 | 0.61 | s(xpga) + s(lxestab) + s(xpwood) + s(xpga, lxestab) + s(xpga, lxestab, xpwood) |
| 8 NB GAM | 0.680 | 0.291 | −0.01 | −3.3 | −2% | 2.1 | 0.51 | 0.90 | 0.48 | −0.53 | 536 | 0.17 | s(xpga) + s(lxestab) |
| 9 NB GAM | 0.688 | 0.298 | 0.00 | 1.6 | 1% | 2.4 | 0.84 | 0.83 | 0.46 | −0.51 | 557 | 0.26 | s(xii, lxcom, xpwood) |
| 10 BRT | 0.663 | 0.288 | 0.03 | 18.8 | 13% | 3.2 | – | – | 0.63 | – | – | – | xpga + lxcom + xpwood + lxindus |
| 11 BRT | 0.652 | 0.278 | 0.03 | 19.1 | 13% | 2.4 | – | – | 0.62 | – | – | – | xpga + lxestab + lxindus + xpdam3 + xpwood + lxdam1 |
| 12 BRT | 0.682 | 0.287 | 0.03 | 19.6 | 13% | 2.6 | – | – | 0.63 | – | – | – | xii + lxcom + xpwood + lxindus |
| 13 HAZUS | 8.7 | 5.6 | −2.6 | −44.4 | −30% | 5.6 | – | – | – | – | – | – | xpga (2nd order polynomial) |
| 14 Zhao | 9.1 | 5.9 | −3.2 | −55.2 | −38% | 5.9 | – | – | – | – | – | – | xpga (linear) |
| 15 Kawasumi | 16.8 | 12.4 | −7.3 | −124.5 | −85% | 12.4 | – | – | – | – | – | – | ln(xpdam3) (linear) |
| 16 Mizuno | 15.8 | 11.1 | −7.6 | −129.9 | −88% | 11.1 | – | – | – | – | – | – | ln(1 − xpdam3) (linear) |

Notes: For all smooths shown, k = 20. For all BRTs shown, tc = 2, lr = 0.001, and nt = 15,000. Observed total number of ignitions = 147. Six left-most columns computed in cross-validation. Six right-most columns computed based on fitting model with the full data set. R²dev for Poisson and NB models are not directly comparable.


Fig. 3. Errors in total number of ignitions TEP (observed − predicted) by prefecture and model, for ground motion models. Labeled column layers are for Miyagi prefecture. Each prefecture name includes the total number of observed ground motion ignitions.

Fig. 4. qq plots of deviance residuals for (a) Poisson GAM Model 5 and (b) NB GAM Model 9, for ground motion data set.

overfit them. Nevertheless, goodness-of-fit measures were useful in the preliminary analyses to identify promising models for the CV analysis and can help identify model misspecification; thus we include them (computed using the full data set) in Table II. First, we consider the assumption of a Poisson versus NB distribution for the ignition counts. A quantile–quantile (qq) plot of deviance residuals provides an indication of the appropriateness of the assumed distribution. The closer the residuals are to the 45° line, the more appropriate the model. As an example, Fig. 4 shows qq plots for Models 5 and 9, which are the same except the former is a Poisson GAM and the latter is an NB GAM. It suggests that the NB is a more appropriate assumption in this case, and the same conclusion holds for most similar models we tested. The overdispersion parameter values α > 0 suggest some preference for an NB model as well. Nevertheless, the predictive errors are larger for the NB GAM (TE = 4.5 compared with 0.3 for the Poisson).
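To make the qq check concrete, the sketch below computes Poisson deviance residuals and pairs the sorted residuals with reference quantiles. It is an illustration under stated assumptions: the standard normal is used as the reference distribution and (i − 0.5)/n as the plotting positions, neither of which is confirmed by the text as the authors' choice, and the function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def poisson_deviance_residuals(y, mu):
    """Signed square roots of each observation's contribution to the Poisson deviance."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)  # assumed strictly positive
    # y * ln(y / mu) is taken to be 0 when y == 0
    safe_ratio = np.where(y > 0, y / mu, 1.0)
    log_term = np.where(y > 0, y * np.log(safe_ratio), 0.0)
    contribution = 2.0 * (log_term - (y - mu))
    return np.sign(y - mu) * np.sqrt(np.maximum(contribution, 0.0))

def qq_points(residuals):
    """Return (theoretical quantiles, sorted residuals) for a normal qq plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    probs = (np.arange(1, n + 1) - 0.5) / n  # assumed plotting positions
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, r

# Toy counts and fitted means (hypothetical values)
res = poisson_deviance_residuals(y=[0, 1, 3, 0, 2], mu=[1.0, 1.0, 2.0, 0.3, 2.5])
theo, obs = qq_points(res)
print(np.column_stack([theo, obs]))
```

Plotting `obs` against `theo` and comparing with the 45° line gives the kind of display shown in Fig. 4; systematic departures in the upper tail are a typical sign that the assumed distribution has too little variance.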

Second, in this analysis, we implicitly assume that, conditional on the covariate values, the observations are independent. Because the observations are actually derived from spatial data, we check the reasonableness of that assumption by determining whether there is spatial correlation in the residuals. If there is, that would suggest we should use a more complicated model that assumes spatially correlated residuals.(43) For each model, we used the package {ape} in R to compute Moran's I for the residuals.(44) Moran's I is a commonly used measure of spatial autocorrelation. For each model, the p-value for the null hypothesis that there is no spatial autocorrelation is included in Table II. In each case, there is no evidence to suggest rejecting that hypothesis and moving to a more complicated model.
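For readers unfamiliar with the statistic, the following sketch computes Moran's I for a vector of residuals given a spatial weight matrix. The permutation test used for the p-value is a stand-in chosen for simplicity and is not claimed to match how {ape} computes its p-value; the weight matrix, function names, and toy values are hypothetical.

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I = (n / S0) * (z' W z) / (z' z), with z the mean-centered values."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

def morans_i_permutation_test(values, weights, n_perm=999, seed=0):
    """Two-sided permutation p-value for the null of no spatial autocorrelation."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    observed = morans_i(values, weights)
    expected = -1.0 / (len(values) - 1)  # expectation of I under the null
    permuted = np.array([morans_i(rng.permutation(values), weights)
                         for _ in range(n_perm)])
    n_extreme = np.sum(np.abs(permuted - expected) >= abs(observed - expected))
    return observed, (n_extreme + 1) / (n_perm + 1)

# Four area units on a line; neighbors share a boundary (hypothetical layout)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
I, p = morans_i_permutation_test([1.0, 2.0, 3.0, 4.0], W)
print(I, p)
```

A large p-value, as reported for every model in Table II, means the residuals show no detectable tendency for neighboring area units to err in the same direction.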

5.4. Relative Importance of Covariates

Refitting each model with the full data set, we can examine how the models compare in terms of the relative importance they assign to each concept in the framework: ground motion, exposure, vulnerability, and damage. For the BRTs, the {gbm} package provides a measure of the relative importance of each covariate. It is “based on the number of times a variable is selected for splitting, weighted by the squared improvement to the model as a result of each split, and averaged over all trees.”(36) It is then scaled so the contributions of all covariates sum to 100, with higher numbers indicating greater contributions. For each GLM and GAM, we estimated the relative importance of each covariate by computing the range of the response μ over the observed range of the covariate, then normalizing those values so that they sum to 100 over all covariates. For the same models as in Table II, Fig. 5 shows the relative importance of each concept by model (when there were multiple covariates for a concept, we added their contributions). It shows that ground motion and exposure are most important, with vulnerability covariates never exceeding a contribution of 14 out of 100, and damage not contributing to most models. For comparison, we included some models with only a ground motion covariate in the CV analysis, but found that although they could estimate the total ignition count well, they did not capture the geographic distribution well. For example, a Poisson GAM with only the covariate xii had TE = −0.2, but the average absolute value of the prefecture error was E[|TEP|] = 4.2, with much higher errors for Tokyo, for example, which experienced relatively low ground motion intensity but has almost twice the exposure of the next largest prefecture. Comparing Models 4 and 6 (Poisson GAMs without and with xpwood included) suggests that including the vulnerability covariate xpwood can reduce prefecture errors from E[|TEP|] = 2.5 to 2.1. The smooth for xpwood in Model 6, however, is not monotonically increasing, but rather “wiggles” up and down, which has no ready interpretation and may indicate overfitting. The results also show that the NB GLM assigns a much higher contribution to ground motion than to exposure, whereas the reverse is true for the GAMs and BRTs. This is likely related to the inability of the NB GLM to capture the nonlinearity in the ground motion intensity.
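The range-based importance measure described above for GLMs and GAMs can be sketched as follows. The text does not say how the other covariates are treated while one covariate is varied, so holding them at their sample means is an assumption made here; the function name, the stand-in prediction function, and its coefficients are hypothetical as well.

```python
import numpy as np

def relative_importance(predict, X, names, grid_size=50):
    """Range of predicted mu over each covariate's observed range, scaled to sum to 100."""
    X = np.asarray(X, dtype=float)
    baseline = X.mean(axis=0)  # assumption: other covariates held at their means
    ranges = []
    for j in range(X.shape[1]):
        grid = np.tile(baseline, (grid_size, 1))
        grid[:, j] = np.linspace(X[:, j].min(), X[:, j].max(), grid_size)
        mu = predict(grid)
        ranges.append(mu.max() - mu.min())
    ranges = np.array(ranges)
    return dict(zip(names, (100.0 * ranges / ranges.sum()).tolist()))

# Stand-in for a fitted log-linear model with hypothetical coefficients
def demo_predict(G):
    return np.exp(-3.0 + 0.8 * G[:, 0] + 0.2 * G[:, 1])

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 5.0, size=(100, 2))
print(relative_importance(demo_predict, X_demo, ["ground_motion", "exposure"]))
```

Summing the returned values for all covariates that represent one concept gives the per-concept contributions of the kind plotted in Fig. 5.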

5.5. Recommended Models

The results show that the best specific model depends to some extent on what one considers the most important criteria. No one model is best across all metrics, and the models differ in ease of use as well. Although all model types are intended to be used in a predictive mode, doing so is more straightforward for GLMs, which result in a closed-form expression, than for GAMs or BRTs, which do not. Some specific models also require data for more covariates or for covariates that are more difficult to measure consistently (e.g., percentage of buildings damaged). Nevertheless, taken together, the results suggest that a GAM provides the best predictive power and Model 4 would be a reasonable choice. It achieves low predictive errors (0.3 in the total number of ignitions and an average of 2.5 for each prefecture). It is a relatively simple model with just two covariates, and the smooths for Model 4 (Fig. 6) make sense intuitively, with larger values of ground motion intensity and exposure leading monotonically to higher ignition rates. Model 4 would be straightforward to implement in that it requires data for only two covariates that are relatively easy to find. If one required the closed-form equation provided by an NB GLM, Model 1 would be a reasonable choice, with final form as given in Equation (8).

$$\ln(\mu_i) = -10.075 + 0.874\,x_{ii} + 0.066\,x_{ii}\ln(x_{com}) \tag{8}$$
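Because Equation (8) is in closed form, it can be evaluated directly. The sketch below does so for a few area units and sums the results to a regional total. The input values are invented for illustration, and the units of xcom are not restated in this section, so they should be taken from the data description rather than from this example.

```python
import math

def expected_ignitions_eq8(x_ii, x_com):
    """Expected ignitions from Equation (8): ln(mu) = -10.075 + 0.874*x_ii + 0.066*x_ii*ln(x_com)."""
    return math.exp(-10.075 + 0.874 * x_ii + 0.066 * x_ii * math.log(x_com))

def total_expected_ignitions(area_units):
    """Sum mu_i over (x_ii, x_com) pairs to get the expected total for a region."""
    return sum(expected_ignitions_eq8(x_ii, x_com) for x_ii, x_com in area_units)

# Hypothetical area units: (instrumental intensity, commercial zoning area)
units = [(4.5, 50.0), (5.0, 100.0), (5.5, 400.0)]
for x_ii, x_com in units:
    print(x_ii, x_com, expected_ignitions_eq8(x_ii, x_com))
print("regional total:", total_expected_ignitions(units))
```

The interaction term means the effect of commercial exposure grows with intensity: at low xii, additional commercial area changes μi little, while at high xii the same increase has a larger multiplicative effect.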

5.6. Future Application of Models

The selected model can be used to compute the expected number of ignitions μi for each area unit i, and those μi can then be used for prediction for a specified historical or hypothetical earthquake scenario. To obtain the expected number of ignitions for the total area or a specified prefecture, one can just sum the expected number of area unit ignitions


Fig. 5. Relative contribution of variable representing each concept by model type and number, for ground motion data set.

Fig. 6. Smooths for P GAM Model 4, for ground motion data set (y-axis labels include the estimated degrees of freedom of the smooth).

μi as we did in this study (Equations (6) and (7)). One can also use the Poisson or NB distribution with those estimated parameters to simulate many (say, 1,000) realizations of ignition maps, each of which can be used as input for fire spread models to estimate damage and losses. Summing the ignitions in each map allows one to make a histogram of the total number of ignitions, providing a description of the variability around the expected total regional and prefecture ignition counts. Fig. 7 shows an example of two such histograms developed for Models 5 and 9 using the out-of-sample predictions for the Tohoku earthquake (and common random numbers to reduce sampling variability when comparing models). Note that although both are centered on the mean of 147 ignitions (the observed value), there is still a great deal of variability around that expected value, so any one of the thousand realizations of ignition maps from the simulation may have many fewer or more than 147. Note also that, as expected, the assumption of an NB distribution results in greater variability than Poisson (Equation (2)). This is consistent with the R²dev and R²α values (Table II), which suggest that although a large proportion of the variation in the true Poisson means is explained by the models (high R²α values), a much smaller proportion of the variation in the raw ignition counts yi is explained (lower R²dev values).
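The simulation of ignition maps described above might look like the sketch below. It assumes the NB variance takes the form μ + αμ², which may or may not be the parameterization in Equation (2), and it does not implement the common random numbers mentioned in the text. The function name and all input values are hypothetical.

```python
import numpy as np

def simulate_ignition_maps(mu, alpha=None, n_sims=1000, seed=0):
    """Draw n_sims ignition maps; each row holds one count per area unit."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)  # assumed strictly positive
    if alpha is None or alpha == 0:
        return rng.poisson(mu, size=(n_sims, len(mu)))
    # Assumed NB form: mean mu, variance mu + alpha * mu**2
    r = 1.0 / alpha        # NB size parameter
    p = r / (r + mu)       # success probability that gives mean mu
    return rng.negative_binomial(r, p, size=(n_sims, len(mu)))

# 50 hypothetical area units, each with an expected 3 ignitions
mu_demo = np.full(50, 3.0)
poisson_totals = simulate_ignition_maps(mu_demo).sum(axis=1)
nb_totals = simulate_ignition_maps(mu_demo, alpha=0.8).sum(axis=1)
print("Poisson totals: mean", poisson_totals.mean(), "sd", poisson_totals.std())
print("NB totals:      mean", nb_totals.mean(), "sd", nb_totals.std())
```

Each row of the returned array is one realization that could be passed to a fire spread model, and a histogram of the row sums is the kind of display shown in Fig. 7. Restricting the sum to the columns of one prefecture gives the corresponding prefecture-level distribution.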


Fig. 7. Histograms of total number of ignitions simulated using out-of-sample predictions from Models 5 (Poisson GAM) and 9 (NB GAM), for ground motion data set.

6. TSUNAMI RESULTS

For the tsunami data set, the preliminary GLM and GAM analyses suggested that xpga, xpgv, and xii were candidates for best ground motion covariate. Both ln(xarea) and xdepth were potentially useful covariates. For exposure, ln(xestab) again appeared to give the best fits, although for GAMs, the superiority was less pronounced than for the ground motion data set, so ln(xpop) was considered as well. The preference for commercial covariates was likely due to the fact that a few area units had no residential zoning and very small population. Of the damage and vulnerability covariates, only xpdam3 appeared likely to improve the fit substantially and was considered in the CV analysis.

Table III compares some of the most promising models of each model type for the tsunami-generated ignition data. Again, although the BRT average prediction errors (RMSE and MAE) are comparably small, they are biased low compared with the other model types, with ME > 0.2. In terms of error in the predicted total number of ignitions for the region, the BRTs are again the worst, underestimating the total number of 119 by at least 15 ignitions (12%). Interestingly, although Poisson GAMs had excellent predictive power for the ground motion models, they perform relatively poorly for tsunami models. The two P GAMs shown in Table III are the best of the 13 included in the CV analysis, and they still overestimate the total number of ignitions by 7.1 and 10.5 ignitions, and have the highest prefecture-level errors (see E[|TEP|]). The qq plots of deviance residuals and α values (Table III) provide further evidence that the NB distribution is a better fit for the tsunami ignitions than the Poisson distribution.(17) Comparing NB GLMs and NB GAMs suggests comparable performance in terms of predictive power, so the simpler NB GLMs are preferred. The Moran's I values do not provide evidence of spatial correlation in the residuals for any of the models.

Fig. 8, which shows the errors in prefecture-level ignition counts, TEP, indicates that, as with the ground motion models, the tsunami models underestimate the number of ignitions in Miyagi and Iwate prefectures, and overestimate the number in Fukushima. This suggests the errors are not due to misidentifying some ignitions as tsunami versus ground motion generated, but rather that there may be additional covariates that are important for determining the number of ignitions that are not captured in these models. Some ignitions in Fukushima also may not have been reported because many in that area evacuated due to the nuclear power plant threat. BRTs again tend to underestimate more than overestimate prefecture-level ignition totals.

Finally, Fig. 9 shows the relative importance of the concepts in each tsunami model. In the best model types (NB GLMs and NB GAMs), inundation covariates are most important, followed by ground motion covariates, which are about half as important, and then exposure. This is in contrast to the ground motion models, in which exposure was the most important concept (Fig. 5). Comparing Models 1–4, for example, suggests that while including xdepth and ln(xestab) reduces the RMSE and MAE, it does not improve the prefecture-level errors.

Again, more than one model is promising, depending on the relative importance afforded the different criteria. Nevertheless, Model 1 could be one reasonable choice, offering a relatively simple model (an NB GLM with just two covariates, xpga and ln(xarea)) and relatively small predictive errors (Equation (9)).

$$\ln(\mu_i) = -11.556 + 1.680\,x_{pga} + 0.726\,\ln(x_{area}) \tag{9}$$
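Exponentiating both sides of Equation (9) gives a multiplicative form that may be easier to interpret. The rounded multipliers below follow from the published coefficients alone; the units of xpga and xarea are those of the data set and are not restated here.

```latex
\begin{align*}
\mu_i &= e^{-11.556}\; e^{1.680\,x_{pga}}\; x_{area}^{\,0.726} \\[4pt]
\frac{\mu_i(x_{pga} + 1,\; x_{area})}{\mu_i(x_{pga},\; x_{area})} &= e^{1.680} \approx 5.37 \\[4pt]
\frac{\mu_i(x_{pga},\; 2\,x_{area})}{\mu_i(x_{pga},\; x_{area})} &= 2^{0.726} \approx 1.65
\end{align*}
```

Under this model, doubling xarea raises the expected number of ignitions by about 65%, which is less than proportional because the exponent on xarea is below 1.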

The tsunami models can be used for predicting the number and geographic distribution of ignitions in any future, historical, or hypothetical tsunami. Application of the models in a predictive mode works just as it does for the ground motion models, as described in Section 5.6.


Table III. Summary Comparison of Selected Most Promising Models of Each Model Type, for Tsunami Data Set

Columns: root mean squared error (RMSE); mean absolute error (MAE); mean error (ME); total error (TE); total error as a percentage (TE %); average of total absolute prefecture errors (E[|TEP|], shown as "Avg. abs. TEP"); α; R²α; R²dev; UBRE/GCV; AIC; Moran's I p-value; model formula.

| Model | RMSE | MAE | ME | TE | TE % | Avg. abs. TEP | α | R²α | R²dev | UBRE/GCV | AIC | Moran's I p-value | Formula |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 NB GLM | 3.11 | 1.81 | −0.02 | −1.4 | −1% | 3.5 | 1.08 | 0.62 | 0.44 | – | 199 | 0.54 | xpga + lxarea |
| 2 NB GLM | 3.11 | 1.82 | 0.00 | −0.2 | 0% | 3.5 | 1.04 | 0.63 | 0.45 | – | 200 | 0.63 | xpga + lxarea + xdepth |
| 3 NB GLM | 2.99 | 1.72 | 0.02 | 1.3 | 1% | 3.7 | 0.92 | 0.67 | 0.47 | – | 200 | 0.55 | xpga + lxarea + xdepth + lxestab |
| 4 NB GLM | 3.04 | 1.77 | 0.01 | 0.5 | 0% | 3.8 | 1.02 | 0.64 | 0.44 | – | 200 | 0.53 | xpga + lxarea + lxestab |
| 5 P GAM | 3.72 | 2.03 | −0.11 | −7.1 | −6% | 9.6 | – | – | 0.47 | 1.691 | 248 | 0.53 | s(lxarea) + s(lxestab) |
| 6 P GAM | 3.76 | 2.02 | −0.16 | −10.5 | −9% | 10.1 | – | – | 0.53 | 1.500 | 235 | 0.62 | s(lxarea) + s(lxpop) |
| 7 NB GAM | 3.38 | 1.91 | 0.01 | 0.7 | 1% | 7.8 | 1.11 | 0.61 | 0.43 | 0.115 | 199 | 0.86 | s(lxarea) + s(lxestab) |
| 8 NB GAM | 3.92 | 1.96 | −0.03 | −2.0 | −2% | 3.3 | 0.98 | 0.65 | 0.46 | 0.110 | 196 | 0.56 | s(xpga) + s(lxarea) |
| 9 NB GAM | 3.85 | 1.92 | 0.01 | 0.9 | 1% | 3.8 | 0.90 | 0.68 | 0.47 | 0.192 | 197 | 0.54 | s(xpga) + s(lxarea) + s(lxestab) |
| 10 NB GAM | 3.23 | 1.87 | 0.09 | 5.5 | 5% | 4.9 | 0.83 | 0.71 | 0.48 | 0.234 | 197 | 0.56 | s(xpga) + s(lxarea) + s(lxpop) |
| 11 BRT | 3.46 | 1.82 | 0.24 | 15.1 | 12.6% | 6.4 | – | – | 0.78 | – | – | – | xpgv + lxarea + xdepth + lxestab + xpdam3 + lxdam1 |
| 12 BRT | 3.32 | 1.76 | 0.27 | 17.1 | 14.4% | 6.1 | – | – | 0.79 | – | – | – | xpga + lxarea + xdepth + lxestab + xpdam3 + lxdam1 |

Notes: For all smooths shown, k = 5. For all BRTs shown, tc = 3, lr = 0.001, and nt = 15,000. Observed total number of ignitions = 119. Six left-most columns computed in cross-validation. Six right-most columns computed based on fitting model with the full data set. R²dev for Poisson and NB models are not directly comparable.


Fig. 8. Errors in total number of ignitions TEP (observed − predicted) by prefecture and model, for tsunami data set. Labeled column layers are for Miyagi prefecture. Each prefecture name includes the total number of observed tsunami ignitions.

Fig. 9. Relative contribution of variable representing each concept by model type and number, for tsunami data set.

7. CONCLUSIONS

In this article, we developed new ignition models using data from the Tohoku earthquake and tsunami. We compared three different model types (GLMs, GAMs, and BRTs) for two data sets (ground motion- and tsunami-generated ignitions). Results of a cross-validation analysis show that all models demonstrate predictive power far superior to those that were tested from the literature. In general, the BRTs result in small mean errors, but are not as good as the other model types in predicting the total number of ignitions for the region. For the ground motion data set, a GAM is recommended; for the tsunami data set, an NB GLM or NB GAM is preferred. Out-of-sample predictions by the best models predicted the total number of ignitions in the region within one or two. At the prefecture level, however, errors were greater (approximately three on average), underpredicting the number in Miyagi and Iwate prefectures in both cases, and overpredicting in Tochigi for the ground motion data set and Fukushima for the tsunami data set. The analysis suggests that exposure, then ground motion, had the greatest contributions in the ground motion ignition models, and inundation, then ground motion, in the tsunami ignition models. A nonlinear relationship was apparent between ignitions and ground motion, so for GLMs, which assume a linear response-covariate relationship, instrumental intensity was preferred over other possible ground motion covariates because it captures part of that nonlinearity. Measures of commercial exposure were preferred over measures of residential exposure for both ground motion and tsunami ignition models. This may vary in other study regions, but does highlight the value of testing alternative measures for each concept. The models with the best predictive power included two or three covariates. Those with just one were not able to capture as much variability; those with more did not improve the predictive ability and in some cases overfit the data.

Though the proposed models are able to capture a great deal of the variability in the mean number of ignitions per municipality/ku, it is important to remember that substantial Poisson (or NB) variability remains in predicting the specific ignition map that will occur for a given earthquake or tsunami. Furthermore, although the data sets enjoy several strengths (Section 1), it is important to remember that they come entirely from a single earthquake in Japan and thus may not be applicable in other countries and may not capture features associated with the time of occurrence (day of the year or time of the day), which is thought to affect the occurrence of postearthquake ignitions.(2)

The analysis also suggests a few notes about the process. In any statistical analysis like this, it is important to ensure that the proposed model makes sense given what is known about the phenomenon (postearthquake ignitions in this case). Thus, the conceptual framework is a useful tool for guiding the choice of covariates and ensuring that the contribution of each is reasonable in terms of parameter sign for GLMs or smooth shape for GAMs. In particular, although GAMs provide an excellent option, care must be taken to ensure the smooths are not overfitted. The analysis also shows the value of comparing models across multiple criteria: total region and prefecture-level predictive error, ease of application and use, and goodness of fit to the sample data. No single model performs best across all of them, and thus the choice will depend on the intended use.

In this article, we have applied two modeling techniques (GAMs and BRTs) not previously used in this application area. Using a newly available, unique data set, we have introduced new models of ground motion-generated ignitions for Japan that are a substantial improvement over available models, both in terms of the statistical methods used and the demonstrated predictive power. We have also introduced the first statistical models of tsunami-generated ignitions. These new models can be used for prediction of the number and geographic distribution of ignitions in future earthquakes and tsunamis.

ACKNOWLEDGMENTS

The authors thank the National Science Foundation (CMMI-1138675) for financial support of this research, and also thank the Committee for Postearthquake Fire Research of the Japan Association for Fire Science and Engineering, which provided ignition data. The author Himoto worked at Kyoto University for part of the time during which the research was conducted.

REFERENCES

1. Murata A, Iwami T, Hokugo A, Murosaki Y. Mechanism of the outbreak of fire in the 1995 Hyogo-ken Nambu earthquake: In comparison with past earthquake fire cases. Journal of Architecture, Planning and Environmental Engineering, Architectural Institute of Japan, 2001; 548:1–8.

2. Scawthorn C, Eidinger J, Schiff A. Fire Following Earthquake, Technical Council on Lifeline Earthquake Engineering Monograph No. 26. Reston, VA: American Society of Civil Engineers, 2005.

3. Lee S, Davidson R, Ohnishi N, Scawthorn C. Fire following earthquake: Reviewing the state-of-the-art of modeling. Earthquake Spectra, 2008; 24(4):1–35.

4. Mohammadi J, Alyasin S, Bak D. Investigation of Cause and Effects of Fires Following the Loma Prieta Earthquake. IIT-CE-92-01. Chicago, IL: Illinois Institute of Technology Civil Engineering, 1992.

5. Williamson R, Groner N. Ignition of Fires Following Earthquakes Associated with Natural Gas and Electric Distribution Systems. Berkeley, CA: Pacific Earthquake Engineering Research Center Report, University of California, 2000.

6. Zolfaghari MR, Peyghaleh E, Nasirzadeh G. Fire following earthquake, infrastructure ignition modeling. Journal of Fire Sciences, 2009; 27:45–79.

7. Yildiz S, Karaman H. Post-earthquake ignition vulnerability assessment of Kucukcekmece District. Natural Hazards and Earth System Sciences, 2013; 13:3357–3368.

8. Kawasumi H. Examination of Earthquake-Fire Damage in Tokyo Metropolis, Technical Report. Tokyo: Tokyo Fire Department, 1961.

9. Trifunac MD, Todorovska MI. The Northridge, California, Earthquake of 1994: Fire ignition by strong shaking. Soil Dynamics and Earthquake Engineering, 1998; 17:165–175.

10. Architectural Institute of Japan (AIJ). Report on the Hanshin-Awaji Earthquake Disaster, Vol. 6. Tokyo, Japan: Maruzen Publishing, 1998 (in Japanese).

11. Himoto K, Yamada M, Nishino T. Analysis of ignitions following the 2011 Tohoku earthquake using Kawasumi model. Proceedings of the 11th International Symposium on Fire Safety Science, February 10–14, Christchurch, New Zealand, 2014. Available at: www.iafss.org/publications/fss/11/24.

12. Ren A, Xie X. The simulation of post-earthquake fire-prone area based on GIS. Journal of Fire Sciences, 2004; 22(5):421–439.

13. Li J, Jiang J, Li M. Hazard analysis system of urban post-earthquake fire based in GIS. Acta Seismologica Sinica, 2001; 14(4):448–455.

14. Zhao S, Xiong L, Ren A. A spatial-temporal stochastic simulation of fire outbreaks following earthquake based on GIS. Journal of Fire Sciences, 2006; 24:313–339.

15. Cousins WJ, Smith W. Estimated losses due to post-earthquake fire in three New Zealand cities. Proceedings of the 2004 New Zealand Society of Earthquake Engineering (NZSEE) Technical Conference, March 19–21, Rotorua, New Zealand, 2004. NZSEE, Paper 28.

16. Davidson R. Modeling post-earthquake fire ignitions using generalized linear (mixed) models. Journal of Infrastructure Systems, 2009; 15(4):351–360.

17. Anderson D. Statistical models of post-earthquake ignitions based on data from the Tohoku, Japan earthquake and tsunami [master's thesis]. Newark, DE: University of Delaware, 2014.

18. Murata A, Hokugo A. Questionnaires survey on fires after the Great East Japan Earthquake 2011, Part 1: Summary of questionnaires survey and the cause of the fire. Kasai, 2013; 63(1):1–6.

19. Baraldi A, Enders C. An introduction to modern missing data analyses. Journal of School Psychology, 2010; 48:5–37.

20. Hokugo A. Mechanism of tsunami fires after the Great East Japan Earthquake 2011 and evacuation from the tsunami fires. Procedia Engineering, 2013; 62:140–153.

21. Tanaka T. Characteristics and problems of fires following the Great East Japan earthquake in March 2011. Fire Safety Journal, 2012; 54:197–202.

22. Wald D, Worden B, Quitoriano V, Pankow K. ShakeMap Manual: Technical Manual, Users' Guide, and Software Guide, version 1.0. Reston, VA: U.S. Geological Survey, 2006.

23. Geospatial Information Authority of Japan (GIAJ). Provision of Information on the 2011 Off the Pacific Coast of Japan Earthquake, 1/25000 Scale Inundation Area, 2011. Available at: www.gsi.go.jp/kikaku/kikaku40014.html, Accessed April 2012.

24. Japan Statistics Bureau, Ministry of Internal Affairs and Communications, 2009 and 2010. Available at: www.stat.go.jp/english/, Accessed November 2013.

25. Ministry of Land, Infrastructure, Transport and Tourism (MLIT). Housing and Land Survey, 2008. Available at: www.stat.go.jp/data/jyutaku/2008/, Accessed January 2013.

26. Ministry of Land, Infrastructure, Transport and Tourism (MLIT). National Land Numerical Information Download Service, 2011. Available at: http://nlftp.mlit.go.jp/ksj-e/, Accessed January 2013.

27. Fire and Disaster Management Agency (FDMA). 148th Report on the 2011 Off the Pacific Coast of Tohoku Earthquake, 2013. Available at: www.fdma.go.jp/bn/higaihou/pdf/jishin/148.pdf, Accessed September 2013.

28. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2013. Available at: www.R-project.org.

29. Venables WN, Ripley BD. Modern Applied Statistics with S,4th ed. New York: Springer, 2002.

30. Wood S. Generalized Additive Models: An Introduction withR. Boca Raton, FL: Chapmen & Hall, 2006.

31. Ridgeway G with contributions from others. GBM: Gen-eralized Boosted Regression Models. R package ver-sion 2.1-0.3, 2013. Available at: http://code.google.com/p/gradientboostedmodels/.

32. Ridgeway G. 2012. Generalized Boosted Models: A Guideto the GBM Package. R Package Vignette. Available at:http://CRAN.R-project.org/package=gbm.

33. McCullagh P, Nelder J. Generalized Linear Models, 2nd ed. London: Chapman & Hall, 1989.

34. Cameron A, Trivedi P. Regression Analysis of Count Data. Cambridge: Cambridge University Press, 1998.

35. Elith J, Graham CH, Anderson RP, Dudík M, Ferrier S, Guisan A, Hijmans RJ, Huettmann F, Leathwick JR, Lehmann A, Li J, Lohmann LG, Loiselle BA, Manion G, Moritz C, Nakamura M, Nakazawa Y, Overton J, Peterson AT, Phillips SJ, Richardson KS, Scachetti-Pereira R, Schapire RE, Soberon J, Williams S, Wisz MS, Zimmermann NE. Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 2006; 29:129–151.

36. Elith J, Leathwick J, Hastie T. A working guide to boosted regression trees. Journal of Animal Ecology, 2008; 77:802–813.

37. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.

38. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Belmont, CA: Wadsworth International Group, 1984.

39. Friedman JH. Stochastic gradient boosting. Computational Statistics and Data Analysis, 2002; 38:367–378.

40. Cameron A, Windmeijer F. R-squared measures for count data regression models with applications to health-care utilization. Journal of Business and Economic Statistics, 1996; 14(2):209–220.

41. Akaike H. A new look at statistical model identification. IEEE Transactions on Automatic Control, 1974; AC-19:716–722.

42. Wald DJ, Quitoriano V, Heaton TH, Kanamori H. Relationship between peak ground acceleration, peak ground velocity, and modified Mercalli intensity for earthquakes in California. Earthquake Spectra, 1999; 15(3):557–564.

43. Liu H, Davidson R, Apanasovich T. Spatial generalized linear mixed models of electric power outages due to hurricanes and ice storms. Reliability Engineering and System Safety, 2008; 93(6):897–912.

44. Paradis E, Claude J, Strimmer K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics, 2004; 20:289–290.
