Creating a Serially Complete, National Daily Time Series of Temperature and Precipitation for the Western United States

1580 VOLUME 39J O U R N A L O F A P P L I E D M E T E O R O L O G Y

q 2000 American Meteorological Society

Creating a Serially Complete, National Daily Time Series of Temperature andPrecipitation for the Western United States

JON K. EISCHEID

Cooperative Institute for Research in Environmental Sciences, University of Colorado, Boulder, Colorado

PHIL A. PASTERIS

Natural Resources Conservation Service, Portland, Oregon

HENRY F. DIAZ

NOAA Climate Diagnostics Center, Boulder, Colorado

MARC S. PLANTICO AND NEAL J. LOTT

NOAA National Climatic Data Center, Asheville, North Carolina

(Manuscript received 10 November 1998, in final form 16 October 1999)

ABSTRACT

The development of serially complete (no missing values) daily maximum–minimum temperatures and totalprecipitation time series over the western United States is documented. Several estimation techniques based onspatial objective analysis schemes are used to estimate daily values, with the ‘‘best’’ estimate chosen as a missingvalue replacement. The development of a continuous and complete daily dataset will be useful in a variety ofmeteorological and hydrological research applications.

The spatial interpolation schemes are evaluated separately by interpolation method and calendar month. Crossvalidation of the results indicates a distinct seasonality to the efficiency (error) of the estimates, although nosystematic bias in the estimation procedures was found. The resulting number of serially complete daily timeseries for the western United States (all states west of the Mississippi River) includes 2034 maximum–minimumtemperature stations and 2962 total daily precipitation locations.

1. Introduction

There is great demand in the climate community andmany federal agencies for quality-controlled, seriallycomplete climate datasets for natural resource modeling.With many agencies increasingly relying on models todetermine the appropriate management decisions, theneed for accurate climate datasets for model validationand verification has never been greater.

Traditionally, model users must first identify, thenremove or correct extreme errors and/or missing data.These developers may or may not be knowledgeableabout the intricacies of the data being processed andoften develop algorithms to overcome data problemsthat may introduce additional uncertainties into the data.Uncoordinated and independent data correction and es-

Corresponding author address: Dr. Jon K. Eischeid, CIRES, Uni-versity of Colorado, Campus Box 449, Boulder, CO 80309-0449.E-mail: [email protected]

timation for specific regions or states may be redundant,expensive, and sometimes erroneous. The effect of miss-ing data, or data gaps, in the calculation of monthlymean temperature can result in errors that exhibit tem-poral and spatial patterns (Stooksbury et al. 1999). Thisproject has the objective to create serially complete dailydatasets in a systematic, well-documented fashion thatcan be utilized for many hydrologic and other naturalresource conservation models.

2. Project objective

The objective of this project is to create a seriallycomplete (no missing data values) daily temperature andprecipitation dataset (initially for the period 1951–91)for the United States in support of a wide variety ofecosystem resource models. The completed seriallycomplete daily dataset will be archived at the NationalClimatic Data Center (NCDC) in Asheville, North Car-olina, and will be made available to the Natural Re-sources Conservation Service (NRCS), U.S. Depart-

SEPTEMBER 2000 1581E I S C H E I D E T A L .

FIG. 1. (a) Geographical distribution of daily precipitation mea-surements designated as target stations (2962 stations). (b) Geograph-ical distribution of mean daily maximum–minimum temperature mea-surements designated as target stations (2034 stations).

TABLE 2. Frequency (%) of the number of times each of the in-terpolation methods was chosen as ‘‘best’’ for the two time categories.See section 3 for definitions of the method codes.

Method

NR IDW OI MLAD SBE MED

Category 1, A.M. readerJan precipitationJul precipitationJan max tempJul max tempJan min tempJul min temp

8.215.8

6.77.69.57.6

4.05.23.13.66.21.8

6.810.8

4.00.98.61.8

61.244.183.783.362.584.4

2.31.60.30.61.20.3

17.522.5

2.14.0

12.04.0

Category 2, P.M. readerJan precipitationJul precipitationJan max tempJul max tempJan min tempJul min temp

5.020.1

2.66.97.54.6

2.59.71.62.64.01.2

5.35.16.12.4

11.32.0

67.446.684.683.559.389.1

1.12.50.20.60.40.4

18.816.0

5.05.8

17.52.8

TABLE 1. An example of adjusting estimated data to accumulatedtotals (29999 denotes missing 24-h values).

Station identification number: 140471Adjustment ratio is 1.35 (1.80/1.33)

DateOriginal

estimate (in.)Adjusted

estimate (in.) Reported

25 Aug 197926 Aug 197927 Aug 197928 Aug 197929 Aug 1979Sum

0.030.140.240.000.921.33

0.040.190.320.001.251.80

29999299992999929999299991.80 in.

ment of Agriculture, and the climate community throughthe Unified Climate Data Access Network. The goal isto create serially complete datasets based on approxi-mately 4775 stations available from the NCDC Cli-matography of the United States No. 81—Monthly Sta-tion Normals, 1961–1990 (NCDC CLIM 81) (Owenbyand Ezell 1992). The purpose of this paper is to presentthe basic details/results of the creation of serially com-plete data for states west of the Mississippi River.

This project is funded by the NRCS National Waterand Climate Center located in Portland, Oregon, andcosponsored by the NCDC and the Climate DiagnosticsCenter located in Boulder, Colorado.

3. Project approach

The project uses a multistep approach to process ob-served daily precipitation totals and mean daily maxi-mum and minimum temperatures to create serially com-plete daily datasets. The creation of a serially completedataset includes the replacement of missing daily valuesthrough the use of simultaneous values at nearby sta-tions to calculate an estimated value for that particularday (all days in which an estimate is derived for a miss-ing value are flagged as such). Station histories are re-viewed and appropriate stations are selected and des-ignated as ‘‘target stations’’ for estimation. Data obser-vation times are reconciled and categorized to allowaccurate spatial interpolation from neighboring stationsdetermined to have sufficient record (greater than 10 yr)to provide stable estimation statistics with the targetstations.

Six different methods of spatial interpolation are usedto create the serially complete dataset. The methods aredefined as 1) the normal ratio method (NR), 2) simpleinverse distance weighting (IDW), 3) optimal interpo-lation (OI), 4) multiple regression using the least ab-solute deviation criterion (MLAD), 5) the single best


TABLE 3. Summary statistics for the monthly distribution of observed vs estimated correlation coefficients for maximum temperature. Ingeneral, the poorest estimates are found in the topographically diverse terrain of mountain regions as well as coastal areas of California,Oregon, and Washington.

Month

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Upper hingeMedianLower hinge

0.9750.9570.924

0.9760.9610.932

0.9750.9600.936

0.9720.9570.936

0.9670.9540.934

0.9610.9440.928

0.9460.9260.892

0.9500.9290.898

0.9660.9530.930

0.9700.9580.940

0.9740.9580.932

0.9730.9550.921

Number of stations 2034

FIG. 3. Geographical distribution of the observed vs estimated cor-relation coefficients for maximum temperatures for Jan. The scale ofcorrelation coefficients is inverted (i.e., low R to high R is plotted aslarge to small) to highlight the poorest correlations.

FIG. 2. Monthly distribution of the observed vs estimated correla-tion coefficients (R) for maximum and minimum temperatures.

estimator, and 6) the median (MED) of the previous fivemethods (Eischeid et al. 1995).

Statistical summaries are generated using cross cor-relations between observed daily values and those es-timated for each of the six different methods described.The six techniques respond to variations in season andgeography, and the best estimation method is selectedbased on the efficiency of the estimate over time. Thecross correlations are used to measure the efficiency ofeach method, and the method that exhibits the highestcorrelation relative to the other methods is utilized toreplace missing values. Performance in terms of theroot-mean-square (rms) statistic is also provided.

4. Dataset description and processing

The primary dataset used in this project was theNCDC Summary of Day (TD3200). Quality control per-formed by NCDC on this dataset included a procedure(Reek et al. 1992) that identified and flagged nearly400 000 data discrepancies (the number of discrepanciesis less than 1% of the total observations). To be included

in the serialization procedure, a station could not havemore than 48 missing months of data for the entireperiod of record (1951–91). A month was marked asmissing if it contained more than 14 consecutive daysof missing temperature or precipitation.

The 22 states west of the Mississippi River were ex-amined and 2962 precipitation, 2034 maximum, and2035 minimum temperature reporting stations were se-lected for serialization. The 2962 target precipitationstations are shown in Fig. 1a and the 2034 maximum–minimum target temperature stations are shown in Fig.1b. Although only target stations are shown here, manymore stations were used with less stringent serial re-quirements (i.e., 10 yr of record) in the estimation pro-cedures as a means of enhancing the pool of potentialpredictors. Also, stations from bordering states were ex-tracted to improve the spatial distribution of sites sur-rounding target stations located near state borders. Theresulting total number of stations, including target sta-tions, utilized for each of the estimation methods is 6353


FIG. 4. As in Fig. 3 but for Jul.

FIG. 5. Monthly distribution of the observed vs estimated rmse formaximum and minimum temperatures (8F).

TABLE 4. Summary statistics for the monthly distribution of the rmse between observed and estimated maximum temperatures (8F).

Month



3.963.322.73

3.703.142.61

3.583.032.54

3.522.922.50

3.282.752.35

3.042.562.18

2.832.442.12

2.942.522.19

3.302.792.40

3.462.942.49

3.683.082.58

3.883.252.71


precipitation and 4476–4463 maximum–minimum tem-perature stations.

The 22 states reflect a wide variety of terrain and adiversity of climatic regimes, which allows a means fortesting the efficacy of daily estimates for regional andseasonal differences. In addition, with few exceptions,the geographic distribution across the western states isrelatively uniform, which provides a stable estimationenvironment and a substantial number of serially com-plete stations for natural resource modeling.

Categorization of observation times

All estimation methods are dependent on a clear iden-tification of each station’s time of observation. Becauseobservation times within the Cooperative Network canvary from station to station (and sometimes within sta-tion over the period of record), it is important to de-termine and group stations correctly by their observationtime. The ‘‘true’’ observation time for a station can beinferred from the TD3200 dataset, although observationtime irregularities do exist and must be accounted for.The true observation times for the period 1967–81 wererecovered from the NCDC CLIM-81 dataset (Owenbyand Ezell 1992). Once corrected observation times were

determined, the following observation time categori-zations were made:

category 1: 0500–1100 LST—A.M. reader;category 2: 1200–2000 LST—P.M. reader;category 3: 2100–0400 LST—midnight reader.

Target station estimations were made using the ‘‘cor-responding’’ observation time category from neighborstations. After several target station estimation runs, itwas determined that category-3 stations provided an in-sufficient sample size to provide accurate estimates. Cat-egory-3 stations were then combined with category-2stations for the final runs. Implicit then in the followingestimation analyses, is that estimates are computedtwice: the ‘‘morning’’ and ‘‘evening’’ stations are doneseparately with stations in each category varying byparameter and month. Summary statistics characterizingthe quality of derived daily values are presented col-lectively for the target stations.

5. Estimation methodology

The replacement of missing daily values for maxi-mum–minimum temperatures and total precipitation in-


FIG. 6. Monthly distribution of the t statistic for maximum andminimum temperatures.

TABLE 5. Summary statistics for the monthly distribution of the t statistic between observed and estimated maximum temperatures.

Month



0.3420.1320.68

0.3420.0820.52

0.3320.0920.55

0.3620.0720.50

0.3820.1220.57

0.4520.1120.62

0.5320.1020.74

0.5520.0920.70

0.4120.0820.61

0.3120.1320.60

0.2520.1820.63

0.2720.1920.74


cludes the use of nearby simultaneous values to calculatean estimated value at the target station over the periodof time for which adequate data are available. The ef-ficiency, or accuracy, of the estimates over a long periodof time provides the information used to assess the qual-ity of estimated daily values. Estimated daily values areused in lieu of missing values as a means of making aparticular station serially complete.

There are numerous spatial interpolation methodsavailable for point estimation with irregularly spaceddata. Typically, the choice of methodology is dependenton several factors: the meteorological variable underconsideration, the geographical area, the spatial distri-bution of surrounding observations, and the day–month–season for which the target station is to be es-timated (Schlatter 1975; Bennet et al. 1984; Thiebauxand Pedder 1987).

Using regression-based methods, Kemp et al. (1983)found that the mean absolute error associated with min-imum temperatures was reduced by 50% when com-pared with within-station methods. Additional investi-gations performed by the Northeast Regional ClimateCenter (DeGaetano et al. 1993) have shown that re-

gression-based methods of data estimation tend to bemore accurate than within-station methods. Additionalwork (Huth and Nemesova 1995) has shown that otherweather elements, such as relative humidity, wind speed,and cloudiness, contribute very little to regression-basedmethods and that temperature at neighboring stationshas by far the highest spatial correlations.

DeGaetano goes on to mention that ‘‘while suchmethods are useful over limited areas, they are com-putationally intensive and therefore not feasible whendata estimates are needed for a large number of stationsover a long period of time’’ (DeGaetano et al. 1993).These limitations have been partially overcome with theuse of new high-speed workstations and large mass stor-age capabilities that now provide the horsepower re-quired to perform these intensive calculations in a rea-sonable time period.

Because estimates are required for each day sepa-rately over a variety of terrain with a differing numberof available surrounding observations, we have chosenseveral different methods for testing. This section willfocus on six methods of spatial interpolation. Each ofthe six methods are compared by month for each station,and the one with the highest correlation to the targetstation was chosen as the method used to replace anymissing daily values at that location (Eischeid et al.1995).

In any spatial interpolation scheme the selection andquantity of surrounding stations are critically importantto the results of the interpolations. Problems arise whenusing climatological data because of missing values andthe varying availability of stations through time. In orderto determine which stations are to be used, surroundingstations are preselected based on their relationship withthe target station. The 15 closest stations are identifiedfor each target station and are ranked by the value ofthe correlation coefficient between the candidate stationand its neighbors.

Correlation coefficients are computed for each monthseparately utilizing the daily data. The stations with thelargest positive correlation coefficients, where the min-imum criterion is an r of at least 0.35, are subsequentlyused in the estimation procedures. A minimum of onestation is needed to compute the estimate at the targetstation, with a maximum of four. Tests have shown thatinclusion of more than four stations does not signifi-cantly improve the interpolation and may in fact degradethe estimate.


TABLE 6. Summary statistics for the monthly distribution of observed vs estimated correlation coefficients for minimum temperature.

Month



0.9670.9480.911

0.9650.9440.906

0.9610.9350.884

0.9490.9180.863

0.9420.9120.859

0.9260.8950.846

0.9020.8560.785

0.9110.8700.810

0.9480.9160.867

0.9480.9160.877

0.9590.9360.896

0.9630.9420.902


FIG. 7. Geographical distribution of the observed vs estimated cor-relation coefficients for minimum temperatures for Jan. The scale ofcorrelation coefficients is inverted (i.e., low R to high R is plotted aslarge to small) to highlight the poorest correlations.

The number (never greater than four) of neighboringstations meeting the criteria is not fixed in time. It variesdepending on available station data for the year/month/day in question. As such, the interpolation models mayalso change in time. Moreover, the surrounding stationsthat may be optimal for a particular calendar month(e.g., January) may not be optimal for a different month(e.g., July). Thus, the station selection procedures arecomputed for each calendar month separately. The se-lected neighboring stations are identical for each of thesix interpolation methods. A brief description of eachmethod is given below.

a. Normal ratio method

The normal ratio (NR) method of spatial interpolationwas first proposed by Paulhus and Kohler (1952). Thecurrent analysis uses a modified version, which is de-scribed by Young (1992). Weights for the surroundingstations used in the estimation algorithm are found ac-cording to

2r (n 2 2)i iW 5 , (1)i 21 2 ri

where r is the correlation coefficient for each daily timeseries between the target station and the ith surroundingstation, n is the number of points used to derive thecorrelation coefficient, and Wi is the resultant weight.

b. Inverse distance method

The inverse distance method (IDW) is a simple dis-tance-weighted ‘‘area average’’ estimate of the value atthe target station. The assumption here is that surround-ing stations are related to the target station by theirproximity to the target station. This procedure is givenby

n

W ZO i ii51Z 5 , (2)n

WO ii51

where Zi is the particular monthly anomaly at the ithsurrounding station, and the weight function Wi is de-rived from the inverse of the distance from the targetstation to the ith surrounding station.

c. Optimal interpolation

Early uses of optimal interpolation (OI) in meteorol-ogy may be traced back to Gandin [see, e.g. Gandin andKagan (1974)]. Since that time it has had wide usagein climatology and meteorology. In most applicationsOI is used to estimate values at a target site, for example,a grid point. Here we use a univariate OI to estimatevalues at a known station location. Optimal interpolationis a spatial interpolation technique that assigns weightsto the observed difference values (observed minus firstguess) at the selected neighboring station locations,

n

Z 5 Z 1 W (Z 2 Z ), (3)Of i oi fii51

where Zoi and Zfi are the observed and first-guess valuesat the ith neighbor station, respectively, and Zf is thefirst-guess value at the target station location being es-timated. The first-guess values used in this particularapplication are concurrent values from the closest sta-tion with the highest correlation. The weighting coef-ficients Wi are determined in an objective manner suchthat the rms error (rmse) of the analyzed difference val-ues at each target location is minimized over the spatialdomain.


TABLE 7. Summary statistics for the monthly distribution of the rmse between observed and estimated minimum temperatures (8F).

Month



4.583.623.00

4.243.472.89

3.943.292.78

3.663.112.71

3.432.912.54

3.342.782.36

3.192.682.22

3.182.692.26

3.472.962.56

3.703.182.78

4.043.352.85

4.493.592.98



The weights are dependent upon the spatial autocor-relations among the surrounding observation values andare typically modeled mathematically as a function ofdistance separating the neighbors and the target location.Rather than model each spatial domain (the area sur-rounding each target station) individually, and to com-pensate for possible anisotropic fields, we use the actualrelationships among all observations by directly usingthe calculated correlation coefficients.

Once the correlation coefficients are known, theweights needed to solve Eq. (4) are given by the solutionof a system of linear algebraic equations. In matrix formthis can be written as

WiCir 5 Gr, (4)

where i 5 1, 2, . . . , N (number of surrounding obser-vations) whose coefficients are given by the correlationwith selected neighboring station observations (C, n 3n) and correlation coefficients from the target station toeach of the surrounding stations (G, 1 3 n vector).Because the empirical correlation coefficients are used,the solution matrix could fail to be positive definite, but,in practice, this was a rare occurrence.

d. Multiple regression, least absolute deviationscriteria

The method of multiple regression using the leastabsolute deviations criteria (MLAD) is a robust versionof the general linear least squares estimation. The meth-od of least squares is an effective method when theerrors are normally distributed and independent. How-ever, for precipitation data especially, the assumption ofnormality over the wide range of situations can lead topoor estimations. The principal advantage of least ab-solute deviations is its resistance to outliers and to over-emphasis of large-tailed distributions (Barrodale andRoberts 1973).

MLAD estimates the unknown parameters in a sto-chastic model so as to minimize the sum of absolutedeviations of the neighboring station observations fromthe values predicted by the model. Regression coeffi-cients b are calculated so as to minimize

x b 2 Y , (5)O O i j j i) )i j

where x, i 5 1, 2, . . . , m and j 5 1, 2, . . . , n denotea set of n measurements on m surrounding stations (in-dependent variables), and y, i 5 1, 2, . . . , k denote theassociated measurement on the dependent (target sta-tion) value. The linear programming techniques of Bar-rodale and Roberts (1973) are used to accomplish thistask.

e. Single best estimator

The single best estimator (SBE) is simple and anal-ogous to using the closest neighboring station as anestimate for the target station. The target station is es-timated using the actual observed value from the neigh-boring station that has the highest positive correlationwith the target station.

f. Median

The median method (MED) is not a true interpolationmodel but is simply the median value obtained from theabove five estimation methods. By using the median weimplicitly account for the estimation formula to changeover time, which may yield a better long-term estimate.


TABLE 8. Summary statistics for the monthly distribution of the t statistic between observed and estimated minimum temperatures. As aresult, a greater number of stations are found to be statistically significant at the 0.5% level. In the aggregate, the estimation of minimumdaily temperatures is less effective than that shown for daily maximum temperatures.

Month



0.890.00

20.91

0.700.00

20.66

0.6120.0320.76

0.5020.0320.70

0.5520.0320.68

0.5520.0620.66

0.740.00

20.85

0.7320.0320.83

0.5420.0420.76

0.5020.1220.82

0.6220.0420.72

0.8120.0620.97


FIG. 9. Monthly distribution of the observed vs estimated correla-tion coefficients (R) for precipitation.

6. Basic continuity checks

After estimating daily maximum and minimum tem-peratures, a series of internal consistency checks wereperformed to ensure that estimates did not violate ob-vious constraints associated with recording maximumand minimum temperatures. Typical tests include iden-tifying estimated maximum temperatures lower than aprevious day’s minimum and an estimated maximumlower than a minimum for the same day.

These inconsistencies were corrected by assigningcorrected maximums or minimums where appropriateor averaging the maximum and minimum temperaturesfor the previous and subsequent days.

Distribution of accumulated precipitation

One of the more vexing problems associated withcreating a serially complete daily precipitation time se-ries is the distribution of multiday (accumulated) pre-cipitation totals. These values are generally flagged withan ‘‘A’’ and do provide a valuable target value for theestimation procedure, because a primary goal of theestimation procedure is to not bias (neither increase ordecrease) the observed monthly precipitation total.

Several adjustment methods exist; however, the sim-plest and most computationally efficient was to estimateindependently the individual missing days and also theday shown with an accumulated amount, sum the es-timates for the same period, and take the ratio of theobserved accumulated to the sum of the estimatedamounts.

The ratio would be applied to the daily estimates forthe corresponding period to match the observed accu-mulated total flagged as A. The ratio method can in-crease or decrease the daily estimates and is constrainedby the observed accumulated precipitation total. Table1 illustrates the procedure for a station in Kansas forwhich the measured precipitation was reported as 1.80in. on 29 August as an accumulation of the previous 5days. The adjustment ratio is derived from the ratio ofthe measured total to the estimated total for the same 5days (in the example the estimated total is 1.33 in.).

7. Description of files created

The estimation procedure produces three files: 1) theoriginal data values, 2) the estimated values, and 3) themerged original values and estimates along with a flagthat indicates the method used to estimate the value.These files were analyzed to produce the statistics dis-cussed in the next section.

8. General results

In order to test the efficacy of the estimation tech-niques, each of the six interpolation methods is com-pared with respective nonmissing observations. Foreach station six estimates are computed for the entireperiod of record as if all observations were missing.Comparisons among the techniques are then based onthe correlation coefficient R between the original ob-served daily values and the corresponding estimated val-ues. At each station, for each month, six correlationcoefficients are calculated and ranked. The method thatexhibits the largest R is considered to be the most rep-resentative and is used to determine which estimates areused for further analysis in addition to replacement ofmissing values. In Table 2 we show these results, as apercentage of the total, for all stations in the study. Ingeneral, stations may, and quite often do, require theuse of different models depending on the month in ques-


TABLE 9. Summary statistics for the monthly distribution of observed vs estimated correlation coefficients for precipitation.

Month



0.9100.8570.783

0.9140.8640.787

0.9100.8610.787

0.9000.8570.792

0.8700.8260.758

0.8310.7820.707

0.8000.7160.615

0.8190.7470.638

0.8820.8330.765

0.9180.8790.817

0.9200.8730.804

0.9190.8670.794



FIG. 10. Geographical distribution of the observed vs estimatedcorrelation coefficients for daily precipitation for Jan. The scale ofcorrelation coefficients is inverted (i.e., low R to high R is plotted aslarge to small) to highlight the poorest correlations.

tion though it is clear that the MLAD interpolation meth-od outperforms the other five estimation techniques. Ad-ditionally, there is no apparent time-of-observation biaswith regard to method, that is, there was no significantdifference in the observed versus estimated error formorning times versus afternoon/evening times. A flagindicating which method was used to estimate the value,and the corresponding correlation coefficient, are pro-vided with the metadata. In general our concern washow well we were able to estimate daily temperatureand precipitation values. A description of the accuracy/efficiency of each of the methods is not presented here[see Eischeid et al. (1995) for an analysis of these meth-ods for monthly time series of temperature and precip-itation]. Rather, the following section outlines the ef-ficacy of the ‘‘best’’ estimate regardless of which esti-mating method was chosen. It should be noted that,although the results are presented on a monthly basis,the summary statistics are computed from daily values.

a. Maximum temperature

The efficiency of the best estimation is summarizedin Fig. 2 via the range of observed versus estimated

correlations found for all maximum temperature re-porting stations. The median correlation is above 0.90for all months ranging from a low of 0.93 in July andAugust to a high of 0.96 during the spring and autumntransition seasons (see Table 3). The spatial distributionof the maximum temperature correlations for all stationsis shown in Figs. 3 and 4 for January and July, respec-tively.

In general, the poorest estimates are found in the to-pographically diverse terrain of mountain regions aswell as coastal areas of California, Oregon, and Wash-ington. Conversely, the best estimates are obtained inareas of relatively even terrain where spatial relation-ships among surrounding stations (those used in the es-timation procedure) are stronger and cover a greaterareal extent than mountainous regions. In addition, theregions that exhibit the best correlations typically havea better, less scattered, spatial distribution of surround-ing stations.

The distributions of the rmse between the observedand estimated series are presented in Fig. 5 and Table4 for each month. For nearly all stations the rmse is lessthan 58F in all months with the lowest values shown forthe summer months (July median of 2.48F) and the larg-est during winter (January median of 3.38F). The geo-graphic areas of small/large rmse are generally analo-


FIG. 12. Monthly distribution of the observed vs estimated rmsefor precipitation (inches).

FIG. 13. Monthly distribution of the observed vs estimated meandaily percentage error (MDPE) for precipitation (%).

TABLE 10. Summary statistics for the monthly distribution of the rmse between observed and estimated precipitation (in.).

Month



0.1050.0600.039

0.1120.0630.043

0.1230.0840.060

0.1470.1000.072

0.2110.1440.084

0.2390.1850.085

0.2350.1880.089

0.2240.1730.090

0.2010.1290.077

0.1510.0930.063

0.1250.0770.054

0.1160.0660.043


gous to regions where the correlations are high/low (seeFigs. 3 and 4).

A simple t test was computed for observed and es-timated maximum temperature for the 2034 stations andthe range of monthly t values is summarized in Fig. 6and Table 5. The range of the t statistic is typicallylarger, based on previous results, for the summer monthsand smallest during winter. Although a number of sta-tions reveal significant differences, not unexpected giv-en the large sample size, the preponderance of stationsshowed no statistically significant difference among theobserved and estimated daily means. The spatial dis-tribution (not shown) of the t statistic at the 0.5% levelis spread evenly among the 22 states. No significant biaswas detected in any of the months with respect to over-or underestimating observed values.

b. Minimum temperature

Analysis of minimum daily temperatures producedresults broadly similar to the maximum daily temper-atures. For the most part the distributions of observedversus estimated correlations (Fig. 2, Table 6) exhibit agreater range in values relative to results shown formaximum temperatures (Tables 5–7).

The medians are also lower but follow the same pat-

tern of poorer estimates during summer months (Julymedian of 0.86) as compared with winter (January me-dian of 0.95). The geographical pattern of minimumtemperature correlations (Figs. 7 and 8) is analogous tothat described for the maximum temperatures. Here, thepoorer estimates for January are concentrated in a regionencompassing Arizona–New Mexico–Colorado with theeffect much more pronounced in July.

The monthly distributions of the rmse for the mini-mum daily temperatures shown in Fig. 5 (Table 7) par-allel the monthly pattern shown for the maximum tem-peratures. Although the seasonal pattern of lower/higherrmse for the summer/winter months may be the same,the range of values is typically larger. The patterns ofvalues for the rmse statistic for minimum temperaturesare consistent with those for maximum temperatures. Itshould be noted that although the spatial pattern of sta-tions is similar, the interquartile range (IR) of the rmsefor all minimum temperature stations has a larger am-plitude relative to identical values computed for max-imum temperature.

Examination shows the monthly distribution of the tstatistic for observed versus estimated minimum dailytemperatures is broadly consistent with the pattern formaximum temperatures, with the results describedabove (Fig. 6, Table 8). The IR is larger than that shown


TABLE 11. Summary statistics for the monthly distribution of the mean daily percentage error (MDPE) between observed and estimatedprecipitation (%).

Month



1.501.070.738

1.701.190.805

1.400.9430.644

1.430.9540.681

1.350.8710.667

1.711.140.854

2.081.451.09

2.051.471.07

1.701.250.932

1.491.090.799

1.511.070.759

1.521.030.699


for maximum temperatures although median values arenear zero.

As a result, a greater number of stations are found tobe statistically significant at the 0.5% level. In the ag-gregate, the estimation of minimum daily temperaturesis less effective than that shown for daily maximumtemperatures.

c. Total precipitation

As with temperature, the efficacy of the best estimatefor daily precipitation is assessed via the relationshipsbetween the observed and estimated daily values. Thedistribution of correlation coefficients, as an indicatorof the quality of the estimate for each station, is sum-marized in Fig. 9 and Table 9.

As expected, the range of values of R is much greaterthan that shown by maximum–minimum temperatures,and overall values are lower although the seasonal pat-tern remains the same: larger IR and comparatively low-er values during summer (July median of 0.72) as op-posed to the transition seasons and winter (January me-dian of 0.86). In general, the worst estimates are foundin the mountainous regions of the western United Statesfor both January and July (Figs. 10 and 11). As withmaximum–minimum temperature, the spatial pattern ofhigh–low correlations is much less pronounced in Jan-uary than for July. The topographical diversity and thelower density of stations in this region result in poorerestimations. Conversely, the best estimates are found inthe lower elevations of the data rich regions, for ex-ample, the western coasts and eastern Kansas–Nebraska.The accuracy of the daily precipitation estimate is de-pendent on the quality and quantity of the surroundingstations utilized to estimate a value at a particular site.The same is true for maximum–minimum temperaturesbut the determination of daily precipitation totals ismuch more sensitive to these factors, particularly withregard to the elevation of the site to be estimated.

Investigation of the rmse for observed versus esti-mated daily precipitation for all stations (Fig. 12, Table10) yields results consistent with the seasonal patternnoted previously. The median rmse is less than 0.1 in.for all months except May–September, with peak valuesin June (0.18 in.) and July (0.15 in.). In addition, theIR is much larger for the months May–September. Thelocation of stations that exhibit large/small rmse, rela-tive to all other stations, is not a reliable indicator of

the absolute quality of a station’s estimate, because ofdifferent climatic regimes. The pattern of large/smallrmse is a reflection of the regional climate and is pre-served in the observed/estimated differences. In otherwords, areas with climatically higher (lower) monthlyrainfall would typically have higher (lower) rmse val-ues. It was felt that a relative measure of precipitationestimation efficiency was needed. A statistic that wouldprovide the kind of information that the t test did forthe temperature field is described below.

The determination of whether the sample of estimateddaily precipitation values is significantly different fromobserved totals is problematic, particularly if a simplet test is used. Precipitation distributions generally ex-hibit positive skewness (values tend to cluster about themedian error rather than the mean) and thus do not lendthemselves well to parametric tests designed to testmean differences. For this reason, a simple ratio testwas employed to compare the observed versus estimatedprecipitation values and is described below.

At each station, for individual year/months the fol-lowing is computed:

n |E 2 O |O i i

i51 M 100, (6) N @

where in a series of N days for a particular year/monthEi represents the ith estimate for the month and Oi thecorresponding ith observation. The average absolute er-ror for the month is then normalized by the total pre-cipitation M for that month and expressed as a per-centage. The result is a simple estimate of the meandaily absolute error standardized to reflect the secularvariability of monthly precipitation totals for a widevariety of stations with differing seasonal cycles. Theyear/month ratios (expressed as percentage) are sum-marized for each station with the range of values for allprecipitation reporting stations presented in Fig. 13 andTable 11. These values will be referred to as the meandaily percentage error (MDPE).

The monthly distribution of the range of MDPEamong stations is consistent with that shown by therange of correlation coefficients (Fig. 9) and the seasonaldistribution of rmse (Fig. 12). Although, as expected,median MDPE values are higher for summer months(July median of 1.5%) than winter (January median of1.1%), the differences are small. Of note is the relative


similarity of the IR between months; this feature is notpresent for the seasonal range of correlation coefficientsand the rmse. The lack of a pronounced seasonal cycle,that is, the parameters of the monthly frequency distri-butions are close to each other, suggests that the MDPEstatistic is a more stable indicator of the efficiency ofthe estimate than that shown by the analysis of the cor-relation coefficient or the rmse.

9. Summary

This paper summarizes a set of procedures used tocreate serially complete daily temperature and precipi-tation datasets (1951–91) for the western United States.Determining target and estimator stations by scanningthe quality of individual station records, reconciling me-tadata (including observation times and station loca-tions), and categorizing observation times proved to betime consuming but necessary. Estimating the missingdata values and cross validating the results proved tobe relatively straightforward once preparatory work wascompleted.

Our results show that the efficacy of the estimationprocedure and thus the reliability of the estimated miss-ing values are dependent on a number of factors. Forall three meteorological parameters the selection andquantity of surrounding stations are critically importantto the results of the interpolations. We feel that the pre-selection of surrounding stations, based on their rela-tionship with the station to be estimated, is an integralfirst step.

The quality of the estimates is strongly affected byseasonality—more so for daily precipitation as opposedto maximum–minimum daily temperatures. This effectis most severe during the summer months. Because ofthe changing spatial relationships among the tempera-ture and precipitation fields, the derivation of the equa-tions for estimation should be done separately for theseason–month–day to be estimated. The quality of thedaily estimates also has a spatial component. Stationsat higher elevations are difficult to estimate accurately,in large part because of the topographical diversity ofthe surrounding stations leading to degradation of spa-tial coherence among stations.

In the aggregate, statistical tests of the resultant ob-served versus estimated time series show no systematicbias in the estimation procedures, particularly where theterrain and the density of surrounding stations is rela-tively uniform. In areas where the complexity of terrain(coastal, mountainous) dominates, the user should beaware of the limitations of the daily estimates. We feelthat our methods produce the best possible estimates forall stations for a variety of conditions, but the error or

bias at an individual site should be evaluated on a case-by-case basis. To assist the users in their own evaluationof the accuracy of individual estimates, the performancestatistics for each individual station series are availableas metadata with the serial dataset.

Acknowledgments. The authors acknowledge the con-tributions of Dick Cram, National Climatic Data Center,for his enlightened and tenacious review of the estimatesproduced by this project. We will miss his commitmentand knowledge in the field of climatology. The com-ments of the three official reviewers were valuable andconstructive.

REFERENCES

Barrodale, I., and F. D. K. Roberts, 1973: An improved algorithm fordiscrete L1 approximation. SIAM J. Numer. Anal., 10, 839–848.

Bennet, R. J., R. P. Haining, and D. A. Griffith, 1984: The problemof missing data on spatial surfaces. Ann. Assoc. Amer. Geogr.,74, 138–156.

DeGaetano, A. T., K. L. Eggleston, and W. W. Knapp, 1993: A methodto produce serially complete daily maximum and minimum tem-perature data for the Northeast. NRCC Research Publication RR93-2, 9 pp. [Available from NRCC, Cornell University, Ithaca,NY 14853.]

Eischeid, J. K., C. B. Baker, T. R. Karl, and H. F. Diaz, 1995: Thequality control of long-term climatological data using objectivedata analysis. J. Appl. Meteor., 34, 2787–2795.

Gandin, L. S., and R. L. Kagan, 1974: Construction of the system ofheterogeneous data objective analysis based on the method ofoptimal interpolation and optimal agreement (in Russian). Me-teor. Gidrol., 5, 3–10.

Huth, R., and I. Nemesova, 1995: Estimation of missing daily tem-peratures: Can a weather categorization improve its accuracy?J. Climate, 8, 1901–1916.

Kemp, W. P., D. G. Burnell, D. O. Everson, and A. J. Thomson, 1983:Estimating missing daily maximum and minimum temperatures.J. Climate Appl. Meteor., 22, 1587–1593.

Owenby, J. R., and D. S. Ezell, 1992: Monthly station normals oftemperature, precipitation, and heating and cooling degree days,1961–90. Climatography of the United States. No. 81, NationalClimatic Data Center. [Available from NCDC, Asheville, NC28801.]

Paulhus, J. L. H., and M. A. Kohler, 1952: Interpolation of missingprecipitation records. Mon. Wea. Rev., 80, 129–133.

Reek, T. S, S. R. Doty, and T. W. Owen, 1992: A deterministic ap-proach to validation of historical daily temperature and precip-itation data from the cooperative network. Bull. Amer. Meteor.Soc., 73, 753–762.

Schlatter, T. W., 1975: Some experiments with a multivariate objectiveanalysis scheme. Mon. Wea. Rev., 103, 246–257.

Stooksbury, D. E., C. D. Idso, and K. G. Hubbard, 1999: The effectsof data gaps on the calculated monthly mean maximum andminimum temperatures in the continental United States: A spatialand temporal study. J. Climate, 12, 1524–1533.

Thiebaux, H. J., and M. A. Pedder, 1987: Spatial Objective Analysiswith Applications in Atmospheric Science. Academic Press, 299pp.

Young, K. C., 1992: A three-way model for interpolating monthlyprecipitation values. Mon. Wea. Rev., 120, 2561–2569.

Documents

Creating a Serially Complete, National Daily Time Series of Temperature and Precipitation for the Western United States