Verifying high-resolution forecasts
Advanced Forecasting Techniques
Forecast Evaluation and Decision Analysis, METR 5803
Guest Lecture: Adam J. Clark, 3 March 2011
Outline
• Background
• Clark, A. J., W. A. Gallus, M. L. Weisman, 2010: Neighborhood-based verification of precipitation forecasts from convection-allowing NCAR-WRF model simulations and the operational NAM.
• Clark, A. J., and Coauthors, 2011: Probabilistic precipitation forecast skill as a function of ensemble size and spatial scale in a convection-allowing ensemble.
Why is verifying high-resolution forecasts challenging?
• Scale/predictability issues – as finer scales are resolved, predictability time scales shorten. This was first shown mathematically by Ed Lorenz:
• Lorenz, E. N., 1969: The predictability of a flow which possesses many scales of motion. Tellus, 21, 289–307.
Challenges (cont.)
• Observations – many fields are not directly observed at high resolution. Thus, models and data assimilation systems are needed to create high-resolution analysis datasets. Of course, this introduces uncertainties that should be accounted for when doing model evaluation.
• At fine scales, forecasts with “small errors” can still be extremely useful. For example…
Challenges (cont.)
Most subjective evaluations would say forecast #2 was “better”, although all objective metrics indicate forecast #1 was better. How can we develop metrics consistent with human subjective impressions?
[Figure: OBS vs. forecasts #1 and #2 (credit: Baldwin et al. 2001)]
“Non-traditional” approaches
• Purely subjective – good, fair, poor, ugly; scale of 1 to 10, etc.
• Combination of subjective and objective – MCS example… manually categorize possible forecast outcomes into 2x2 contingency tables and then compute objective metrics.
• Objective methods (Casati et al. 2008 has a nice review):
– Feature-based (Ebert and McBride 2000; Davis et al. 2006) – involves defining attributes of “objects”.
– Scale decomposition (e.g., Casati et al. 2004) – evaluates skill as a function of amplitude and spatial scale of the errors.
– Neighborhood-based approaches (e.g., Ebert 2009) – consider values at grid points within a specified radius (i.e., a “neighborhood”) of an observation.
References so far…
• Casati, B., G. Ross, and D. B. Stephenson, 2004: A new intensity-scale approach for the verification of spatial precipitation forecasts. Meteor. Appl., 11, 141-154.
• Casati, B., and Coauthors, 2008: Forecast verification: Current status and future directions. Meteor. Appl., 15, 3-18.
• Baldwin, M. E., S. Lakshmivarahan, and J. S. Kain, 2001: Verification of mesoscale features in NWP models. Preprints, Ninth Conf. on Mesoscale Processes, Fort Lauderdale, FL, Amer. Meteor. Soc., 255-258.
• Ebert, E. E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. J. Hydrol., 239, 179–202.
• Ebert, E. E., 2009: Neighborhood verification: A strategy for rewarding close forecasts. Wea. Forecasting, 24, 1498-1510.
• Davis, C. A., B. Brown, and R. Bullock, 2006: Object-based verification of precipitation forecasts. Part II: Application to convective rain systems. Mon. Wea. Rev., 134, 1785–1795.
Neighborhood ETS applied to CAMs
• What was done and why…
– Examined a large set of convection-allowing forecasts run by NCAR during 2004-2008 and compared them to the operational NAM.
– Earlier work (e.g., Done et al. 2004) showed skill scores between coarser models (~10 km) and the convection-allowing NCAR-WRF were almost identical.
– This contradicted other findings… (Done et al. 2004, Fig. 5)
Neighborhood ETS (cont.)
• Perhaps the lack of differences was simply a result of inadequate “traditional metrics”.
• When forecasts contain fine-scale, high-amplitude features, slight displacement errors cause “double penalties” – observed-but-not-forecast and forecast-but-not-observed.
• Thus, many recent studies have developed alternative metrics:
– Feature-based (Ebert and McBride 2000; Davis et al. 2006)
– Scale-decomposition (Casati et al. 2004)
– Neighborhood-based (Roberts and Lean 2008; Ebert 2009)
Neighborhood ETS (cont.)
• To compare the NCAR-WRF and NAM forecasts, a neighborhood-based Equitable Threat Score (ETS) was developed.
• Traditional ETS is formulated in terms of contingency table elements (a = hits, b = misses, c = false alarms, d = correct negatives; N = a + b + c + d):

    ETS = (a - a_r) / (a + b + c - a_r), where a_r = (a + b)(a + c) / N

and is interpreted as the fraction of correctly predicted observed events, adjusted for hits associated with random chance.
• Neighborhood ETS is computed in terms of specified radii r – this study uses 20 to 250 km.
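A quick sketch of the traditional ETS computation described above, using toy contingency-table counts (the numbers are illustrative only):

```python
def ets(hits, misses, false_alarms, correct_negatives):
    """Equitable Threat Score from 2x2 contingency table counts."""
    n = hits + misses + false_alarms + correct_negatives
    # Expected number of hits from random chance, given the marginal totals
    hits_random = (hits + misses) * (hits + false_alarms) / n
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)

print(round(ets(50, 25, 25, 900), 2))  # prints 0.47
```

The neighborhood version uses exactly the same formula; only the counting of hits, misses, and false alarms changes.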
[Figure: schematic neighborhood – open circles: observed events; crosses: forecast events; grey shading: hits; r = 100 km, 81 total grid points]
Neighborhood ETS (cont.)
• To compute neighborhood ETS…
– Simply redefine contingency table elements in terms of r:
• Hits (traditional) – correct forecast of an event at a grid point.
• Hits (neighborhood) – an event is forecast at a grid point and observed at any grid point within r, or an event is observed at a grid point and forecast at any grid point within r.
• Misses (traditional) – event is observed at a grid point but not forecast.
• Misses (neighborhood) – event is observed at a grid point but not forecast at any grid point within r.
• False alarms (traditional) – event is forecast at a grid point but not observed.
• False alarms (neighborhood) – event is forecast at a grid point but not observed at any grid point within r.
• Correct negatives (traditional and neighborhood) – event is neither forecast nor observed at a grid point.
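A minimal sketch of the neighborhood counting rules above, on small binary grids (the brute-force circular search and the toy fields are illustrative; r is given in grid points rather than km):

```python
import numpy as np

def within_r(field, i, j, r):
    """True if any event in the binary `field` lies within radius r (grid points) of (i, j)."""
    ny, nx = field.shape
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            if di * di + dj * dj > r * r:
                continue  # outside the circular neighborhood
            ii, jj = i + di, j + dj
            if 0 <= ii < ny and 0 <= jj < nx and field[ii, jj]:
                return True
    return False

def neighborhood_counts(fcst, obs, r):
    """Neighborhood contingency counts: (hits, misses, false alarms, correct negatives)."""
    hits = misses = fas = cns = 0
    ny, nx = obs.shape
    for i in range(ny):
        for j in range(nx):
            f, o = fcst[i, j], obs[i, j]
            if f and within_r(obs, i, j, r):
                hits += 1      # forecast here, observed somewhere within r
            elif o and within_r(fcst, i, j, r):
                hits += 1      # observed here, forecast somewhere within r
            elif o:
                misses += 1    # observed here, no forecast within r
            elif f:
                fas += 1       # forecast here, no observation within r
            else:
                cns += 1       # neither forecast nor observed here
    return hits, misses, fas, cns

# Toy example: one forecast event displaced 4 grid points from one observed event
fcst = np.zeros((9, 9), dtype=bool); fcst[4, 2] = True
obs = np.zeros((9, 9), dtype=bool); obs[4, 6] = True
print(neighborhood_counts(fcst, obs, r=1))  # (0, 1, 1, 79)
print(neighborhood_counts(fcst, obs, r=5))  # (2, 0, 0, 79)
```

Note how a displaced feature flips from a miss plus a false alarm at small r to two hits at large r, which is exactly how the neighborhood approach relaxes the double penalty.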
Neighborhood ETS (cont)
• Domain and cases
Neighborhood ETS (cont.)
• Model configurations…
Neighborhood ETS (cont.)
• Results – 2004-2005
Neighborhood ETS (cont.)
• Results – 2007-2008
Neighborhood ETS (cont.)
• Time series at constant radii…
An aside: computing statistical significance for categorical metrics
• First, what are the ways to compute averages of categorical metrics over a set of cases?
– 1) Compute the metric for each case and then average over all cases.
– 2) Sum contingency table elements over all cases, and then compute the metric from the summed elements.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155-167.
• You have to be very careful computing statistical significance for precipitation forecasts! Key quote: “Precipitation forecast errors are often non-normally distributed and have spatially and/or temporally correlated error; the effective number of independent samples is much less than the total number of grid-points.”
• What does this mean??
• Hamill outlines several potential approaches for computing significance for threat scores. He finds a rather involved “resampling” approach is most appropriate for non-probabilistic forecasts.
Resampling

         Forecast 1    Forecast 2
Case 1   a, b, c, d    a, b, c, d
Case 2   a, b, c, d    a, b, c, d
Case 3   a, b, c, d    a, b, c, d
Case 4   a, b, c, d    a, b, c, d
Case 5   a, b, c, d    a, b, c, d
Case 6   a, b, c, d    a, b, c, d
Case 7   a, b, c, d    a, b, c, d
            ETS           ETS
Difference between unshuffled ETS is test statistic
         Forecast 1    Forecast 2
Case 1   a, b, c, d    a, b, c, d
Case 2   a, b, c, d    a, b, c, d
Case 3   a, b, c, d    a, b, c, d
Case 4   a, b, c, d    a, b, c, d
Case 5   a, b, c, d    a, b, c, d
Case 6   a, b, c, d    a, b, c, d
Case 7   a, b, c, d    a, b, c, d
            ETS           ETS
Randomly shuffle the sets of contingency table elements for each case and recompute the ETS difference. Repeat 1000 times. Construct a distribution of ETS differences and test whether the test statistic falls within that distribution. Don’t forget about bias!!
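The resampling procedure above can be sketched as a simple permutation test (the toy contingency values and the two-sided counting rule are illustrative choices, not Hamill's exact code):

```python
import random

def ets(a, b, c, d):
    """ETS from elements (a=hits, b=misses, c=false alarms, d=correct negatives)."""
    a_r = (a + b) * (a + c) / (a + b + c + d)
    return (a - a_r) / (a + b + c - a_r)

def ets_from_cases(cases):
    """Averaging method 2: sum elements over all cases, then compute ETS."""
    a, b, c, d = (sum(t[i] for t in cases) for i in range(4))
    return ets(a, b, c, d)

def resampling_pvalue(cases1, cases2, n_shuffles=1000, seed=0):
    """For each case, randomly swap which model the elements belong to,
    recompute the ETS difference, and compare to the unshuffled test statistic."""
    rng = random.Random(seed)
    observed = ets_from_cases(cases1) - ets_from_cases(cases2)
    extreme = 0
    for _ in range(n_shuffles):
        s1, s2 = [], []
        for t1, t2 in zip(cases1, cases2):
            if rng.random() < 0.5:
                t1, t2 = t2, t1  # shuffle: swap the two models for this case
            s1.append(t1)
            s2.append(t2)
        if abs(ets_from_cases(s1) - ets_from_cases(s2)) >= abs(observed):
            extreme += 1
    return extreme / n_shuffles  # two-sided p-value

# Toy data: three cases of (a, b, c, d) for each forecast system
cases1 = [(40, 20, 25, 915), (35, 30, 20, 915), (50, 15, 30, 905)]
cases2 = [(30, 30, 35, 905), (25, 40, 30, 905), (35, 30, 40, 895)]
print(resampling_pvalue(cases1, cases2))
```

As the slide warns, the frequency bias of the two forecasts should also be checked, since an apparent ETS difference can be an artifact of a bias difference.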
Neighborhood ETS: Compositing
• OFi is the observed frequency for each grid point within a conditional distribution of observed events.
• N is the # of grid points in the domain.
• n is the # of grid points within the radius-of-influence.
Neighborhood ETS (cont): Compositing
• Composite observations relative to forecasts.
Neighborhood ETS (cont.): Compositing
Neighborhood ETS (cont)
• Composite animations from SE2010 SSEF members with different microphysics…
Neighborhood ETS (cont.)
• Conclusions…
– Computing ETS on raw grids gave small differences. When the criterion for hits was relaxed using the “neighborhood” approach, more dramatic differences were seen that better reflected overall subjective impressions of the forecasts.
Probabilistic Precipitation Forecast Skill as a function of ensemble size and spatial scale in a convection-allowing ensemble
Adam J. Clark – NOAA/National Severe Storms Laboratory
John S. Kain, David J. Stensrud, Ming Xue, Fanyou Kong, Michael C. Coniglio, Kevin W. Thomas, Yunheng Wang, Keith Brewster, Jidong Gao, Steven J. Weiss, and Jun Du
13 October 2010
25th Conference on Severe Local Storms, Denver, CO
Introduction/motivation
• Basic idea
– Convection-allowing ensembles (~4-km grid spacing) provide added value relative to operational systems that parameterize convection:
• precipitation forecasts are improved,
• the diurnal rainfall cycle is better depicted,
• ensemble dispersion is more representative of forecast uncertainty, and
• explicitly simulated storm attributes can provide useful information on potential convection-related hazards.
– However, convection-allowing models are very computationally expensive relative to current operational “mesoscale” ensembles.
– EXAMPLE: Back-of-the-envelope calculation of the expense to make the SREF convection-allowing:
• 32 km to 4 km is a decrease in grid spacing by a factor of 8. To account for the increase in # of grid points and the time-step reduction, take 8^3 = 512 (~500), which gives an increase in computational expense by a factor of ~500. Then take 500 times 21 members = 10,500.
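The scaling arithmetic above can be checked directly (the slide rounds 8^3 = 512 down to ~500 before multiplying by the membership):

```python
dx_ratio = 32 // 4           # 32-km to 4-km grid spacing: factor of 8
cost_factor = dx_ratio ** 3  # grid points plus time-step reduction: 8^3 = 512
members = 21                 # SREF membership
print(cost_factor, cost_factor * members)  # prints 512 10752
```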
Introduction/motivation (cont)
• Given computational expense, it would be useful to explore whether point of “diminishing returns” is reached with increasing ensemble size, n.
• 2009 Storm-scale ensemble forecast (SSEF) system provides a good opportunity to examine the issue because it is relatively large – 20 members; 17 members used herein.
– Decent number of warm-season cases (25).
Data and Methodology
• 2009 SSEF system configuration
– 10 WRF-ARW members
– 8 WRF-NMM members
– 2 ARPS members
Data and Methodology (cont.)
• Domain and cases examined…
[Figure: SSEF domain and analysis domain]
Data and Methodology (cont.)
• 6-hr accumulated rainfall examined for forecast hours 6-30.
• Stage IV rainfall estimates used for rainfall observations.
• Forecast probabilities for rainfall exceeding 0.10-, 0.25-, 0.50-, and 1.00-in. thresholds were computed by finding the location of the verification threshold within the distribution of ensemble members.
• Area under the relative operating characteristic curve (ROC area) used for objective evaluation. – 1.0 is perfect, 0.5 is no skill, and below 0.50 is negative skill
• ROC areas were computed for 100 unique combinations of randomly selected ensemble members for n = 2 to 15. For n = 1, 16, and 17 all possible combinations of members used.
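The verification chain above can be sketched end-to-end on synthetic data: member-counting exceedance probabilities (a simpler stand-in for the paper's distribution-based method) and a trapezoidal ROC area:

```python
import numpy as np

def exceedance_prob(ens, thresh):
    """Forecast probability as the fraction of members exceeding the threshold.
    `ens` has shape (n_members, n_points)."""
    return (ens > thresh).mean(axis=0)

def roc_area(probs, obs_event, n_bins=20):
    """Area under the ROC curve: trapezoidal integration of POD vs. POFD
    as the probability threshold is lowered from 1 to 0."""
    pods, pofds = [0.0], [0.0]
    for p in np.linspace(1.0, 0.0, n_bins + 1):
        yes = probs >= p  # issue a "yes" forecast where probability meets threshold
        hits = np.sum(yes & obs_event)
        misses = np.sum(~yes & obs_event)
        fas = np.sum(yes & ~obs_event)
        cns = np.sum(~yes & ~obs_event)
        pods.append(hits / max(hits + misses, 1))
        pofds.append(fas / max(fas + cns, 1))
    pods.append(1.0)
    pofds.append(1.0)
    return float(np.trapz(pods, pofds))

# Synthetic check: a "perfect" 17-member ensemble gives ROC area 1.0
rng = np.random.default_rng(0)
obs = rng.random(500) < 0.2                        # observed events at ~20% of points
ens = np.where(obs, 1.0, 0.0) + np.zeros((17, 1))  # every member exceeds the threshold exactly where observed
probs = exceedance_prob(ens, 0.5)
print(roc_area(probs, obs))  # prints 1.0
```

A constant (no-information) probability field yields an area of 0.5, matching the slide's no-skill reference value.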
Data and Methodology (cont.)
• Examination of different spatial scales:
– Observed and forecast precipitation averaged over increasingly large spatial scales. Why do this? Not all end users require accuracy at 4-km scales…
– For the verification, constant quantiles were used to compare across different scales. Constant thresholds cannot be compared because the precipitation distribution changes with scale.
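The upscaling-plus-constant-quantile idea above can be sketched as follows (the block-averaging scheme, the gamma-distributed toy field, and the 95th percentile are all illustrative choices, not the paper's exact settings):

```python
import numpy as np

def upscale(field, factor):
    """Block-average a 2-D field over factor x factor boxes (simple upscaling)."""
    ny, nx = (s // factor * factor for s in field.shape)  # trim to a multiple of factor
    f = field[:ny, :nx]
    return f.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

rng = np.random.default_rng(1)
rain = rng.gamma(0.5, 2.0, size=(64, 64))  # synthetic "rainfall" field

# The same quantile defines a comparable event threshold at every scale,
# whereas a fixed mm threshold does not, because averaging reshapes the distribution.
for factor in (1, 4, 8):
    coarse = upscale(rain, factor)
    print(factor, coarse.shape, float(np.quantile(coarse, 0.95)))
```

Note how the 95th-percentile threshold shrinks as the field is averaged to coarser scales, illustrating why fixed thresholds are not comparable across scales.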
Results
• Main result: more members are required to reach statistically indistinguishable ROC areas relative to the full ensemble as forecast lead time increases and spatial scale decreases (and ROC areas aren’t bad).
• Panels show ROC areas as a function of n.
• Different colors of shading correspond to different scales. Areas within each color show the interquartile range of the distribution of ROC areas (recall, for each n, ROC areas were computed for 100 unique combinations of members).
• For each color, the darker shading denotes ROC areas that are less than those of the full 17-member ensemble, with differences that are statistically significant.
RESULTS (cont.)
• Interpretation
– The rise in ROC area reflects the gain in skill as the forecast PDF is better sampled.
– More members are required to effectively sample a wider forecast PDF, so the n at which skill flattens should increase as the PDF broadens. Two variables in the analysis are associated with a widening of the PDF:
• 1) increasing forecast lead time (because model/analysis errors grow)
• 2) decreasing spatial scale (because errors grow faster at smaller scales)
• Caveats
– 1) Averages are presented. Specific cases with lower-than-average predictability (higher spread) should require more members to reach the point of diminishing returns. Low-probability events require more members to achieve reliable forecasts (e.g., Richardson 2001).
RESULTS (cont.)
• More caveats…
– The ensemble is under-dispersive.
– A reliable ensemble with more spread would require more members to effectively sample the forecast PDF.
CONCLUSIONS
• Spatial scale and forecast lead time needed for end users should be carefully considered in future convection-allowing ensemble designs.
• Future work needed to improve reliability of convection-allowing ensembles, and further evaluations are needed for weather regimes with varying degrees of predictability.
• QUESTIONS??