Verifying high-resolution forecasts
Advanced Forecasting Techniques
Forecast Evaluation and Decision Analysis, METR 5803
Guest Lecture: Adam J. Clark, 3 March 2011
Outline
• Background
• Clark, A. J., W. A. Gallus, M. L. Weisman, 2010: Neighborhood-based verification of precipitation forecasts from convection-allowing NCAR-WRF model simulations and the operational NAM.
• Clark, A. J., and Coauthors, 2011: Probabilistic precipitation forecast skill as a function of ensemble size and spatial scale in a convection-allowing ensemble.
Why is verifying high-resolution forecasts challenging?
• Scale/predictability issues – as finer scales are resolved, predictability time scales shorten. This was first shown mathematically by Ed Lorenz:
• Lorenz, E. N., 1969: The predictability of a flow which possesses many scales of motion. Tellus, 21, 289–307.
Challenges (cont.)
• Observations – many fields are not directly observed at high resolution. Thus, models and data assimilation systems are needed to create high-resolution analysis datasets. Of course, this introduces uncertainties that should be accounted for when doing model evaluation.
• At fine scales, forecasts with “small errors” can still be extremely useful. For example…
Challenges (cont.)
Most subjective evaluations would say forecast #2 was “better”, although all objective metrics indicate forecast #1 was better. How can we develop metrics consistent with human subjective impressions?
[Figure: OBS vs. forecasts #1 and #2 (credit: Baldwin et al. 2001)]
“Non-traditional” approaches
• Purely subjective – good, fair, poor, ugly; scale of 1 to 10, etc.
• Combination of subjective and objective – MCS example… manually categorize possible forecast outcomes into 2x2 contingency tables and then compute objective metrics.
• Objective methods (Casati et al. 2008 has a nice review):
– Feature-based (Ebert and McBride 2000; Davis et al. 2006) – involves defining attributes of “objects”.
– Scale decomposition (e.g., Casati et al. 2004) – evaluates skill as a function of amplitude and spatial scale of the errors.
– Neighborhood-based approaches (e.g., Ebert 2009) – consider values at grid points within a specified radius (i.e., a “neighborhood”) of an observation.
References so far…
• Casati, B., G. Ross, and D. B. Stephenson, 2004: A new intensity-scale approach for the verification of spatial precipitation forecasts. Meteor. Appl., 11, 141-154.
• Casati, B., and Coauthors, 2008: Forecast verification: Current status and future directions. Meteor. Appl., 15, 3-18.
• Baldwin, M. E., S. Lakshmivarahan, and J. S. Kain, 2001: Verification of mesoscale features in NWP models. Preprints, Ninth Conf. on Mesoscale Processes, Fort Lauderdale, FL, Amer. Meteor. Soc., 255-258.
• Ebert, E. E., and J. L. McBride, 2000: Verification of precipitation in weather systems: Determination of systematic errors. J. Hydrol., 239, 179–202.
• Ebert, E. E., 2009: Neighborhood verification: A strategy for rewarding close forecasts. Wea. Forecasting, 24, 1498-1510.
• Davis, C. A., B. Brown, and R. Bullock, 2006: Object-based verification of precipitation forecasts. Part II: Application to convective rain systems. Mon. Wea. Rev., 134, 1785–1795.
Neighborhood ETS applied to CAMs
• What was done and why…
– Examined a large set of convection-allowing forecasts run by NCAR during 2004-2008 and compared them to the operational NAM.
– Earlier work (e.g., Done et al. 2004) showed skill scores between coarser models (~10 km) and the convection-allowing NCAR-WRF were almost identical.
– This contradicted other findings… (Done et al. 2004, Fig. 5)
Neighborhood ETS (cont.)
• Perhaps the lack of differences was simply a result of inadequate “traditional metrics”.
• When forecasts contain fine-scale, high-amplitude features, slight displacement errors cause “double penalties” – observed-but-not-forecast and forecast-but-not-observed.
• Thus, many recent studies have developed alternative metrics:
– Feature-based (Ebert and McBride 2000; Davis et al. 2006)
– Scale-decomposition (Casati et al. 2004)
– Neighborhood-based (Roberts and Lean 2008; Ebert 2009)
Neighborhood ETS (cont.)
• To compare the NCAR-WRF and NAM forecasts, a neighborhood-based Equitable Threat Score (ETS) was developed.
• Traditional ETS is formulated in terms of contingency table elements (a = hits, b = misses, c = false alarms, d = correct negatives; N = a + b + c + d):

    ETS = (a - a_r) / (a + b + c - a_r), where a_r = (a + b)(a + c) / N

and is interpreted as the fraction of correctly predicted observed events, adjusted for hits associated with random chance.
• Neighborhood ETS is computed in terms of specified radii r – this study uses 20 to 250 km.
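A quick sketch of the traditional ETS computation described above, using toy contingency-table counts (the numbers are illustrative only):

```python
def ets(hits, misses, false_alarms, correct_negatives):
    """Equitable Threat Score from 2x2 contingency table counts."""
    n = hits + misses + false_alarms + correct_negatives
    # Expected number of hits from random chance, given the marginal totals
    hits_random = (hits + misses) * (hits + false_alarms) / n
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)

print(round(ets(50, 25, 25, 900), 2))  # prints 0.47
```

The neighborhood version uses exactly the same formula; only the counting of hits, misses, and false alarms changes.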
[Figure: schematic neighborhood – open circles: observed events; crosses: forecast events; grey shading: hits; r = 100 km, 81 total grid points]
Neighborhood ETS (cont.)
• To compute neighborhood ETS…
– Simply redefine contingency table elements in terms of r:
• Hits (traditional) – correct forecast of an event at a grid point.
• Hits (neighborhood) – an event is forecast at a grid point and observed at any grid point within r, or an event is observed at a grid point and forecast at any grid point within r.
• Misses (traditional) – event is observed at a grid point but not forecast.
• Misses (neighborhood) – event is observed at a grid point but not forecast at any grid point within r.
• False alarms (traditional) – event is forecast at a grid point but not observed.
• False alarms (neighborhood) – event is forecast at a grid point but not observed at any grid point within r.
• Correct negatives (traditional and neighborhood) – event is neither forecast nor observed at a grid point.
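A minimal sketch of the neighborhood counting rules above, on small binary grids (the brute-force circular search and the toy fields are illustrative; r is given in grid points rather than km):

```python
import numpy as np

def within_r(field, i, j, r):
    """True if any event in the binary `field` lies within radius r (grid points) of (i, j)."""
    ny, nx = field.shape
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            if di * di + dj * dj > r * r:
                continue  # outside the circular neighborhood
            ii, jj = i + di, j + dj
            if 0 <= ii < ny and 0 <= jj < nx and field[ii, jj]:
                return True
    return False

def neighborhood_counts(fcst, obs, r):
    """Neighborhood contingency counts: (hits, misses, false alarms, correct negatives)."""
    hits = misses = fas = cns = 0
    ny, nx = obs.shape
    for i in range(ny):
        for j in range(nx):
            f, o = fcst[i, j], obs[i, j]
            if f and within_r(obs, i, j, r):
                hits += 1      # forecast here, observed somewhere within r
            elif o and within_r(fcst, i, j, r):
                hits += 1      # observed here, forecast somewhere within r
            elif o:
                misses += 1    # observed here, no forecast within r
            elif f:
                fas += 1       # forecast here, no observation within r
            else:
                cns += 1       # neither forecast nor observed here
    return hits, misses, fas, cns

# Toy example: one forecast event displaced 4 grid points from one observed event
fcst = np.zeros((9, 9), dtype=bool); fcst[4, 2] = True
obs = np.zeros((9, 9), dtype=bool); obs[4, 6] = True
print(neighborhood_counts(fcst, obs, r=1))  # (0, 1, 1, 79)
print(neighborhood_counts(fcst, obs, r=5))  # (2, 0, 0, 79)
```

Note how a displaced feature flips from a miss plus a false alarm at small r to two hits at large r, which is exactly how the neighborhood approach relaxes the double penalty.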
Neighborhood ETS (cont)
• Domain and cases
Neighborhood ETS (cont.)
• Model configurations…
Neighborhood ETS (cont.)
• Results – 2004-2005
Neighborhood ETS (cont.)
• Results – 2007-2008
Neighborhood ETS (cont.)
• Time series at constant radii…
An aside: computing statistical significance for categorical metrics
• First, what are the ways to compute averages of categorical metrics over a set of cases?
– 1) Compute the metric for each case and then average over all cases.
– 2) Sum contingency table elements over all cases, and then compute the metric from the summed elements.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155-167.
• You have to be very careful computing statistical significance for precipitation forecasts! Key quote: “Precipitation forecast errors are often non-normally distributed and have spatially and/or temporally correlated error; the effective number of independent samples is much less than the total number of grid-points.”
• What does this mean??
• Hamill outlines several potential approaches for computing significance for threat scores. He finds a rather involved “resampling” approach is most appropriate for non-probabilistic forecasts.
Resampling

         Forecast 1    Forecast 2
Case 1   a, b, c, d    a, b, c, d
Case 2   a, b, c, d    a, b, c, d
Case 3   a, b, c, d    a, b, c, d
Case 4   a, b, c, d    a, b, c, d
Case 5   a, b, c, d    a, b, c, d
Case 6   a, b, c, d    a, b, c, d
Case 7   a, b, c, d    a, b, c, d
            ETS           ETS
Difference between unshuffled ETS is test statistic
         Forecast 1    Forecast 2
Case 1   a, b, c, d    a, b, c, d
Case 2   a, b, c, d    a, b, c, d
Case 3   a, b, c, d    a, b, c, d
Case 4   a, b, c, d    a, b, c, d
Case 5   a, b, c, d    a, b, c, d
Case 6   a, b, c, d    a, b, c, d
Case 7   a, b, c, d    a, b, c, d
            ETS           ETS
Randomly shuffle the sets of contingency table elements for each case and recompute the ETS difference. Repeat 1000 times. Construct a distribution of ETS differences and test whether the test statistic falls within that distribution. Don’t forget about bias!!
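The resampling procedure above can be sketched as a simple permutation test (the toy contingency values and the two-sided counting rule are illustrative choices, not Hamill's exact code):

```python
import random

def ets(a, b, c, d):
    """ETS from elements (a=hits, b=misses, c=false alarms, d=correct negatives)."""
    a_r = (a + b) * (a + c) / (a + b + c + d)
    return (a - a_r) / (a + b + c - a_r)

def ets_from_cases(cases):
    """Averaging method 2: sum elements over all cases, then compute ETS."""
    a, b, c, d = (sum(t[i] for t in cases) for i in range(4))
    return ets(a, b, c, d)

def resampling_pvalue(cases1, cases2, n_shuffles=1000, seed=0):
    """For each case, randomly swap which model the elements belong to,
    recompute the ETS difference, and compare to the unshuffled test statistic."""
    rng = random.Random(seed)
    observed = ets_from_cases(cases1) - ets_from_cases(cases2)
    extreme = 0
    for _ in range(n_shuffles):
        s1, s2 = [], []
        for t1, t2 in zip(cases1, cases2):
            if rng.random() < 0.5:
                t1, t2 = t2, t1  # shuffle: swap the two models for this case
            s1.append(t1)
            s2.append(t2)
        if abs(ets_from_cases(s1) - ets_from_cases(s2)) >= abs(observed):
            extreme += 1
    return extreme / n_shuffles  # two-sided p-value

# Toy data: three cases of (a, b, c, d) for each forecast system
cases1 = [(40, 20, 25, 915), (35, 30, 20, 915), (50, 15, 30, 905)]
cases2 = [(30, 30, 35, 905), (25, 40, 30, 905), (35, 30, 40, 895)]
print(resampling_pvalue(cases1, cases2))
```

As the slide warns, the frequency bias of the two forecasts should also be checked, since an apparent ETS difference can be an artifact of a bias difference.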
Neighborhood ETS: Compositing
• OFi is the observed frequency for each grid point within a conditional distribution of observed events.
• N is the # of grid points in the domain.
• n is the # of grid points within the radius-of-influence.
Neighborhood ETS (cont): Compositing
• Composite observations relative to forecasts.
Neighborhood ETS (cont.): Compositing
Neighborhood ETS (cont)
• Composite animations from SE2010 SSEF members with different microphysics…
Neighborhood ETS (cont.)
• Conclusions…
– Computing ETS on raw grids gave small differences. When the criterion for hits was relaxed using the “neighborhood” approach, more dramatic differences were seen that better reflected overall subjective impressions of the forecasts.
Probabilistic Precipitation Forecast Skill as a function of ensemble size and spatial scale in a convection-allowing ensemble
Adam J. Clark – NOAA/National Severe Storms Laboratory
John S. Kain, David J. Stensrud, Ming Xue, Fanyou Kong, Michael C. Coniglio, Kevin W. Thomas, Yunheng Wang, Keith Brewster, Jidong Gao, Steven J. Weiss, and Jun Du
13 October 2010
25th Conference on Severe Local Storms, Denver, CO
Introduction/motivation
• Basic idea
– Convection-allowing ensembles (~4-km grid spacing) provide added value relative to operational systems that parameterize convection:
• precipitation forecasts are improved,
• the diurnal rainfall cycle is better depicted,
• ensemble dispersion is more representative of forecast uncertainty, and
• explicitly simulated storm attributes can provide useful information on potential convection-related hazards.
– However, convection-allowing models are very computationally expensive relative to current operational “mesoscale” ensembles.
– EXAMPLE: Back-of-the-envelope calculation of the expense to make the SREF convection-allowing:
• 32 km to 4 km is a decrease in grid spacing by a factor of 8. To account for the increase in # of grid points and the time-step reduction, take 8^3 = 512 (~500), which gives an increase in computational expense by a factor of ~500. Then take 500 times 21 members = 10,500.
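The scaling arithmetic above can be checked directly (the slide rounds 8^3 = 512 down to ~500 before multiplying by the membership):

```python
dx_ratio = 32 // 4           # 32-km to 4-km grid spacing: factor of 8
cost_factor = dx_ratio ** 3  # grid points plus time-step reduction: 8^3 = 512
members = 21                 # SREF membership
print(cost_factor, cost_factor * members)  # prints 512 10752
```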
Introduction/motivation (cont)
• Given computational expense, it would be useful to explore whether point of “diminishing returns” is reached with increasing ensemble size, n.
• 2009 Storm-scale ensemble forecast (SSEF) system provides a good opportunity to examine the issue because it is relatively large – 20 members; 17 members used herein.
– Decent number of warm-season cases (25).
Data and Methodology
• 2009 SSEF system configuration
– 10 WRF-ARW members
– 8 WRF-NMM members
– 2 ARPS members
Data and Methodology (cont.)
• Domain and cases examined…
[Figure: SSEF domain and analysis domain]
Data and Methodology (cont.)
• 6-hr accumulated rainfall examined for forecast hours 6-30.
• Stage IV rainfall estimates used for rainfall observations.
• Forecast probabilities for rainfall exceeding 0.10-, 0.25-, 0.50-, and 1.00-in. thresholds were computed by finding the location of the verification threshold within the distribution of ensemble members.
• Area under the relative operating characteristic curve (ROC area) used for objective evaluation. – 1.0 is perfect, 0.5 is no skill, and below 0.50 is negative skill
• ROC areas were computed for 100 unique combinations of randomly selected ensemble members for n = 2 to 15. For n = 1, 16, and 17 all possible combinations of members used.
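The verification chain above can be sketched end-to-end on synthetic data: member-counting exceedance probabilities (a simpler stand-in for the paper's distribution-based method) and a trapezoidal ROC area:

```python
import numpy as np

def exceedance_prob(ens, thresh):
    """Forecast probability as the fraction of members exceeding the threshold.
    `ens` has shape (n_members, n_points)."""
    return (ens > thresh).mean(axis=0)

def roc_area(probs, obs_event, n_bins=20):
    """Area under the ROC curve: trapezoidal integration of POD vs. POFD
    as the probability threshold is lowered from 1 to 0."""
    pods, pofds = [0.0], [0.0]
    for p in np.linspace(1.0, 0.0, n_bins + 1):
        yes = probs >= p  # issue a "yes" forecast where probability meets threshold
        hits = np.sum(yes & obs_event)
        misses = np.sum(~yes & obs_event)
        fas = np.sum(yes & ~obs_event)
        cns = np.sum(~yes & ~obs_event)
        pods.append(hits / max(hits + misses, 1))
        pofds.append(fas / max(fas + cns, 1))
    pods.append(1.0)
    pofds.append(1.0)
    return float(np.trapz(pods, pofds))

# Synthetic check: a "perfect" 17-member ensemble gives ROC area 1.0
rng = np.random.default_rng(0)
obs = rng.random(500) < 0.2                        # observed events at ~20% of points
ens = np.where(obs, 1.0, 0.0) + np.zeros((17, 1))  # every member exceeds the threshold exactly where observed
probs = exceedance_prob(ens, 0.5)
print(roc_area(probs, obs))  # prints 1.0
```

A constant (no-information) probability field yields an area of 0.5, matching the slide's no-skill reference value.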
Data and Methodology (cont.)
• Examination of different spatial scales:
– Observed and forecast precipitation averaged over increasingly large spatial scales. Why do this? Not all end users require accuracy at 4-km scales…
– For the verification, constant quantiles were used to compare across different scales. Constant thresholds cannot be compared because the precipitation distribution changes with scale.
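The upscaling-plus-constant-quantile idea above can be sketched as follows (the block-averaging scheme, the gamma-distributed toy field, and the 95th percentile are all illustrative choices, not the paper's exact settings):

```python
import numpy as np

def upscale(field, factor):
    """Block-average a 2-D field over factor x factor boxes (simple upscaling)."""
    ny, nx = (s // factor * factor for s in field.shape)  # trim to a multiple of factor
    f = field[:ny, :nx]
    return f.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

rng = np.random.default_rng(1)
rain = rng.gamma(0.5, 2.0, size=(64, 64))  # synthetic "rainfall" field

# The same quantile defines a comparable event threshold at every scale,
# whereas a fixed mm threshold does not, because averaging reshapes the distribution.
for factor in (1, 4, 8):
    coarse = upscale(rain, factor)
    print(factor, coarse.shape, float(np.quantile(coarse, 0.95)))
```

Note how the 95th-percentile threshold shrinks as the field is averaged to coarser scales, illustrating why fixed thresholds are not comparable across scales.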
Results
• Main result: more members are required to reach statistically indistinguishable ROC areas relative to the full ensemble as forecast lead time increases and spatial scale decreases (and ROC areas aren’t bad).
• Panels show ROC areas as a function of n.
• Different colors of shading correspond to different scales. Areas within each color show the interquartile range of the distribution of ROC areas (recall, for each n, ROC areas were computed for 100 unique combinations of members).
• For each color, the darker shading denotes ROC areas that are less than those of the full 17-member ensemble, with differences that are statistically significant.
RESULTS (cont.)
• Interpretation
– The rise in ROC area reflects the gain in skill as the forecast PDF is better sampled.
– More members are required to effectively sample a wider forecast PDF, so the n at which skill flattens should increase as the PDF broadens. Two variables in the analysis are associated with a widening of the PDF:
• 1) increasing forecast lead time (because model/analysis errors grow)
• 2) decreasing spatial scale (because errors grow faster at smaller scales)
• Caveats
– 1) Averages are presented. Specific cases with lower-than-average predictability (higher spread) should require more members to reach the point of diminishing returns. Low-probability events require more members to achieve reliable forecasts (e.g., Richardson 2001).
RESULTS (cont.)
• More caveats…
– The ensemble is under-dispersive.
– A reliable ensemble with more spread would require more members to effectively sample the forecast PDF.
CONCLUSIONS
• Spatial scale and forecast lead time needed for end users should be carefully considered in future convection-allowing ensemble designs.
• Future work needed to improve reliability of convection-allowing ensembles, and further evaluations are needed for weather regimes with varying degrees of predictability.
• QUESTIONS??