
A Framework to Determine the Limits of Achievable Skill for Interannual to Decadal Climate Predictions

Yiling Liu1, Markus G. Donat1,2, Andréa S. Taschetto1, Francisco J. Doblas‐Reyes3,2, Lisa V. Alexander1, and Matthew H. England1

1Climate Change Research Centre and ARC Centre of Excellence for Climate Extremes, UNSW, Sydney, New South Wales, Australia, 2Barcelona Supercomputing Center, Barcelona, Spain, 3ICREA, Barcelona, Spain

Abstract In interannual to decadal predictions, forecast quality may arise from the initial state of the system, from long‐term changes due to external forcing such as the increase in greenhouse gas concentrations, and from internally generated variability in a model. In this study, we use a new framework to investigate the achievable skill of decadal predictions by comparing perfect‐model prediction experiments with predictions of the real world, in order to identify margins for possible improvements to prediction systems. In addition, we assess the added value from capturing the initial state of the climate system, over and above changes due to climate forcing, in decadal predictions, focusing on annual average near‐surface temperature. We find that ideal initialization may substantially improve the predictions during the first two lead years, particularly in parts of the Southern Ocean, Indian Ocean, the tropical Pacific and North Atlantic, and some surrounding land areas (the lead time is the elapsed time since the beginning of a prediction). On longer time scales, the predictions rely more on model performance in simulating low‐frequency variability and long‐term changes due to external forcing. This framework identifies the limits of predictability using the National Center for Atmospheric Research Community Climate System Model version 4 and clarifies the margins of achievable improvements from enhancing different components of the prediction system, such as initialization, response to external forcing, and internal variability. We encourage similar experiments to be performed using other climate models, to better understand the dependence of predictability on the model used.

1. Introduction

Regional climate is subject to substantial variability on interannual to decadal time scales, and the ability to provide useful climate predictions on these time scales could be beneficial for numerous aspects of economic and social planning. The dominant uncertainty for simulations on the decadal time scale comes from internal climate variability, although model errors, future scenarios of greenhouse gas emissions, and other climate forcings such as aerosols and ozone can also play a role (Hawkins & Sutton, 2009).

The field of decadal prediction has seen rapid development and growth over the past decade (Goddard et al., 2013; Guemas et al., 2013; Marotzke et al., 2016; Meehl et al., 2014; van Oldenborgh et al., 2012; Yeager et al., 2018). Decadal climate predictions are achieved by climate model simulations that are initialized to the observed state of the climate and then run as freely coupled simulations for 10 years, incorporating multiple ensemble members to cover a range of possible realizations. Although these prediction systems show some skill for decadal predictions of large‐scale features (Doblas‐Reyes et al., 2013), the skill seems to come primarily from changes in external forcing (Doblas‐Reyes et al., 2013; van Oldenborgh et al., 2012). Studies based on state‐of‐the‐art decadal prediction systems have identified added value from initialization for the first few forecast years (Marotzke et al., 2016; Yeager et al., 2018), but it remains unclear to what extent improving initialization may improve predictions on multiyear time scales in comparison to uninitialized historical simulations. It is therefore also unclear to what extent current prediction systems reach their achievable potential to provide useful information additional to general warming trends, for example, to users from specific sectors, and what level of skill may be achievable in the future. More broadly, it is possible that these current limitations in skill are due to imperfect initialization of the prediction runs, shortcomings in the model's representation of real‐world climate variability, or simply the intrinsic limits of predictability based on prescribed initial conditions.

©2019. American Geophysical Union. All Rights Reserved.

RESEARCH ARTICLE 10.1029/2018JD029541

Key Points:
• A perfect‐model framework can be useful to determine the achievable skill in decadal predictions given ideal initialization
• There is widespread added value from initialization, compared to uninitialized simulations, in the first two forecast years of predictions
• There are substantial margins to potentially improve real‐world climate predictions by improving both initialization and climate models

Supporting Information:
• Supporting Information S1

Correspondence to: Y. Liu, [email protected]

Citation: Liu, Y., Donat, M. G., Taschetto, A. S., Doblas‐Reyes, F. J., Alexander, L. V., & England, M. H. (2019). A framework to determine the limits of achievable skill for interannual to decadal climate predictions. Journal of Geophysical Research: Atmospheres, 124, 2882–2896. https://doi.org/10.1029/2018JD029541

Received 24 AUG 2018
Accepted 21 FEB 2019
Accepted article online 27 FEB 2019
Published online 15 MAR 2019

Author Contributions:
Conceptualization: Yiling Liu, Markus G. Donat
Funding acquisition: Markus G. Donat
Methodology: Yiling Liu, Markus G. Donat
Project administration: Markus G. Donat
Resources: Markus G. Donat
Software: Yiling Liu, Markus G. Donat, Andréa S. Taschetto
Supervision: Markus G. Donat, Lisa V. Alexander, Matthew H. England
Writing – review & editing: Yiling Liu, Markus G. Donat, Andréa S. Taschetto, Francisco J. Doblas‐Reyes, Lisa V. Alexander, Matthew H. England


So‐called perfect model approaches have previously been used to study predictability (Branstator & Teng, 2010; Collins, 2002; Collins et al., 2006; Griffies & Bryan, 1997; Grötzner et al., 1999; Pohlmann et al., 2004), that is, "a limit to the accuracy with which forecasting is possible" (Lorenz, 1969). These approaches circumvent the inconsistencies that exist between models and the real world by aiming to predict the climate of the model used to make the predictions, rather than attempting to predict the real‐world climate itself. In particular, these perfect‐model experiments avoid the problems related to initializing decadal predictions to the real‐world (observed) climate, namely, imperfect initialization due to very limited observations and inconsistencies between the real world and the model climate. Both these factors can lead to initialization shock and climate drift (Mulholland et al., 2015; Sanchez‐Gomez et al., 2016; Teng et al., 2017). However, perfect‐model predictability has usually been investigated using different metrics (Branstator & Teng, 2010; Collins et al., 2006; Pohlmann et al., 2004) than those used for the evaluation of real‐world predictions (Goddard et al., 2013). As such, previous investigations of perfect‐model predictability are not suitable to define benchmarks for real‐world predictions.

We are aware of a few studies that have compared potential predictability with real‐world prediction skill (Boer et al., 2013; Doblas‐Reyes et al., 2011). In particular, Boer et al. (2013) systematically compared the predictable variance based on the Canadian Centre for Climate Modelling and Analysis (CCCma) coupled climate model with observations based on total variance, and found potential improvement of decadal prediction skill over the Pacific, North Atlantic, and Southern Oceans. However, we are not aware of any other studies examining perfect‐model versus real‐world predictability using different models or methods more commonly used to determine decadal prediction skill. In this study, we use the National Center for Atmospheric Research (NCAR) Community Climate System Model version 4 (CCSM4; Gent et al., 2011) and a set of metrics recommended for the evaluation of decadal predictions (Goddard et al., 2013) to investigate the potentially achievable improvement of decadal prediction skill from initialization and from model enhancement, respectively, in a perfect model approach. Our objective here is to study the potentially achievable limits of prediction skill based on perfect‐model experiments with a setup comparable to real‐world predictions. We therefore assess predictability in a transiently forced climate where both initial conditions and external forcing serve as potential sources of predictability.

We use the annual mean near‐surface temperature as an example to propose a framework for investigating the limits and potential improvements of interannual to decadal prediction skill. This framework can similarly be applied to other climate variables and will hopefully be used with other models to understand model differences in predictability.

2. Data and Methodology

2.1. The Model

We use the NCAR Community Earth System Model (CESM) version 1.0.5 for our experiments. CESM is a general circulation climate model consisting of atmosphere, land, ocean, and sea ice components that are linked through a coupler. The configuration used for this study is the same as the one used to generate the Coupled Model Intercomparison Project phase 5 (CMIP5; Taylor et al., 2012) historical and Representative Concentration Pathway (RCP) scenarios for the CCSM4 (Gent et al., 2011). The atmospheric component is the Community Atmosphere Model (CAM4) with a resolution of 1.25° × 0.9° (longitude/latitude; Neale et al., 2013). The ocean uses the Parallel Ocean Program version 2 (Smith et al., 2010), the land component is the Community Land Model version 4 (Lawrence et al., 2011), and the sea ice component is the Community Ice Code version 4 (Hunke & Lipscomb, 2008). The land and sea ice components share the same horizontal grids as the atmosphere and ocean models, respectively. In this paper, we use the term CCSM4 to refer to the simulations done in the CMIP5 project, whereas CESM refers to simulations we ran ourselves.

2.2. The Historical Reference Run

We first performed a CESM historical reference run by integrating a 20th‐century historical forcing simulation (a simulation that incorporates both anthropogenic and natural external forcings) from 1850 to 2005 and then applying the RCP4.5 scenario from 1 January 2006 to the end of 2015. Choosing RCP4.5 as the scenario for 2006 to 2015 is somewhat arbitrary, but the different RCP scenarios are very similar during the first decade, and therefore no significant differences would be expected if another scenario were used. Our CESM historical reference uses exactly the same setup and initial state as the r3i1p1 member of the CCSM4 historical simulations (see http://www.cesm.ucar.edu/CMIP5/forcing_information/ for more information about CCSM4 r3i1p1). The reason that we ran our own CESM historical simulation to generate the initialization data for the decadal prediction ensemble members (rather than use the ready‐to‐use CCSM4 historical run as the initial data of the decadal ensemble members) is to avoid any round‐off error differences that can arise across different computational platforms. Numerical round‐off error effectively adds an additional perturbation to the system. Furthermore, we required annual restart files to initialize the perfect model predictions (see section 2.3), and these were written out by our historical reference run.

2.3. The Initialized Perfect Model Predictions

The initialized perfect model predictions are initialized from every 1 January in the 46 different years from 1961 to 2006, using annual restart files from the historical reference run (section 2.2). Five ensemble members are generated for these decadal simulations by adding small Gaussian perturbations of order 10−5 K (i.e., smaller than possible to measure in real‐world observations) to every grid point of the near‐surface temperature field of the CESM historical reference run; these perfect‐model decadal predictions are then integrated over 10 years. The analysis is based on decade‐long coupled experiments with 46 starting years and five ensemble members for each starting year.
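For concreteness, the following is a minimal numpy sketch of this ensemble‐generation step; the function name and the input array t_init are illustrative only (in practice the perturbations were applied to the temperature field of the model restart files):

```python
import numpy as np

def perturb_initial_state(t_init, n_members=5, sigma=1e-5, seed=0):
    """Create ensemble initial states by adding tiny Gaussian noise.

    t_init    : 2-D (lat, lon) array of near-surface temperature (K),
                here a hypothetical stand-in for a model restart field
    n_members : number of ensemble members (5 in this study)
    sigma     : perturbation standard deviation, of order 1e-5 K, i.e.,
                smaller than real-world observations could measure
    """
    rng = np.random.default_rng(seed)
    # Draw an independent Gaussian perturbation for every member and grid point.
    noise = rng.normal(0.0, sigma, size=(n_members,) + t_init.shape)
    return t_init[np.newaxis, ...] + noise
```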

An alternative method to generate ensembles for initialized decadal predictions is so‐called lagged initialization, whereby the initial atmospheric state for the simulations starting on 1 January is taken from the model state a few days earlier or later (Marotzke et al., 2016; Pohlmann et al., 2013). We tested both approaches, but the different simulations developed similarly, and there was no difference in near‐surface air temperature correlations beyond the first two lead months, as also reported by Hawkins et al. (2016). We therefore used the small temperature perturbations for the prediction experiment, as this is in line with our concept of investigating predictability from ideal (almost identical) initial states. Another approach to generate decadal prediction ensembles uses different seeds in a stochastic physics scheme (Dunstone et al., 2016; MacLachlan et al., 2015); it is, however, unclear to what extent such perturbations to the model physics would be suitable in a perfect‐model context.

2.4. The Initialized Real‐World Decadal Hindcasts

The initialized real‐world decadal hindcasts of CCSM4 (Yeager et al., 2012) are part of the CMIP5 protocol (referred to as CCSM4 decadal hindcasts in the CMIP5 archive; Taylor et al., 2012). These hindcasts consist of retrospectively initialized simulations starting from 1 January of 14 different start years (1961, 1966, 1971, …, 1996, 2001, 2002, 2003, 2004, 2005, and 2006). Simulations from each start date include 10 ensemble members integrated through at least 10 years. Note that only the ocean and sea ice components are initialized from a CORE forcing experiment, that is, a CCSM4 ocean/sea ice stand‐alone simulation forced with the NCEP/NCAR atmospheric reanalysis data. The atmosphere and land components are used to create ensemble members by randomly choosing atmosphere and land initial conditions from different 20th century runs and/or from different days in December. We limit our analysis to the first 10 years and five ensemble members of the decadal hindcast runs, to be consistent with our perfect model prediction experiment.

2.5. The Uninitialized Climate Predictions

CCSM4 historical simulations starting from different states of a preindustrial control run are used in this study as ensembles of uninitialized climate predictions. These characterize the predictability (of both the real world and the historical reference climate simulation) based only on changes in external forcing (i.e., greenhouse gas concentrations, ozone, and aerosols). There are six historical simulations available within CMIP5 (r1i1p1–r6i1p1), but we only use five of them (r1i1p1, r2i1p1, r4i1p1, r5i1p1, and r6i1p1) because r3i1p1 has the same starting conditions as our reference run. This also ensures we analyze the same number of ensemble members as in our initialized perfect model predictions. We use data from the CCSM4 historical simulations before 1 January 2006, whereas from 1 January 2006 to 31 December 2015 data from the RCP4.5 scenario runs are used. These uninitialized simulations serve to quantify the predictability from external forcings. In comparison to the initialized runs (sections 2.3 and 2.4), these runs help to identify the added value of initialization, as the difference in skill between the initialized and uninitialized runs can be attributed to initializing the predictions.


2.6. Skill Metrics

To evaluate the forecast quality of the decadal predictions (both perfect‐model and real‐world), we follow the recommendations for decadal predictions outlined by Goddard et al. (2013) and implemented in the Miklip evaluation system (Illing et al., 2014). To evaluate the prediction accuracy, the primary deterministic verification metric is the mean squared skill score (MSSS; also termed mean squared error skill score, MSESS). When the climatological forecast is used as reference, MSSS can be decomposed as

$$\mathrm{MSSS}\left(H, O, \overline{O}\right) = r_{HO}^{2} - \left(r_{HO} - \frac{S_{H}}{S_{O}}\right)^{2} - \left(\frac{\overline{H} - \overline{O}}{S_{O}}\right)^{2} \qquad (1)$$

with $r_{HO}$ the sample correlation between the hindcasts and the observations, $S_{H}^{2}$ the sample variance of the ensemble mean hindcasts, $S_{O}^{2}$ the sample variance of the observations, $\overline{O} = \frac{1}{n}\sum_{j=1}^{n} O_{j}$ the climatological forecast (where $O_{j}$ represents the observations or the perfect‐model reference, respectively, over $j = 1, \ldots, n$ start times), and $\overline{H}$ the mean hindcast (Murphy, 1988; Murphy & Epstein, 1989). Here MSSS is expressed as a function of the hindcasts, the climatological forecast, and the observations. Note that the second term on the right‐hand side of this equation denotes the conditional prediction bias, and the last term represents the unconditional bias (Kadow et al., 2016). Conditional prediction bias refers to cases where the bias changes with time, with the prediction and the (observational or perfect‐model) reference exhibiting different rates of change. Unconditional bias refers to differences in the mean states.

To measure the ensemble dispersion (or ensemble spread), we use the logarithmic ensemble spread score (LESS; Kadow et al., 2016), which is defined as the natural logarithm of the average ensemble spread (ensemble variance) divided by the reference mean square error. In other words,

$$\mathrm{LESS} = \ln\left(\frac{\sigma_{\hat{H}}^{2}}{\sigma_{R}^{2}}\right) \qquad (2)$$

where

$$\sigma_{\hat{H}}^{2} = \frac{1}{n}\sum_{j=1}^{n} \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{H}_{ij} - \overline{\hat{H}}_{j}\right)^{2} \qquad (3)$$

$$\sigma_{R}^{2} = \frac{1}{n-2}\sum_{j=1}^{n}\left(\overline{\hat{H}}_{j} - O_{j}\right)^{2} \qquad (4)$$

Here $\sigma_{\hat{H}}^{2}$ is the average ensemble variance and $\sigma_{R}^{2}$ is the reference mean square error. The initialized hindcasts $H_{ij}$ consist of the ensemble members $i = 1, \ldots, m$ and the start times $j = 1, \ldots, n$; $\hat{H}_{ij}$ and the ensemble mean $\overline{\hat{H}}_{j}$ denote the hindcasts corrected for mean and conditional bias, and $O_{j}$ is the observations over the $j = 1, \ldots, n$ start times. If LESS is <0, the ensemble is underdispersive (i.e., overconfident), and if LESS is >0, the ensemble is overdispersive (i.e., underconfident). Note that due to the small sample size the variance calculation may not be robust; still, despite being subject to uncertainty, this analysis should give a general indication of the ensemble behavior.
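A corresponding sketch of equations (2)–(4), assuming the hindcast members have already been corrected for mean and conditional bias:

```python
import numpy as np

def less(hindcast_members, obs):
    """Logarithmic ensemble spread score, equations (2)-(4).

    hindcast_members : (n start times, m members) array of bias-corrected
                       hindcasts
    obs              : (n,) array of reference values O_j
    Returns LESS: < 0 means underdispersive (overconfident), > 0 means
    overdispersive (underconfident).
    """
    n = hindcast_members.shape[0]
    ens_mean = hindcast_members.mean(axis=1)
    # Equation (3): ensemble variance (1/(m-1) normalization), averaged
    # over the n start times.
    spread = hindcast_members.var(axis=1, ddof=1).mean()
    # Equation (4): reference mean square error with n - 2 in the denominator.
    mse_ref = ((ens_mean - obs) ** 2).sum() / (n - 2)
    return np.log(spread / mse_ref)
```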

We use 200 bootstrap realizations (with replacement) to estimate the significance of MSSS, LESS, and the differences between skills. Each bootstrap realization is a different combination of starting years, whereby the resampling maintains continuity in 5‐year blocks to preserve the temporal autocorrelation related to low‐frequency variability (Goddard et al., 2013). The resampling is performed on the hindcast ensemble mean, and significance is assessed at the p ≤ 0.05 (two‐tailed) level. To determine whether skill differences between different experimental setups are significantly different from 0, the same bootstrapping sequence is applied to the pairs of skill metrics. The differences are considered significantly different from 0 if zero lies outside the 2.5th to 97.5th percentile range of the bootstrapped differences. Results based on 200 bootstrap samples were found to be reasonably robust in comparison to, for example, 1,000 bootstrap samples. We therefore chose 200 bootstrap samples to estimate significance in this study, as this appears a good compromise between the required computing and storage resources and the robustness of the significance estimates.
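One plausible implementation of this resampling is sketched below; the exact block handling in the Miklip tools may differ, and the wrap‐around at the end of the record is our simplification:

```python
import numpy as np

def block_bootstrap_indices(n_starts, block=5, n_boot=200, seed=0):
    """Resample start-year indices with replacement in 5-year blocks,
    preserving temporal autocorrelation (Goddard et al., 2013).

    Returns an integer array of shape (n_boot, n_starts); each row is one
    bootstrap combination of starting years.
    """
    rng = np.random.default_rng(seed)
    n_blocks = -(-n_starts // block)  # ceiling division
    out = np.empty((n_boot, n_blocks * block), dtype=int)
    for b in range(n_boot):
        # Draw block starting points with replacement, then expand each into
        # a run of consecutive start years, wrapping at the end of the record.
        starts = rng.integers(0, n_starts, size=n_blocks)
        out[b] = ((starts[:, None] + np.arange(block)) % n_starts).ravel()
    return out[:, :n_starts]
```

Recomputing, for example, MSSS for each resampled row then yields a bootstrap distribution whose 2.5th and 97.5th percentiles provide the two‐tailed p ≤ 0.05 bounds.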

To minimize effects from small‐scale noise, the model temperature data are all remapped to a 10° × 10° grid prior to calculating the skill measures, as recommended by Goddard et al. (2013).


2.7. Framework to Compare Predictability and Margins of Possible Improvements

Two different references are used in this paper to assess the prediction skill:

Ref1. Near‐surface temperature observations: We use three different gridded observational temperature data sets to evaluate real‐world prediction skill. In the main text, this is taken from the Hadley Centre/Climatic Research Unit Temperature version 4 (HadCRUT4) data set, which provides global and regional temperature evolution from 1850 to the present on a monthly grid of 5° latitude by 5° longitude. HadCRUT4 comes as 100 realizations for error sampling (Morice et al., 2012). We use the HadCRUT4 median as the observational reference (i.e., for real‐world predictions) in our study. To account for uncertainty and errors in observational data sets, we check the robustness of the skill estimates to the choice of observational reference by also evaluating real‐world hindcast skill against the National Oceanic and Atmospheric Administration's Merged Land‐Ocean Surface Temperature Analysis (MLOST; Smith et al., 2008; Vose et al., 2012; Figures S4 and S5 in the supporting information) and the Goddard Institute for Space Studies surface temperature analysis (GISTEMP; Hansen et al., 2010); these are included in the supporting information.

Ref2. Perfect model temperature reference: We use our CESM historical (until 2005) and RCP4.5 (2006–2015) run (i.e., the simulation used to provide the initial conditions for the perfect‐model decadal simulations) as the reference to evaluate the skill of the CESM perfect model simulations started between 1961 and 2005.

The following four different skills are calculated by combining different predictions and references (see also Table 1).

1. The prediction skill of the initialized perfect model: We use the perfect model temperature reference (Ref2) to assess the prediction skill of the initialized perfect model predictions. The model is assumed to be consistent with its evaluation reference, as the reference is from the same model as the predictions, which is why this is called a perfect model. This setup therefore provides an estimate of the predictability in the CESM model world given ideal initialization.

2. The prediction skill of the uninitialized perfect model: Here we evaluate the set of uninitialized simulations available from the CMIP5 archive regarding their skill in predicting our reference run (Ref2). As there is no common initial state, all the skill of these runs would come from changes in external forcing. The skill assessment is also based on all years from 1961 to 2006.

3. The skill of real‐world decadal hindcasts: The skill of real‐world decadal hindcasts is assessed by comparing the CCSM4 decadal hindcasts with the observational temperature reference (Ref1, e.g., the HadCRUT4 median). This determines the real‐world decadal prediction skill of the CMIP5 CCSM4 decadal experiments, only including simulations of the 14 available start years (1961, 1966, …, 1996, 2001, 2002, …, 2006). Note that this is different from the above cases (i) and (ii), where predictions started from every year from 1961 to 2006. Comparable figures for these other cases, using only the 14 initial years for which real‐world hindcasts are available, are included in the supporting information.

4. The skill of uninitialized real‐world climate predictions: This skill test uses the different CCSM4 historical simulations (r1i1p1, r2i1p1, r4i1p1, r5i1p1, and r6i1p1) as the ensemble members to compare with the observational temperature reference (Ref1). Just like the uninitialized perfect model case, the skill assessment is based on the 46 start years 1961 to 2006. This reveals how external forcings can influence decadal predictions.

We then compare three pairs of skills from above to examine the potential improvements from perfect initialization (i minus ii), model–real‐world inconsistencies (ii minus iv), and the overall improvement margin of current initialized real‐world predictions in comparison to initialized perfect‐model predictions (i minus iii), respectively.
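In array terms, these three comparisons are simple differences of the gridded skill maps; a schematic sketch with hypothetical variable names (placeholder zero fields stand in for the actual MSSS maps):

```python
import numpy as np

# Placeholder gridded MSSS maps for the four skill assessments (i)-(iv);
# in practice each comes from applying the MSSS at every cell of the
# remapped 10 x 10 degree fields (18 x 36 cells globally).
msss_i = np.zeros((18, 36))    # initialized perfect model vs. Ref2
msss_ii = np.zeros((18, 36))   # uninitialized perfect model vs. Ref2
msss_iii = np.zeros((18, 36))  # real-world decadal hindcasts vs. Ref1
msss_iv = np.zeros((18, 36))   # uninitialized climate predictions vs. Ref1

gain_from_ideal_initialization = msss_i - msss_ii  # (i) minus (ii)
model_vs_real_world_gap = msss_ii - msss_iv        # (ii) minus (iv)
overall_improvement_margin = msss_i - msss_iii     # (i) minus (iii)
```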

Table 1. Four Different Skills Assessed by Combining the Different Predictions With the Different Reference Fields

Near‐surface temperature observations (Ref1):
• with the initialized real‐world decadal hindcasts (section 2.4): the skill of real‐world decadal hindcasts (iii)
• with the uninitialized climate predictions (section 2.5): the skill of the uninitialized real‐world climate predictions (iv)

Perfect model temperature reference (Ref2):
• with the initialized perfect model predictions (section 2.3): the prediction skill of the initialized perfect model (i)
• with the uninitialized climate predictions (section 2.5): the prediction skill of the uninitialized perfect model (ii)



3. Results

3.1. Accuracy

We first analyze the MSSS of near‐surface temperature for the uninitialized and initialized perfect model predictions (comparing i and ii, Figure 1). We quantify the added value of initialization for predicting the CESM historical reference simulation based on the simulations from all 46 initialization points, 1 January 1961 to 1 January 2006. The uninitialized perfect model shows significant skill over parts of central Asia, the Indian Ocean, the West Pacific, North America, and most of the Atlantic in lead years 1 and 2 (Figures 1a and 1b). The skill improves considerably in lead years 2–5 and 2–9 everywhere on the globe except for the region north of 50°N between the North Atlantic Ocean and Alaska (Figures 1c and 1d). With initialization, most areas of the globe show significant skill in lead year 1, but the skill in lead year 2 relative to lead year 1 is reduced, particularly over the Pacific (Figures 1e and 1f). The skill for lead years 2–5 and 2–9 increases almost everywhere, and the general skill patterns of the initialized runs (Figures 1g and 1h) are very similar to the case with no initialization. The regions in Figure 1 where the skill of the initialized runs is improved over the uninitialized runs are mostly in the tropics (in the first 2 years) and the North Atlantic. These improvements highlight the regions where it appears to be beneficial to phase in the internal variability of the predictions with the variability of the reference climate used for evaluation, for example, by initializing from specific conditions of the El Niño Southern Oscillation (ENSO; McPhaden et al., 2006) or the Atlantic Multidecadal Oscillation (AMO; Schlesinger & Ramankutty, 1994).

Figure 1. The mean squared skill score maps of the mean near‐surface temperature for the Community Earth System Model (CESM) (a–d) uninitialized perfect model, (e–h) the initialized perfect model, and (i–l) the difference map between the left and middle columns. The first row shows the skill for lead year 1, the second row lead year 2, the third row lead years 2–5 average temperatures, and the fourth row lead years 2–9 average temperatures. Stippling indicates grid cells where the skill score or skill differences are significant at p ≤ 0.05 (two‐tailed), based on 200 bootstrap realizations.


Both the initialized and uninitialized perfect model predictions reveal negative skill over Greenland and Alaska, although initialization does positively contribute to the skill over these areas (Figure 1). Negative skill in the uninitialized perfect model case means there is an inconsistency between the prediction ensembles and the reference in terms of the warming trend. To test this, we compared the warming of the near‐surface temperature between the uninitialized climate predictions and the perfect model temperature reference and found that the warming trend over Greenland and Alaska in the perfect model temperature reference is much lower than that of the uninitialized climate predictions (Figures S1e, S1h, S1k, S1n, and S1q).

The difference in skill between the initialized perfect model and the uninitialized perfect model decreases with lead time: In lead year 1, 25% of areas have significant skill contributed by initialization, followed by 11% in lead year 2 (Table 2). The number drops below 3% in lead years 2–5 and then increases to 7% in lead years 2–9. This later increase in the area where initialization significantly improves the prediction in lead years 2–9 may point to predictability from low‐frequency (decadal) variability, but may also be affected by different temporal smoothing, and could also be due to pure chance (Livezey & Chen, 1983), in particular as there is also a significant area (6%, i.e., more than expected by chance) where initialized runs show lower skill compared to the uninitialized runs. These results indicate that in lead years 1 and 2, there are substantially larger areas with improved prediction skill due to initialization. These are primarily located in the tropical Pacific (mainly lead year 1), parts of the Southern Ocean, Indian Ocean, and North Atlantic, including some surrounding land grid cells, for example, in Africa. Over longer lead times beyond 2 years, in contrast, we cannot detect significant improvements in prediction skill from initialization compared to the uninitialized runs at the global scale. However, significantly improved skill on the decadal scale (lead years 2–9) is found in the tropical Pacific, parts of the Indian Ocean, and the North Atlantic (Figure 1l), but these increases are partly compensated by decreased skill over larger parts of Asia and the tropical Atlantic.

We next compared the skill of the uninitialized real‐world climate predictions with the uninitialized perfect model (comparing cases ii and iv) to investigate the margin for models to improve their ability to predict responses to external forcing, based on 46 initialization points (Figure 2). Note that two different references are used to evaluate the skill in ii (perfect‐model reference) and iv (observational reference). The uninitialized climate predictions show relatively poor prediction skill in the first two lead years (where initialization has been shown to play a substantial role in the analysis above), whereas the skill increases substantially in lead years 2–5 and 2–9 except in the North Pacific. The uninitialized perfect‐model predictions show negative skill over Greenland and Alaska at all lead times (Figures 2e–2h; note that these are the same as Figures 1a–1d). The difference in skill between the uninitialized perfect model and the uninitialized real‐world climate predictions shows significant improvement in the perfect model over the South Atlantic and large parts of the Pacific and north Indian Ocean. The perfect model significantly outperforms the real‐world climate predictions over 20% of the global area in lead year 1, 19% in lead year 2, and around 36% in lead years 2–5 and 2–9 (Figures 2i–2l). Less than 2.5% of areas show a significantly negative difference, which is less than expected by chance (Table 2). As the skill of the uninitialized simulations primarily comes from the warming trend, the difference indicates that the long‐term behavior, in particular in the Pacific Ocean, is inconsistent between the model and the real world (Figure S1). The observed trend in the North Pacific is negative (Figure S1), and it is not obvious whether, over 46 years, this trend is dominated by the forced response rather than by internal processes or aerosols. The inconsistency between the model and observations could point to possible model shortcomings or simply indicate that the long‐term observed changes in these regions are highly unusual compared to what is simulated by the model. Indeed, England et al. (2014) showed that recent observed tropical Pacific decadal trends can be well outside the envelope of forced and natural variability resolved by CMIP5 climate models, including CCSM4. Most CMIP5 models also show stronger than observed warming in this region compared to recent trends (Power et al., 2016), with a poor representation of Atlantic–Pacific tropical interactions the likely cause (Kajtar et al., 2017; McGregor et al., 2018).

Table 2. Changes of MSSS in Three Pairs of Comparisons

MSSS comparison                        Lead year   (1) Significant      (2) Significant
                                                   improvement (%)      negative difference (%)
Initialized perfect model versus       1           25                   ‐‐
uninitialized perfect model            2           11                   3
                                       2–5         ‐‐                   8
                                       2–9         7                    6
Uninitialized perfect model versus     1           20                   ‐‐
uninitialized climate predictions      2           19                   ‐‐
                                       2–5         36                   ‐‐
                                       2–9         36                   ‐‐
Initialized perfect model versus       1           38                   ‐‐
decadal hindcasts                      2           17                   ‐‐
                                       2–5         51                   ‐‐
                                       2–9         47                   ‐‐

Note. The percentage of areas for the mean squared skill score (MSSS) in three pairs of comparisons where (1) there is significant improvement and (2) there is a significant negative difference. Grid cells are found significant at p ≤ 0.05 (two‐tailed), based on 200 bootstrap realizations. All percentage values are rounded to 1% accuracy, and values less than 3% are not shown (marked '‐‐'), as it is expected that 2.5% of the areas could be significant by chance.


The CESM model further shows low skill in predicting upper ocean heat content in the tropical Pacific (Yeager et al., 2018). These inconsistencies suggest that there could be considerable margin to improve the predictive skill of current models by improving their simulation of decadal changes, particularly in the Pacific Ocean.

We also analyzed prediction skill for some indices of large‐scale climate variability that aggregate ocean surface temperatures over larger regions (NINO3.4 and AMO; see Figures S8 and S9). For these indices we also find positive MSSS combined with added value from initialization primarily in the first lead year in the perfect model experiment (Figure S8). However, the anomaly correlations for the initialized perfect model are positive for the first three to four lead years (Figure S9), suggesting that the lack of positive skill (measured as MSSS) beyond the first lead year may primarily be a consequence of different long‐term trends between the reference and prediction runs (Figure S1), affecting the conditional bias in equation (1).
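For reference, NINO3.4 conventionally denotes the sea surface temperature anomaly averaged over 5°S–5°N, 170°–120°W; a minimal area‐weighted sketch of such an index, assuming anomaly fields on a regular grid with longitudes in the 0–360° convention:

```python
import numpy as np

def nino34_index(sst_anom, lats, lons):
    """Area-weighted mean SST anomaly over the NINO3.4 box (5S-5N, 170W-120W).

    sst_anom : (time, lat, lon) array of SST anomalies
    lats     : 1-D latitudes in degrees north
    lons     : 1-D longitudes in degrees east (0-360 convention)
    """
    lat_sel = (lats >= -5.0) & (lats <= 5.0)
    lon_sel = (lons >= 190.0) & (lons <= 240.0)  # 170W-120W as 190E-240E
    box = sst_anom[:, lat_sel, :][:, :, lon_sel]
    weights = np.cos(np.deg2rad(lats[lat_sel]))  # approximate area weights
    return np.average(box.mean(axis=2), axis=1, weights=weights)
```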

We finally compare the prediction skill between the initialized perfect model and the decadal real‐world hindcasts to investigate how much our current initialized hindcasts could potentially be improved if perfect initialization were possible and the model and real world behaved consistently (comparing i and iii; Figure 3). As the real‐world hindcasts only have 14 initial years available, we restrict our analysis to the same number of predictions for the initialized perfect model. There are only very limited areas where the skill of the initialized hindcasts is significant in lead years 1 and 2, mainly in the North Atlantic and West Pacific, in agreement with previous decadal prediction studies (Doblas‐Reyes et al., 2013; van Oldenborgh et al., 2012). The skill increases in lead years 2–5 and 2–9 over most areas except the Pacific Ocean (Figures 3c and 3d). The initialized perfect model (Figures 3e–3h) shows higher skill than the initialized hindcasts in most regions in lead years 1 and 2 but still has some patches with negative skill in the eastern Pacific. The absence of skill over the eastern Pacific in lead years 1 and 2 for the hindcast case (Figures 3a and 3b) can be related to the small number of starting years and the incomplete initialization in the CCSM4 hindcast experiment

Figure 2. Same as Figure 1 but for the (a–d) uninitialized real‐world climate predictions (left column), the (e–h) uninitialized perfect model, and the (i–l) difference maps; the uninitialized perfect model shown here (Figures 2e–2h) is the same as Figures 1a–1d.


(only the ocean and sea ice components are initialized) and could also be due to initialization shock and model drift, which can be strong in the Pacific Ocean, giving rise to spurious ENSO variability (Sanchez‐Gomez et al., 2016; Yeager et al., 2018). Also note that prediction skill has improved in a newer version of the prediction system with improved initialization (Yeager et al., 2018). The percentage of the global area where the initialized perfect model significantly outperforms the initialized hindcasts is 38% for lead year 1, 17% for lead year 2, and about 50% for lead years 2–5 and 2–9 (Table 2).

As stated above, the perfect model results analyzed in Figures 1 and 2 of the main text are based on 46 initial years, whereas for the real‐world hindcasts only 14 initial years were available, as in Figure 3, which may make the above comparisons inconsistent. Therefore, in order to test the robustness of our results between the cases with 14 and 46 initial years, we show results based on only 14 initial years in Figures S2 and S3. The patterns of skill are very similar in both cases, whether using the larger or the smaller number of initial years, although in parts of the Pacific the larger sample of initialized runs is needed to obtain robust skill in the first two lead years. The patterns of skill improvement due to initialization likewise remain very similar.

Also, as HadCRUT4 is not globally complete, the percentage‐of‐area values given in Table 2 for Figures 2 and 3 are all based on the total area covered by HadCRUT4 and might differ if there were complete global observational coverage. In addition, it has been encouraged to consider more than one reference in forecast verification due to the uncertainty of observational data (Bellprat et al., 2017; Massonnet et al., 2016). Therefore, we repeated the analyses of real‐world prediction skill using another two sets of observational data: the National Oceanic and Atmospheric Administration's MLOST (Smith et al., 2008; Vose et al., 2012; Figures S4 and S5) and GISTEMP (Hansen et al., 2010; Figures S6 and S7). The results are very similar to those using HadCRUT4 as the reference and suggest that our conclusions are not sensitive to the choice of observational reference data set. Note that another source of uncertainty is due to the fact that most observational temperature data sets merge sea surface temperature over the ocean and near‐surface air temperature over land (Cowtan et al., 2015), which is not the case in the model, where near‐surface temperature is used over both land and ocean.

Figure 3. Same as Figure 1 but for the (a–d) real‐world decadal hindcasts and the (e–h) initialized perfect model (same as Figure 1, middle column, but only for 14 starting years).



3.2. Ensemble Spread

We use LESS to evaluate the ensemble spread. The results show that the ensemble spread is fairly similar between the initialized and uninitialized perfect‐model predictions, both showing values close to zero (i.e., not strongly underdispersive or overdispersive) in the first lead years, but larger areas with positive values (i.e., overdispersive) at the longer (e.g., 2–9 years) lead times (Figure 4). This indicates that the ensemble spread adequately depicts the variability of the historical run up to five lead years but becomes underconfident on the decadal scale. Furthermore, there is a higher percentage of areas showing a significantly negative difference in LESS in Figures 4i–4l according to Table 3, which indicates that initialization slightly reduces the ensemble spread at all lead times.

Both the uninitialized climate predictions and the initialized real‐world decadal hindcasts predicting observed temperatures appear underdispersive at low latitudes and overdispersive at high latitudes at all lead times (Figures 5 and 6). In addition, the colors are darker for these two cases (the leftmost columns of Figures 5 and 6) than for the perfect model cases (the middle columns of Figures 5 and 6), suggesting that the perfect model tends to be less underdispersive or overdispersive than the model predictions of the real world. Moreover, comparing the percentage of areas where the ensembles are underdispersive (blue colored) between the uninitialized climate predictions (Figures 5a–5d) and the initialized real‐world decadal hindcasts (Figures 6a–6d) shows more underdispersive areas in the hindcast case than in the uninitialized climate prediction case. This also supports the statement that initialization contributes to decreased ensemble spread.

Figure 4. The ensemble spread score maps of the mean near‐surface temperature for the Community Earth System Model (CESM) (a–d) uninitialized perfect model, (e–h) the initialized perfect model, and (i–l) the difference map between the left and middle columns. The first row shows the ensemble spread score (LESS) for lead year 1, the second row lead year 2, the third row lead years 2–5, and the fourth row lead years 2–9. Stippling indicates grid cells where the skill score or differences are significant at p ≤ 0.05 (two‐tailed), based on 200 bootstrap realizations.


Figure 5. Same as Figure 4 but for the (a–d) uninitialized real‐world climate predictions and the (e–h) uninitialized perfect model (same as Figure 4, left column).

Table 3. Changes of LESS in Three Pairs of Comparisons

LESS comparison                        Lead year   (1) Positive        (2) Significant          (3) Significant
                                                   difference (%)      positive difference (%)  negative difference (%)
Initialized perfect model versus       1           45                  8                        13
uninitialized perfect model            2           40                  6                        11
                                       2–5         39                  8                        24
                                       2–9         43                  20                       28
Uninitialized perfect model versus     1           54                  54                       46
uninitialized climate predictions      2           54                  54                       46
                                       2–5         60                  60                       37
                                       2–9         60                  59                       35
Initialized perfect model versus       1           76                  53                       11
decadal hindcasts                      2           70                  48                       12
                                       2–5         75                  54                       10
                                       2–9         66                  51                       15

Note. The percentage of areas for the logarithmic ensemble spread score (LESS) in three pairs of comparisons where (1) there is a positive difference, (2) there is a significantly positive difference, and (3) there is a significant negative difference. Grid cells are found significant at p ≤ 0.05 (two‐tailed), based on 200 bootstrap realizations. All percentage values are rounded to 1% accuracy, and values less than 3% are not shown, as it is expected that 2.5% of the areas could be significant by chance.



These findings suggest that the CESM model (regardless of whether initialized or not) appears overconfident in describing the real‐world variability at low latitudes and underconfident at high latitudes. Also, initialization helps to reduce ensemble dispersion.

4. Summary and Conclusions

In this study, we have set up a perfect model prediction experiment to identify the limits of achievable prediction skill with the CESM/CCSM4 model and compared this with the skill of the model in predicting the observed climate on interannual to decadal time scales. Within the perfect model setup, we compared a set of initialized versus uninitialized predictions to identify the added predictability from initializing the model simulations to specific states within the climate variability space. We find that initialization substantially improves the skill for the first two lead years in parts of the Southern Ocean, Indian Ocean, the tropical Pacific and North Atlantic, and some surrounding land areas. On longer time scales, however, the skill is very similar for initialized and uninitialized predictions, suggesting that predictability on these longer scales comes primarily from trends related to external forcing (Boer, 2010; Boer et al., 2013; Meehl et al., 2014). Substantial differences in the warming trend between the model and real‐world observations lower the decadal skill for real‐world predictions, in particular in the Pacific Ocean. This suggests that improving the simulation of long‐term variability (or responses to forcing), especially in the Pacific, would be one of the major areas of model improvement with regard to decadal predictability.

Previous studies investigating the real‐world prediction skill of temperature hindcasts for other models, such as DePreSys (Smith et al., 2007) or other models providing decadal predictions within CMIP5, show broadly similar skill patterns to the CCSM4 decadal hindcasts, in both the initialized and uninitialized cases (Doblas‐Reyes et al., 2013; Goddard et al., 2013; Guemas et al., 2013; van Oldenborgh et al., 2012).

Figure 6. Same as Figure 4 but for the (a–d) real‐world decadal hindcasts and the (e–h) initialized perfect model.


The absence of skill over the Pacific Ocean on decadal time scales (despite the ability of some systems to predict ENSO a few years in advance; Gonzalez & Goddard, 2016) is commonly found in these cases and is likely related to the models overestimating the warming trend and underestimating the magnitude of climate variability in comparison to observations, in particular in the Pacific (Power et al., 2016), which could be related to biases in ocean mixing processes in climate models (Guemas et al., 2012). Therefore, improvements in the simulation of long‐term behavior may also improve decadal predictions here, which could have ramifications for predictability in other regions, as the Pacific is particularly strong in its signal of teleconnections, both to the other tropical basins as well as to the midlatitudes and high latitudes (Kosaka & Xie, 2013; Meehl et al., 2016; Turner, 2004).

Our initialized perfect model skill patterns are quite similar to the potential correlation skill patterns investigated by Boer et al. (2013), although using different models, different metrics, and different approaches. The ability to predict responses to external forcings was also estimated in their paper, and the results showed that the potential prediction skill of the externally forced component of temperature is largest at tropical latitudes, which is mostly consistent with our uninitialized perfect model results for lead years 2–5, but less so for lead years 2–9.

Similar perfect‐model experiments with larger ensemble sizes could be performed to confirm the robustness of the skill estimates, assess the reliability of these predictions, and possibly inform about optimal ensemble sizes for interannual to decadal predictions. In this context, it may be worthwhile to compare the much more extensive hindcast experiment of Yeager et al. (2018) against a similar setup of perfect‐model predictions, to identify the limit of achievable skill also in the context of a larger ensemble and a substantially improved prediction system.

The exact patterns of predictability are likely model dependent. We therefore encourage other modeling groups to perform similar sets of perfect‐model experiments, to be compared with decadal hindcasts, using a range of other models. It will then be more likely that we can identify the most robust characteristics of predictability in the climate system, without the confounding issue of model dependence. Furthermore, exploring the differences may help to identify the model components that contribute most to poor predictability, thus helping to pinpoint the most pressing areas for model improvement.

References
Bellprat, O., Massonnet, F., Siegert, S., Prodhomme, C., Macias‐Gómez, D., Guemas, V., & Doblas‐Reyes, F. (2017). Uncertainty propagation in observational references to climate model scales. Remote Sensing of Environment, 203, 101–108. https://doi.org/10.1016/j.rse.2017.06.034
Boer, G. J. (2010). Decadal potential predictability of twenty‐first century climate. Climate Dynamics, 36, 1119–1133.
Boer, G. J., Kharin, S., & Merryfield, W. (2013). Decadal predictability and forecast skill. Climate Dynamics, 41(7‐8), 1817–1833. https://doi.org/10.1007/s00382‐013‐1705‐0
Branstator, G., & Teng, H. (2010). Two limits of initial‐value decadal predictability in a CGCM. Journal of Climate, 23(23), 6292–6311. https://doi.org/10.1175/2010JCLI3678.1
Collins, M. (2002). Climate predictability on interannual to decadal time scales: The initial value problem. Climate Dynamics, 19(8), 671–692. https://doi.org/10.1007/s00382‐002‐0254‐8
Collins, M., Botzet, M., Carril, A. F., Drange, H., Jouzeau, A., Latif, M., et al. (2006). Interannual to decadal climate predictability in the North Atlantic: A multimodel‐ensemble study. Journal of Climate, 19(7), 1195–1203. https://doi.org/10.1175/JCLI3654.1
Cowtan, K., Hausfather, Z., Hawkins, E., Jacobs, P., Mann, M. E., Miller, S. K., et al. (2015). Robust comparison of climate models with observations using blended land air and ocean sea surface temperatures. Geophysical Research Letters, 42, 6526–6534. https://doi.org/10.1002/2015GL064888
Doblas‐Reyes, F. J., Andreu‐Burillo, I., Chikamoto, Y., García‐Serrano, J., Guemas, V., Kimoto, M., et al. (2013). Initialized near‐term regional climate change prediction. Nature Communications, 4(1), 1715. https://doi.org/10.1038/ncomms2704
Doblas‐Reyes, F. J., Balmaseda, M. A., Weisheimer, A., & Palmer, T. N. (2011). Decadal climate prediction with the European Centre for Medium‐Range Weather Forecasts coupled forecast system: Impact of ocean observations. Journal of Geophysical Research, 116, D19111. https://doi.org/10.1029/2010JD015394
Dunstone, N., Smith, D., Scaife, A., Hermanson, L., Eade, R., Robinson, N., et al. (2016). Skilful predictions of the winter North Atlantic Oscillation one year ahead. Nature Geoscience, 9(11), 809–814. https://doi.org/10.1038/ngeo2824
England, M. H., McGregor, S., Spence, P., Meehl, G. A., Timmermann, A., Cai, W., et al. (2014). Recent intensification of wind‐driven circulation in the Pacific and the ongoing warming hiatus. Nature Climate Change, 4(3), 222–227. https://doi.org/10.1038/nclimate2106
Gent, P. R., Danabasoglu, G., Donner, L. J., Holland, M. M., Hunke, E. C., Jayne, S. R., et al. (2011). The Community Climate System Model Version 4. Journal of Climate, 24(19), 4973–4991. https://doi.org/10.1175/2011JCLI4083.1
Goddard, L., Kumar, A., Solomon, A., Smith, D., Boer, G., Gonzalez, P., et al. (2013). A verification framework for interannual‐to‐decadal predictions experiments. Climate Dynamics, 40(1‐2), 245–272. https://doi.org/10.1007/s00382‐012‐1481‐2
Gonzalez, P. L. M., & Goddard, L. (2016). Long‐lead ENSO predictability from CMIP5 decadal hindcasts. Climate Dynamics, 46(9‐10), 3127–3147. https://doi.org/10.1007/s00382‐015‐2757‐0

10.1029/2018JD029541Journal of Geophysical Research: Atmospheres

LIU ET AL. 2894

AcknowledgmentsThis study was supported by theAustralian Research Council grantsCE110001028, CE170100023,DE150100456, and DP150101331; theEarth Science and Climate Change Hubof the Australian Government'sNational Environmental ScienceProgramme (NESP) of the Departmentof the Environment and Energy; theAustralia‐Germany Joint Research Co‐operation (German AcademicExchange Service, DAAD) grant ID57219579; and the H2020 EUCP projectfrom the European Commission (EC)(GA 776613). We thank Harry H.Hendon (Bureau of Meteorology) forvaluable suggestions on this study. Wethank the climate modeling groupscontributing to CMIP5 for producingand making available their model out-put. We also acknowledge the Miklip(Mittelfristige Klimaprognose, https://www.fona‐miklip.de/) EvaluationSystem hosted by the German ClimateComputing Center (DeutschesKlimarechenzentrum, DKRZ), whichwas used to perform part of the ana-lyzes; We particularly thankChristopher Kadow and SebastianIlling (both at Freie Universität Berlin)for their support to the use of the Miklipevaluation tools. CMIP5 model outputdata are publicly available at http://cmip5.whoi.edu/ or https://pcmdi9.llnl.gov/. Temperature observation data areaccessible at https://climatedataguide.ucar.edu/climate‐data/global‐tempera-ture‐data‐sets‐overview‐comparison‐table. Our model output data are avail-able through doi: https://dx.doi.org/10.25914/5b9fa8780fdfa on theNational Computational Infrastructurein Australia.

Page 14: University of New South Wales - RESEARCH ARTICLE A …web.science.unsw.edu.au/~andrea/papers/Liu_etal_2019.pdf · 2019-05-09 · A Framework to Determine the Limits of Achievable

Griffies, S. M., & Bryan, K. (1997). Predictability of North Atlantic multidecadal climate variability. Science, 275(5297), 181-184. https://doi.org/10.1126/science.275.5297.181
Grötzner, A., Latif, M., Timmermann, A., & Voss, R. (1999). Interannual to decadal predictability in a coupled ocean-atmosphere general circulation model. Journal of Climate, 12(8), 2607-2624. https://doi.org/10.1175/1520-0442(1999)012<2607:ITDPIA>2.0.CO;2
Guemas, V., Corti, S., García-Serrano, J., Doblas-Reyes, F. J., Balmaseda, M., & Magnusson, L. (2013). The Indian Ocean: The region of highest skill worldwide in decadal climate prediction. Journal of Climate, 26(3), 726-739. https://doi.org/10.1175/JCLI-D-12-00049.1
Guemas, V., Doblas-Reyes, F. J., Lienert, F., Soufflet, Y., & Du, H. (2012). Identifying the causes of the poor decadal climate prediction skill over the North Pacific. Journal of Geophysical Research, 117, D20111. https://doi.org/10.1029/2012JD018004
Hansen, J., Ruedy, R., Sato, M., & Lo, K. (2010). Global surface temperature change. Reviews of Geophysics, 48, RG4004. https://doi.org/10.1029/2010RG000345
Hawkins, E., Tietsche, S., Day, J. J., Melia, N., Haines, K., & Keeley, S. (2016). Aspects of designing and evaluating seasonal-to-interannual Arctic sea-ice prediction systems. Quarterly Journal of the Royal Meteorological Society, 142(695), 672-683. https://doi.org/10.1002/qj.2643
Hawkins, E., & Sutton, R. (2009). The potential to narrow uncertainty in regional climate predictions. Bulletin of the American Meteorological Society, 90(8), 1095-1108. https://doi.org/10.1175/2009BAMS2607.1
Hunke, E. C., & Lipscomb, W. H. (2008). CICE: The Los Alamos sea ice model user's manual, version 4. Los Alamos National Laboratory Tech. Rep. LA-CC-06-012 (76 pp.).
Illing, S., Kadow, C., Oliver, K., & Cubasch, U. (2014). MurCSS: A tool for standardized evaluation of decadal hindcast systems. Journal of Open Research Software, 2(1), e24.
Kadow, C., Illing, S., Kunst, O., Rust, H. W., Pohlmann, H., Müller, W. A., & Cubasch, U. (2016). Evaluation of forecasts by accuracy and spread in the MiKlip decadal climate prediction system. Meteorologische Zeitschrift, 25(6), 631-643. https://doi.org/10.1127/metz/2015/0639
Kajtar, J. B., Santoso, A., McGregor, S., England, M. H., & Baillie, Z. (2017). Model under-representation of decadal Pacific trade wind trends and its link to tropical Atlantic bias. Climate Dynamics, 50, 1471-1484.
Kosaka, Y., & Xie, S.-P. (2013). Recent global-warming hiatus tied to equatorial Pacific surface cooling. Nature, 501(7467), 403-407. https://doi.org/10.1038/nature12534
Lawrence, D. M., Oleson, K. W., Flanner, M. G., Thornton, P. E., Swenson, S. C., Lawrence, P. J., et al. (2011). Parameterization improvements and functional and structural advances in version 4 of the Community Land Model. Journal of Advances in Modeling Earth Systems, 3, M03001. https://doi.org/10.1029/2011MS00045
Livezey, R. E., & Chen, W. Y. (1983). Statistical field significance and its determination by Monte Carlo techniques. Monthly Weather Review, 111(1), 46-59. https://doi.org/10.1175/1520-0493(1983)111<0046:SFSAID>2.0.CO;2
Lorenz, E. N. (1969). Atmospheric predictability as revealed by naturally occurring analogues. Journal of the Atmospheric Sciences, 26, 636-646.
MacLachlan, C., Arribas, A., Peterson, K. A., Maidens, A., Fereday, D., Scaife, A. A., et al. (2015). Global seasonal forecast system version 5 (GloSea5): A high-resolution seasonal forecast system. Quarterly Journal of the Royal Meteorological Society, 141(689), 1072-1084. https://doi.org/10.1002/qj.2396
Marotzke, J., Müller, W. A., Vamborg, F. S. E., Becker, B., Cubasch, U., Feldmann, H., et al. (2016). MiKlip: A national research project on decadal climate prediction. Bulletin of the American Meteorological Society, 97(12), 2379-2394. https://doi.org/10.1175/BAMS-D-15-00184.1
Massonnet, F., Bellprat, O., Guemas, V., & Doblas-Reyes, F. J. (2016). Using climate models to estimate the quality of global observational data sets. Science, 354(6311), 452-455. https://doi.org/10.1126/science.aaf6369
McGregor, S., Stuecker, M. F., Kajtar, J. B., England, M. H., & Collins, M. (2018). Model tropical Atlantic biases underpin diminished Pacific decadal variability. Nature Climate Change, 8(6), 493-498. https://doi.org/10.1038/s41558-018-0163-4
McPhaden, M. J., Zebiak, S. E., & Glantz, M. H. (2006). ENSO as an integrating concept in Earth science. Science, 314, 1740-1745.
Meehl, G. A., Arblaster, J. M., Bitz, C. M., Chung, C. T. Y., & Teng, H. (2016). Antarctic sea-ice expansion between 2000 and 2014 driven by tropical Pacific decadal climate variability. Nature Geoscience, 9(8), 590-595. https://doi.org/10.1038/ngeo2751
Meehl, G. A., Goddard, L., Boer, G., Burgman, R., Branstator, G., Cassou, C., et al. (2014). Decadal climate prediction: An update from the trenches. Bulletin of the American Meteorological Society, 95(2), 243-267. https://doi.org/10.1175/BAMS-D-12-00241.1
Morice, C. P., Kennedy, J. J., Rayner, N. A., & Jones, P. D. (2012). Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 data set. Journal of Geophysical Research, 117, D08101. https://doi.org/10.1029/2011JD017187
Mulholland, D., Laloyaux, P., Haines, K., & Balmaseda, M. A. (2015). Origin and impact of initialization shocks in coupled atmosphere-ocean forecasts. Monthly Weather Review, 143(11), 4631-4644. https://doi.org/10.1175/MWR-D-15-0076.1
Murphy, A. H. (1988). Skill scores based on the mean-square error and their relationships to the correlation coefficient. Monthly Weather Review, 116, 2417-2425.
Murphy, A. H., & Epstein, E. S. (1989). Skill scores and correlation coefficients in model verification. Monthly Weather Review, 117, 572-581.
Neale, R., Richter, J., Park, S., Lauritzen, P. H., Vavrus, S. J., Rasch, P. J., & Zhang, M. (2013). The mean climate of the Community Atmosphere Model (CAM4) in forced SST and fully coupled experiments. Journal of Climate, 26(14), 5150-5168. https://doi.org/10.1175/JCLI-D-12-00236.1
van Oldenborgh, G. J., Doblas-Reyes, F. J., Wouters, B., & Hazeleger, W. (2012). Decadal prediction skill in a multi-model ensemble. Climate Dynamics, 38(7-8), 1263-1280. https://doi.org/10.1007/s00382-012-1313-4
Pohlmann, H., Botzet, M., Latif, M., Roesch, A., Wild, M., & Tschuck, P. (2004). Estimating the decadal predictability of a coupled AOGCM. Journal of Climate, 17(22), 4463-4472. https://doi.org/10.1175/3209.1
Pohlmann, H., Müller, W. A., Kulkarni, K., Kameswarrao, M., Matei, D., Vamborg, F. S. E., et al. (2013). Improved forecast skill in the tropics in the new MiKlip decadal climate predictions. Geophysical Research Letters, 40, 5798-5802. https://doi.org/10.1002/2013GL058051
Power, S., Delage, F., Wang, G., Smith, I., & Kociuba, G. (2016). Apparent limitations in the ability of CMIP5 climate models to simulate recent multi-decadal change in surface temperature: Implications for global temperature projections. Climate Dynamics, 49, 53-69.
Sanchez-Gomez, E., Cassou, C., Ruprich-Robert, Y., Fernandez, E., & Terray, L. (2016). Drift dynamics in a coupled model initialized for decadal forecasts. Climate Dynamics, 46(5-6), 1819-1840. https://doi.org/10.1007/s00382-015-2678-y

Schlesinger, M. E., & Ramankutty, N. (1994). An oscillation in the global climate system of period 65-70 years. Nature, 367(6465), 723-726. https://doi.org/10.1038/367723a0
Smith, D. M., Cusack, S., Colman, A. W., Folland, C. K., Harris, G. R., & Murphy, J. M. (2007). Improved surface temperature prediction for the coming decade from a global climate model. Science, 317, 796-799.
Smith, R. D., Jones, P., Briegleb, B., Bryan, F., Danabasoglu, G., Dennis, J., et al. (2010). The Parallel Ocean Program (POP) reference manual. Los Alamos National Laboratory Tech. Rep. LAUR-10-01853 (140 pp.).
Smith, T. M., Reynolds, R. W., Peterson, T. C., & Lawrimore, J. (2008). Improvements to NOAA's historical merged land-ocean surface temperature analysis (1880-2006). Journal of Climate, 21(10), 2283-2296. https://doi.org/10.1175/2007JCLI2100.1
Taylor, K. E., Stouffer, R. J., & Meehl, G. A. (2012). An overview of CMIP5 and the experiment design. Bulletin of the American Meteorological Society, 93(4), 485-498. https://doi.org/10.1175/BAMS-D-11-00094.1
Teng, H., Meehl, G. A., Branstator, G., Yeager, S., & Karspeck, A. (2017). Initialization shock in CCSM4 decadal prediction experiments. Past Global Changes Magazine, 25(1), 41-46. https://doi.org/10.22498/pages.25.1.41
Turner, J. (2004). The El Niño-Southern Oscillation and Antarctica. International Journal of Climatology, 24(1), 1-31. https://doi.org/10.1002/joc.965
Vose, R. S., Arndt, D., Banzon, V. F., Easterling, D. R., Gleason, B., Huang, B., et al. (2012). NOAA's merged land-ocean surface temperature analysis. Bulletin of the American Meteorological Society, 93(11), 1677-1685. https://doi.org/10.1175/BAMS-D-11-00241.1
Yeager, S., Karspeck, A., Danabasoglu, G., Tribbia, J., & Teng, H. (2012). A decadal prediction case study: Late twentieth-century North Atlantic Ocean heat content. Journal of Climate, 25(15), 5173-5189. https://doi.org/10.1175/JCLI-D-11-00595.1
Yeager, S. G., Danabasoglu, G., Rosenbloom, N. A., Strand, W., Bates, S. C., Meehl, G. A., et al. (2018). Predicting near-term changes in the Earth system: A large ensemble of initialized decadal prediction simulations using the Community Earth System Model. Bulletin of the American Meteorological Society, 99(9), 1867-1886. https://doi.org/10.1175/BAMS-D-17-0098.1

Supplementary Information: A framework to determine the limits of achievable skill for interannual to decadal climate predictions

Yiling Liu, Markus G. Donat, Andréa S. Taschetto, Francisco J. Doblas-Reyes, Lisa V. Alexander, and Matthew H. England

This document contains additional figures to support our discussions in the main text of the manuscript.


Figure S1: The warming trend (or warming trend difference) of near-surface temperature from 1961 to 2005: (a) for the 'perfect model' temperature reference; (b) for the HadCRUT4-median dataset; (c) the difference between (a) and (b); (d)(g)(j)(m)(p) are for CCSM4 r1i1p1, r2i1p1, r4i1p1, r5i1p1, and r6i1p1, respectively; (e)(h)(k)(n)(q) are the warming trend differences between (d)(g)(j)(m)(p) and (b); and (f)(i)(l)(o)(r) are the warming trend differences between (d)(g)(j)(m)(p) and (c).
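For orientation, the following is a minimal sketch of how such trend maps can be computed; this is not the code used in this study, and the array names and shapes are illustrative assumptions.

    import numpy as np

    def warming_trend(temp, years):
        """Least-squares linear trend at every grid cell, in K per decade.

        temp  : array of shape (n_years, n_lat, n_lon) of annual means
        years : 1-D array of the corresponding years (e.g., 1961-2005)
        """
        n_years, n_lat, n_lon = temp.shape
        # np.polyfit fits all grid cells at once when y is 2-D (time, space)
        slopes = np.polyfit(years, temp.reshape(n_years, -1), deg=1)[0]
        return slopes.reshape(n_lat, n_lon) * 10.0  # K/yr -> K/decade

    # A trend-difference panel is then simply the difference of two
    # such maps, e.g. warming_trend(model_temp, yrs) - warming_trend(obs_temp, yrs).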


Figure S2: Same as Figure 1, but based on the same 14 start years for which real-world decadal hindcasts were available.


Figure S3: Same as Figure 2 but based on the same 14 start years for which real-world decadal hindcasts were available.


Figure S4: Same as Figure 2 but using MLOST data as the observational reference.


Figure S5: Same as Figure 3 but using MLOST data as the observational reference.


Figure S6: Same as Figure 2 but using GISTEMP data as the observational reference.


Figure S7: Same as Figure 3 but using GISTEMP data as the observational reference.


Figure S8: MSSS for the NINO3.4 index (top: a, b, c, d) and the AMO index (Trenberth and Shea 2006; bottom: e, f, g, h) in the different experiments: (a, e) initialized perfect model, (b, f) uninitialized perfect model, (c, g) uninitialized real-world climate prediction, (d, h) initialized real-world hindcasts. The blue lines indicate the skill for the individual lead years; dashed and dotted lines indicate the skill for the 2-5 and 2-9 year averages, respectively. Error bars denote the 5th to 95th percentile range of MSSS based on 200 bootstrap realizations.
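As a pointer to how the quantities in this figure can be reproduced, here is a minimal sketch (not the authors' MurCSS-based evaluation pipeline; all variable names are illustrative) of the MSSS (Murphy, 1988), the anomaly correlation used in Figure S9 below, and the bootstrap percentile range behind the error bars, given index time series aligned over the verification start dates.

    import numpy as np

    def msss(forecast, reference, obs):
        # Mean squared error skill score (Murphy, 1988):
        # 1 - MSE(forecast, obs) / MSE(reference, obs)
        return 1.0 - np.mean((forecast - obs) ** 2) / np.mean((reference - obs) ** 2)

    def anomaly_correlation(forecast, obs):
        # Pearson correlation between forecast and observed anomalies
        fa, oa = forecast - forecast.mean(), obs - obs.mean()
        return np.sum(fa * oa) / np.sqrt(np.sum(fa ** 2) * np.sum(oa ** 2))

    def bootstrap_range(score, *series, n_boot=200, q=(5, 95), seed=1):
        # Percentile range of a skill score obtained by resampling the
        # verification start dates with replacement (200 realizations,
        # matching the error bars in Figures S8 and S9)
        rng = np.random.default_rng(seed)
        n = series[0].size
        scores = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)
            scores.append(score(*(s[idx] for s in series)))
        return np.percentile(scores, q)

For example, bootstrap_range(msss, hindcast_nino34, reference_nino34, obs_nino34) would return the 5th and 95th percentiles of MSSS for one lead time; the reference forecast may be a climatology or, when quantifying the added value of initialization, the uninitialized ensemble mean.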

Figure S9: As Figure S8 but for anomaly correlations.

References

Trenberth, K. E., & Shea, D. J. (2006). Atlantic hurricanes and natural variability in 2005. Geophysical Research Letters, 33, L12704. https://doi.org/10.1029/2006GL026894