Regional Flood Frequency Analysis in the
Range of Small to Large Floods: Development
and Testing of Bayesian Regression-based
Approaches
Khaled Haddad
A thesis submitted for the degree of Doctor of Philosophy at the
University of Western Sydney, Sydney, Australia
June 2013
PRELIMINARIES
ii
ABSTRACT
Design flood estimation in the range of frequent to medium (2–100 years) and large to
rare (greater than 100 and up to 2000 years) average recurrence intervals (ARI) is
frequently required in the design of many engineering works, such as culverts, bridges,
farm dams and spillways, as well as in land use planning and flood insurance studies.
These sorts of infrastructure works and investigations are of notable economic significance.
Design flood estimation is ideally made by adopting a flood frequency analysis technique;
however, this needs a relatively long period of recorded streamflow data. In many cases,
the recorded streamflow data are quite short or completely absent (i.e. the ungauged
catchment situation). In such cases, regional flood frequency analysis (RFFA) techniques
are usually adopted, which attempt to utilise spatial data to compensate for the lack of
temporal data on the assumption of regional homogeneity.
This thesis focuses on RFFA techniques, in particular how they can be enhanced by
adopting an ensemble of advanced statistical techniques and by minimising the error and
noise often found in flood data. This thesis uses data from 682 catchments across the
Australian continent to (i) develop prediction equations involving readily obtainable
catchment characteristics data for floods in the frequent to medium ARI range (2–100
years); (ii) investigate the validation of the developed prediction equations using the most
commonly used leave-one-out (LOO) validation and compare it with the more recent
Monte Carlo cross validation (MCCV) technique; and (iii) develop a large flood
regionalisation model (LFRM) that corrects for spatial dependence in the annual maximum
flood series (AMFS) data for flood estimation in the large to rare flood range
(100–2000 years ARI).
The first part of this thesis advocates the use of regression-based RFFA methods under the
Bayesian generalised least squares regression (BGLSR) framework. Here, the BGLSR has
been developed and tested with the quantile regression technique (QRT) and the parameter
regression technique (PRT) using 452 catchments from the east coast of Australia (namely
New South Wales (NSW), Victoria, Queensland and Tasmania). In forming the regions,
both the fixed region and region of influence (ROI) approaches have been examined in the
range of frequent to medium ARI floods.
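The BGLSR framework described above builds on the Stedinger–Tasker formulation, in which the total error covariance combines a model-error variance with the sampling-error covariance of the at-site flood statistics. A minimal numerical sketch of the underlying GLS estimator is given below; the catchment data are hypothetical and the full Bayesian treatment of the model-error variance used in the thesis is not reproduced here.

```python
import numpy as np

def gls_estimate(X, y, sampling_cov, model_error_var):
    """Generalised least squares estimate of the regression coefficients
    (Stedinger-Tasker form): the error covariance is the sum of a
    diagonal model-error variance and the sampling-error covariance
    of the at-site flood statistics."""
    n = len(y)
    # Total error covariance: model error + sampling error
    Lam = model_error_var * np.eye(n) + sampling_cov
    Lam_inv = np.linalg.inv(Lam)
    # GLS normal equations: beta = (X' L^-1 X)^-1 X' L^-1 y
    return np.linalg.solve(X.T @ Lam_inv @ X, X.T @ Lam_inv @ y)

# Toy example: log flood quantile vs log catchment area for five
# hypothetical catchments
X = np.column_stack([np.ones(5), np.log([50, 120, 300, 800, 1500])])
y = np.array([2.1, 2.6, 3.0, 3.5, 3.9])        # log10 flood quantiles
S = np.diag([0.02, 0.01, 0.015, 0.03, 0.02])   # sampling error variances
beta = gls_estimate(X, y, sampling_cov=S, model_error_var=0.01)
```

In practice the sampling covariance matrix is built from the record lengths and inter-site correlations of the gauged sites, and the model-error variance is inferred rather than fixed.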
A LOO validation indicated that the ROI approach based on minimising the predictive
uncertainty leads to more efficient and accurate flood quantile estimates in both the QRT
and PRT regional frameworks. The regression diagnostics reveal that the catchment
characteristics variables alone may not capture all the heterogeneity in the regional model,
and that the formation of ROI sub-regions can reduce the heterogeneity to an acceptable level.
Both the BGLSR-based QRT-ROI and PRT-ROI methods reduce regional
heterogeneity, with an increase in the average pseudo coefficient of determination and
decreases in the model error variance, the average variance of prediction and the average
standard error of prediction. Based on the evaluation statistics, overall only modest
differences have been found between the QRT-ROI and PRT-ROI regional
frameworks. The developed RFFA methods based on QRT-ROI and PRT-ROI allow
design flood estimation, along with the associated uncertainty (in the form of confidence
limits), to be made with a relatively high degree of accuracy.
The second part of this thesis looks at the detailed validation of regional hydrological
regression models by investigating the popular LOO validation and the relatively new
MCCV procedure using 96 catchments from the state of NSW. In this regard, both
ordinary least squares regression (OLSR) and GLSR have been tested for the estimation of
flood quantiles using simulated and observed regional flood data. From the simulation and
real-data examples, it has been found that when developing regional hydrologic regression
models, the GLSR-based MCCV procedure is likely to result in a more parsimonious
model than the OLSR-based LOO, OLSR-based MCCV and GLSR-based LOO
procedures.
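The distinction between the two validation schemes can be sketched as follows: LOO withholds one catchment at a time, whereas MCCV repeatedly draws random train/validation splits with a larger validation fraction, which tends to penalise over-fitted models more heavily. The illustration below uses ordinary least squares on synthetic data; the GLSR weighting applied in the thesis is omitted for brevity, and all data here are hypothetical.

```python
import numpy as np

def fit_predict(X_tr, y_tr, X_te):
    """OLS fit via least squares; returns predictions on the test design."""
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ beta

def loo_msep(X, y):
    """Leave-one-out: each catchment is withheld exactly once."""
    errs = []
    for i in range(len(y)):
        mask = np.ones(len(y), dtype=bool)
        mask[i] = False
        errs.append((y[i] - fit_predict(X[mask], y[mask], X[~mask]))[0] ** 2)
    return np.mean(errs)

def mccv_msep(X, y, n_splits=200, test_frac=0.3, seed=1):
    """Monte Carlo cross-validation: repeated random train/test splits
    with a larger validation fraction than LOO."""
    rng = np.random.default_rng(seed)
    n, n_te = len(y), int(test_frac * len(y))
    errs = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        te, tr = idx[:n_te], idx[n_te:]
        errs.append(np.mean((y[te] - fit_predict(X[tr], y[tr], X[te])) ** 2))
    return np.mean(errs)

# Hypothetical regional dataset: log flood quantile from two predictors
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(96), rng.normal(size=(96, 2))])
y = X @ np.array([3.0, 0.7, 0.3]) + rng.normal(scale=0.2, size=96)
print(loo_msep(X, y), mccv_msep(X, y))
```

Comparing the two estimates of the mean squared error of prediction across candidate predictor sets is what drives the model-selection comparison reported above.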
The third part of this thesis proposes a simple LFRM that accounts for spatial dependence
in the AMFS data for estimating large to rare floods. To carry this out, a comprehensive
dataset of 654 stations from across the Australian continent has been used.
The new LFRM is easy to use and offers an alternative to the traditional rainfall-based
methods. The development and application of the simplified LFRM for the Australian
continent consist of three major steps: (i) pooling the top 1 to 5 annual maximum flood
values from the member sites in a region; (ii) developing a new spatial dependence model
to correct for spatially correlated data; and (iii) applying the LFRM to ungauged catchments
by coupling it with the BGLSR – ROI technique to estimate the mean and coefficient of
variation (CV) of AMFS data.
To this end a simple model for the effective number of independent stations (Ne) has been
developed that ignores possible variation with ARI. Meaningful results regarding spatial
dependence have been established by undertaking the analysis on simulated datasets to
counteract sampling and homogeneity issues.
Overall, the experimental results of the analysis show that, in general, spatial dependence
decreases with larger network size and that some Australian states exhibit more spatial
dependence than others. While there are some limitations with this analysis, a reasonable
indication of the behaviour of Ne has been established. The derived generalised spatial
dependence model has then been used with the LFRM to correct for the spatial dependence
by adjusting the plotting position points of the LFRM frequency distribution curve.
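To illustrate how an effective number of independent stations can feed into a plotting-position adjustment, the sketch below uses the standard equicorrelation approximation Ne = N / (1 + (N − 1)ρ̄) together with a Cunnane-type plotting position evaluated with the dependence-adjusted record length. Both forms are assumed here for illustration only; they are not the generalised spatial dependence model fitted in the thesis.

```python
import numpy as np

def effective_stations(n, rho_bar):
    """Equicorrelation approximation to the effective number of
    independent stations in a network of n sites with average
    inter-site correlation rho_bar (illustrative assumption only)."""
    return n / (1.0 + (n - 1.0) * rho_bar)

def plotting_positions(n_pooled, ne_years, alpha=0.4):
    """Cunnane-type plotting positions for the largest n_pooled pooled
    maxima, using the effective (dependence-adjusted) record length
    ne_years in place of the nominal number of station-years."""
    ranks = np.arange(1, n_pooled + 1)
    # Exceedance probability of the i-th largest pooled value
    return (ranks - alpha) / (ne_years + 1.0 - 2.0 * alpha)

ne = effective_stations(n=50, rho_bar=0.2)          # about 4.6 of 50 sites
p = plotting_positions(n_pooled=10, ne_years=ne * 30)
```

The effect of the correction is to stretch the empirical frequency curve: strongly correlated networks carry fewer effective station-years, so the largest pooled floods are assigned less extreme exceedance probabilities than under an independence assumption.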
An independent validation has shown that the developed LFRM is able to estimate design
floods for 100 to 1000 years ARIs with reasonable confidence as compared with at-site
flood frequency analysis results, other regional flood models and the world model. Overall,
the newly developed LFRM, which corrects for spatially correlated data and is coupled
with the BGLSR-ROI approach, offers a powerful yet simple method of regional flood
estimation in the large to rare ARI range.
COPYRIGHT STATEMENT
‘I hereby grant the University of Western Sydney or its agents the right to archive and to
make available my thesis or dissertation in whole or part in the University libraries in all
forms of media, now or hereafter known, subject to the provisions of the Copyright Act
1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in
future works (such as articles or books) all or part of this thesis or dissertation. I have either
used no substantial portions of copyright material in my thesis or I have obtained
permission to use copyright material; where permission has not been granted I have
applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’
Signed Khaled Haddad
STATEMENT OF AUTHENTICATION
‘I hereby declare that the work presented in this thesis is solely my own work and that to
the best of my knowledge the work is original except where otherwise indicated by
references to other authors or works. No part of this thesis has been submitted for any other
degree or diploma.’
Signed Khaled Haddad
ACKNOWLEDGMENTS
Firstly, I would like to acknowledge the contribution of my supervisor, Dr Ataur Rahman,
in providing direction, advice and encouragement over the last three and a half years. I
really appreciate your support.
I also appreciate the advice and friendship of other academics and researchers in the School
of Computing, Engineering and Mathematics at UWS. In particular, thanks to Associate
Professor Surendra Shrestha and Associate Professor Chin Leo.
The advice and friendship of colleagues from other universities and industry are also
gratefully acknowledged: in particular, Mr Erwin Weinmann of Monash University for his
constructive comments, valuable guidance, advice and encouragement throughout this
research, and Professor George Kuczera, Associate Professor James Ball, Mr Mark Babister,
Mr Robert French and Dr William Weeks for their suggestions and input. A
special thanks goes to Dr Nanda Nandakumar for his helpful advice on various aspects of
spatial dependence and large flood estimation.
I would also like to acknowledge the various government departments throughout Australia
that helped by providing the streamflow data for this study. Without their timely support
this research would not have been completed on time.
To my fellow PhD students, thanks for all your help, fun times and the support of knowing
we’re not alone through the ups and downs. Thank you to my parents and family for
teaching me the value of education and hard work, which gave me the confidence to
embark on this mission.
TABLE OF CONTENTS
ABSTRACT......................................................................................................................... II
COPYRIGHT STATEMENT............................................................................................ V
STATEMENT OF AUTHENTICATION........................................................................VI
ACKNOWLEDGMENTS ............................................................................................... VII
TABLE OF CONTENTS ...............................................................................................VIII
LIST OF FIGURES ........................................................................................................ XVI
LIST OF TABLES ........................................................................................................ XXII
COMMON NOTATIONS.............................................................................................XXV
ABBREVIATIONS.....................................................................................................XXVII
CHAPTER 1: INTRODUCTION....................................................................................... 1
1.1 GENERAL ................................................................................................................................1
1.2 BACKGROUND.......................................................................................................................1
1.3 THE NEED FOR THIS RESEARCH.......................................................................................7
1.4 RESEARCH QUESTIONS.......................................................................................................8
1.5 MAJOR TASKS........................................................................................................................9
1.6 CONTRIBUTIONS OF THIS RESEARCH TO THE UNDERSTANDING OF THE RFFA
PROBLEM....................................................................................................................................10
1.7 OUTLINE OF THE THESIS AND CHAPTER INTRODUCTIONS....................................10
CHAPTER 2: REVIEW OF REGIONAL FLOOD FREQUENCY ANALYSIS
TECHNIQUES, MODEL VALIDATION AND LARGE FLOODS ............................ 15
2.1 GENERAL ..............................................................................................................................15
2.2 BASIC ISSUES.......................................................................................................................15
2.2.1 REGIONAL FLOOD FREQUENCY ANALYSIS .............................................................15
2.2.2 REGIONAL HOMOGENEITY.........................................................................................16
2.2.3 INTER – SITE DEPENDENCE .......................................................................................17
2.2.4 DISTRIBUTIONAL CHOICES ........................................................................................18
2.3 METHODS FOR IDENTIFICATION OF HOMOGENEOUS REGIONS............................19
2.4 REGIONAL FLOOD FREQUENCY ANALYSIS METHODS – DIFFERENT
APPROACHES.............................................................................................................................21
2.4.1 INDEX FLOOD METHOD..............................................................................................21
2.4.2 STATION YEAR METHOD .............................................................................................24
2.4.3 BAYESIAN ANALYSIS AND MONTE CARLO METHODS ............................................24
2.4.4 PROBABILISTIC RATIONAL METHOD AS USED IN AUSTRALIA .............................25
2.5 QUANTILE AND PARAMETER REGRESSION TECHNIQUES ......................................27
2.5.1 INTRODUCTION ............................................................................................................27
2.5.2 GENERALISED LEAST SQUARES AND WEIGHTED LEAST SQUARES REGRESSION
..................................................................................................................................................29
2.5.3 PREVIOUS APPLICATION OF GENERALISED LEAST SQUARES AND BAYESIAN
GENERALISED LEAST SQUARES REGRESSION .................................................................30
2.6 FIXED REGIONS AND THE REGION OF INFLUENCE IN REGIONAL FLOOD
FREQUENCY ANANALYS........................................................................................................33
2.6.1 FORMATION OF REGIONS...........................................................................................33
2.6.2 REGION OF INFLUENCE VS FLEXIBLE REGION......................................................33
2.7 MODEL VALIDATION IN HYDROLOGICAL REGRESSION ANALYSIS.....................36
2.7.1 HISTORY OF MODEL VALIDATION ............................................................................37
2.7.2 PREVIOUS APPLICATIONS OF LEAVE-ONE-OUT VALIDATION IN HYDROLOGY38
2.8 REGIONAL FLOOD FREQUENCY FOR LARGE TO RARE FLOODS............................40
2.8.1 BRIEF REVIEW OF LARGE FLOOD ESTIMATION AND PREVIOUS
APPLICATIONS .......................................................................................................................40
2.9 IMPACT OF CLIMATE CHANGE ON FLOOD FREQUENCY ANALYSIS.....................44
2.10 SUMMARY ..........................................................................................................................45
CHAPTER 3: ADOPTED STATISTICAL TECHNIQUES FOR REGIONAL
FLOOD FREQUENCY ANALYSIS AND MODEL VALIDATION........................... 47
3.1 GENERAL ..............................................................................................................................47
3.2 AT-SITE FLOOD FREQUENCY ANALYSIS......................................................................49
3.2.1 BASICS OF AT-SITE FLOOD FREQUENCY ANALYSIS...............................................49
3.2.2 FLIKE SOFTWARE FOR AT-SITE FFA .........................................................................50
3.2.3 LOG PEARSON TYPE 3 (LP3) DISTRIBUTION............................................................50
3.3 THE CLASSICAL GLS REGRESSION PROBLEM ............................................................51
3.3.1 GLSR, THE STEDINGER AND TASKER MODEL .........................................................53
3.4 BAYESIAN METHODOLOGY.............................................................................................55
3.4.1 CLASSICAL BAYESIAN INFERENCE............................................................................56
3.5 BAYESIAN GLS REGRESSION ..........................................................................................56
3.5.1 APPROACH ADOPTED IN THIS STUDY FOR THE QUANTILE AND PARAMETER
REGRESSION TECHNIQUES .................................................................................................56
3.5.2 ADOPTED BAYESIAN REGRESSION APPROACH – PRIOR FOR THE β
COEFFICIENTS.......................................................................................................................59
3.5.3 ANALYTICAL SOLUTION TO BAYESIAN APPROACH FOR THE POSTERIOR OF
THE MODEL ERROR VARIANCE...........................................................................................60
3.5.4 PRIORS FOR THE PARAMETERS AND THE QUANTILES OF THE LP3
DISTRIBUTION........................................................................................................................62
3.6 SELECTING PREDICTOR VARIABLES ............................................................................64
3.6.1 AVERAGE VARIANCE OF PREDICTION .....................................................................64
3.6.2 BAYESIAN AND AKAIKE INFORMATION CRITERIA..................................................65
3.6.3 BAYESIAN PLAUSIBILITY VALUE .................................................................65
3.6.4 COEFFICIENT OF DETERMINATION .........................................................................66
3.6.5 OTHER MODEL SELECTION CRITERIA......................................................................66
3.7 FORMATION OF REGIONS.................................................................................................67
3.8 REGRESSION DIAGNOSTICS.............................................................................................69
3.8.1 STANDARD ERROR OF PREDICTION .........................................................................70
3.8.2 RESIDUAL ANALYSIS ....................................................................................................70
3.8.3 COOK’S DISTANCE .......................................................................................................71
3.9 EVALUATION STATISTICS................................................................................71
3.10 REGIONAL UNCERTAINTY WITH FLOOD QUANTILE ESTIMATION.....................72
3.10.1 THE MULTIVARIATE NORMAL DISTRIBUTION.......................................................73
3.11 VALIDATION OF REGIONAL HYDROLOGICAL REGRESSION MODELS –
METHODOLOGY........................................................................................................................76
3.11.1 THE HYDROLOGICAL REGRESSION PROBLEM .....................................................76
3.11.2 MODEL SELECTION BY MONTE CARLO CROSS VALIDATION .............................78
3.11.3 ESTIMATING MSEP .....................................................................................................80
3.11.4 APPLICATION – USING SIMULATED DATA.............................................................81
3.11.5 OBSERVED REGIONAL FLOOD DATA FROM NSW, AUSTRALIA ..........................83
3.12 SUMMARY ..........................................................................................................................84
CHAPTER 4: STUDY AREA AND PREPARATION OF STREAMFLOW AND
CATCHMENT CHARACTERISTICS DATA ............................................. 85
4.1 GENERAL ..............................................................................................................................85
4.1.1 PUBLICATIONS..............................................................................................................86
4.2 STUDY AREA........................................................................................................................86
4.3 SELECTION OF CANDIDATE CATCHMENTS.................................................................87
4.4 STREAMFLOW DATA PREPARATION.............................................................................89
4.4.1 FILLING MISSING RECORDS IN ANNUAL MAXIMUM FLOOD SERIES..................89
4.4.2 TREND ANALYSIS ..........................................................................................................89
4.4.3 RATING CURVE ERROR AND IDENTIFICATION .......................................................90
4.4.4 SENSITIVITY ANALYSIS AND IMPACT OF RATING CURVE EXTRAPOLATION ON
FLOOD QUANTILE ESTIMATES............................................................................................92
4.4.5 TESTS FOR OUTLIERS ..................................................................................................94
4.5 RESULTS OF STREAMFLOW DATA PREPARATION PROCESS ..................................95
4.5.1 DATA PREPARATION FOR VICTORIA.........................................................................95
4.5.2 DATA PREPARATION FOR NSW AND ACT .................................................................99
4.5.3 SENSITIVITY ANALYSIS - IMPACT OF RATING CURVE ERROR ON FLOOD
QUANTILE ESTIMATES........................................................................................................102
4.6 SUMMARY RESULTS OF STREAMFLOW DATA PREPARATION FOR THE OTHER
STATES ......................................................................................................................................104
4.6.1 TASMANIA ....................................................................................................................104
4.6.2 QUEENSLAND..............................................................................................................105
4.6.3 SOUTH AUSTRALIA .....................................................................................................105
4.6.4 NORTHERN TERRITORY .............................................................................................105
4.6.5 WESTERN AUSTRALIA ................................................................................................105
4.6.6 SUMMARY OF STREAMFLOW DATA AUSTRALIA WIDE ........................................105
4.7 SELECTION AND ABSTRACTION OF CATCHMENT CHARACTERISTICS .............108
4.8 SUMMARY ..........................................................................................................................112
CHAPTER 5: RESULTS – RFFA BASED ON FIXED REGIONS AND REGION OF
INFLUENCE APPROACHES UNDER THE QUANTILE AND PARAMETER
REGRESSION FRAMEWORKS .................................................................................. 114
5.1 GENERAL ............................................................................................................................114
5.1.1 PUBLICATIONS............................................................................................................114
5.2 RESULTS FOR TASMANIA...............................................................................................115
5.2.1 SELECTING PREDICTOR VARIABLES WITH QRT AND PRT ..................................115
5.2.2 PSEUDO ANOVA WITH QRT AND PRT MODELS FOR THE FIXED AND ROI
REGIONS................................................................................................................................119
5.2.3 ASSESSMENT OF MODEL ASSUMPTIONS AND REGRESSION DIAGNOSTICS ......122
5.2.4 POSSIBLE SUBREGIONS IN TASMANIA....................................................................127
5.2.5 EVALUATION STATISTICS..........................................................................................128
5.3 SECTION SUMMARY ........................................................................................................130
5.4 RESULTS FOR NEW SOUTH WALES, VICTORIA AND QUEENSLAND ...................130
5.4.1 SELECTING PREDICTOR VARIABLES WITH QRT AND PRT ..................................131
5.5 REGION OF INFLUENCE VS. FIXED REGIONS FOR PARAMETER AND QUANTILE
REGRESSION TECHNIQUES ..................................................................................................140
5.5.1 REGRESSION DIAGNOSTICS – PSEUDO ANALYSIS OF VARIANCE......................140
5.5.2 REGRESSION DIAGNOSTICS – MODEL ADEQUACY AND OUTLIER ANALYSIS
................................................................................................................................................144
5.5.3 DIAGNOSTIC STATISTICS...........................................................................................148
5.5.4 EVALUATION STATISTICS..........................................................................................152
5.6 SECTION SUMMARY ........................................................................................................156
5.7 UNCERTAINTY ESTIMATION FOR NEW SOUTH WALES, VICTORIA,
QUEENSLAND AND TASMANIA IN A ROI-PRT FRAMEWORK......................................157
5.8 SUMMARY ..........................................................................................................................160
CHAPTER 6: RESULTS - MODEL VALIDATION USING LOO AND MCCV .... 161
6.1 GENERAL ............................................................................................................................161
6.1.1 PUBLICATIONS............................................................................................................161
6.2 RESULTS .............................................................................................................................162
6.2.1 PREDICTORS USED ....................................................................................................162
6.2.2 SIMULATED DATA.......................................................................................................164
6.2.3 APPLICATION WITH OBSERVED REGIONAL FLOOD DATA IN NSW ...................169
6.3 SUMMARY ..........................................................................................................................175
CHAPTER 7: BACKGROUND AND DEVELOPMENT OF THE LARGE FLOOD
REGIONALISATION MODEL AND ISSUES RELATING TO SPATIAL
DEPENDENCE................................................................................................................ 177
7.1 GENERAL ............................................................................................................................177
7.1.1 PUBLICATIONS............................................................................................................177
7.2 LFRM CONCEPT.................................................................................................................178
7.3 INTER-SITE DEPENDENCE IN GENERAL FOR THE LFRM........................................178
7.4 ANNUAL MAXIMUM DATA SET USED IN THE LFRM...............................................182
7.4.1 QUALITY CHECK OF THE LARGEST ANNUAL MAXIMA DATA.............................183
7.5 IDENTIFICATION OF AN APPROPRIATE PROBABILITY DISTRIBUTION AND
TESTING FOR HOMOGENEITY OF ANNUAL MAXIMA FLOOD DATA...........................184
7.5.1 SEARCHING FOR AN APPROPRIATE PROBABILITY DISTRIBUTION...................184
7.5.2 GOODNESS-OF-FIT TEST RESULTS..........................................................................185
7.6 HOMOGENEITY .................................................................................................................190
7.6.1 HOMOGENEITY TEST OF HOSKING AND WALLIS .................................................190
7.6.2 THE BOOTSTRAP ANDERSON-DARLING HOMOGENEITY TEST ..........................191
7.6.3 TESTING FOR HOMOGENEITY – RESULTS..............................................................191
7.7 DEVELOPMENT OF THE LFRM MODEL FOR AUSTRALIAN FLOOD DATA..........192
7.7.1 DEVELOPMENT AND CALIBRATION OF THE LFRM MODEL...............................193
7.8 EFFECTS OF INTER-SITE DEPENDENCE ON THE LFRM MODEL............................201
7.8.1 EFFECTIVE NUMBER OF INDEPENDENT STATIONS ............................................201
7.8.2 REGIONAL MAXIMUM FLOOD AT A NETWORK OF SITES - REGIONAL
MAXIMUM AND TYPICAL CURVES....................................................................................202
7.8.3 FACTORS INFLUENCING THE REGIONAL MAXIMUM ..........................................203
7.8.4 NUMBER OF SITES, N .................................................................................................203
7.8.5 CROSS CORRELATION................................................................................................204
7.8.6 DEFINITION OF A REGION FOR ANALYSIS.............................................................205
7.8.7 METHODS OF SAMPLING REGIONAL MAXIMA......................................................205
7.8.8 ROI AND RANDOM ROI NETWORK METHODS .......................................................205
7.8.9 THE TOTAL RANDOM NETWORK METHOD............................................................206
7.8.10 COMPARING SAMPLING METHODS ......................................................................206
7.9 MEASURES OF Ne – EFFECTIVE NUMBER OF INDEPENDENT STATIONS ..............207
7.9.1 EFFECTIVE NUMBER OF INDEPENDENT STATIONS, Ne ......................................207
7.9.2 A SIMPLE MODEL FOR Ne ..........................................................................................209
7.9.3 FITTING Ne BY THE MEAN .........................................................................................210
7.10 SIMULATED DATASETS ................................................................................................211
7.10.1 SYNTHETIC DATA GENERATION ............................................................................211
7.11 SUMMARY ........................................................................................................................215
CHAPTER 8: APPLICATION OF LFRM IN THE LIGHT OF SPATIAL
DEPENDENCE – RESULTS AND DISCUSSION....................................................... 217
8.1 GENERAL ............................................................................................................................217
8.2 RESULTS FOR Ne ................................................................................................................217
8.3 A CLOSER LOOK AT THE BEHAVIOUR OF Ne ..............................................................219
8.4 GENERALISING THE Ne MODEL......................................................................................223
8.4.1 CONSTANT Ne MODEL – AN EMPIRICAL RELATIONSHIP FOR Ne BASED ON
AVERAGE CORRELATION COEFFICIENT (ρ) ....................................223
8.4.2 FURTHER DISCUSSION..............................................................................................230
8.5 COMPARISON OF THE EFFECTIVE RECORD LENGTH ESTIMATES USING THE
CONSTANT Ne MODEL FOR THE REAL AND SIMULATED DATASETS .........................230
8.6 REVISITING THE LFRM IN THE LIGHT OF SPATIAL DEPENDENCE ......................231
8.7 APPLICATION OF THE LFRM MODEL TO UNGAUGED CATCHMENTS.................240
8.7.1 DERIVATION OF PRIORS FOR THE MEAN FLOOD AND CV.................................240
8.7.2 ESTIMATION OF THE ERROR COVARIANCE MATRIX – ESTIMATION OF THE
SAMPLING ERROR VARIANCE............................................................................................241
8.7.3 ESTIMATION OF THE SAMPLING ERROR – INTER-SITE CORRELATION............242
8.7.4 SOME ISSUES ASSOCIATED WITH REGIONAL ESTIMATION OF CV....................243
8.7.5 SELECTION OF PREDICTOR VARIABLES ................................................................244
8.7.6 BGLSR RESULTS FOR MEAN AND CV ......................................................................244
8.7.7 BGLSR RESULTS FOR MEAN AND CV MODELS USING ROI .................................248
8.8 VALIDATION......................................................................................................................251
8.9 SUMMARY ..........................................................................................................................257
CHAPTER 9: CONCLUSIONS ..................................................................................... 259
9.1 INTRODUCTION.................................................................................................................259
9.2 OVERVIEW OF THE STUDY ............................................................................................260
9.2.1 DATA SELECTION (CHAPTER 4) ...............................................................................260
9.2.2 RFFA IN THE FREQUENT TO MEDIUM ARI RANGE (CHAPTER 5) ......................261
9.2.3 MCCV VS LOO (CHAPTER 6)......................................................................................261
9.2.4 LARGE TO RARE FLOOD ESTIMATION (CHAPTERS 7 and 8)................................261
9.3 CONCLUSIONS...................................................................................................................262
9.3.1 DESIGN FLOOD ESTIMATION IN THE FREQUENT TO MEDIUM ARI RANGE....262
9.3.2 VALIDATION OF REGIONAL HYDROLOGICAL REGRESSION MODELS..............263
9.3.3 LARGE TO RARE FLOOD ESTIMATION....................................................................263
9.4 LIMITATIONS AND SUGGESTIONS FOR FUTURE RESEARCH ................................264
REFERENCES ..................................................................................... 268
APPENDIX A................................................................................................................... 288
A.1 PUBLISHED PAPERS FROM THIS RESEARCH ............................................................288
APPENDIX B ................................................................................................................... 289
B.1 FURTHER RESULTS ASSOCIATED WITH VICTORIA AND QUEENSLAND (FROM
CHAPTER 5) ..............................................................................................................................289
APPENDIX C................................................................................................................... 295
C.1 FURTHER RESULTS ASSOCIATED WITH THE LFRM (FROM CHAPTERS 7 AND 8)
.....................................................................................................................................................295
APPENDIX D................................................................................................................... 306
D.1 L-MOMENT RATIO DIAGRAMS AND GOODNESS-OF-FIT TEST.............................306
D.2 ANDERSON-DARLING MONTE CARLO SIMULATION GOODNESS-OF-FIT TEST
.....................................................................................................................................................307
D.3 HOMOGENEITY TEST OF HOSKING AND WALLIS ...................................................308
D.4 THE BOOTSTRAP ANDERSON-DARLING HOMOGENEITY TEST...........................309
D.5 GUMBEL VARIATES CORRESPONDING TO ARI........................................................311
LIST OF FIGURES
Figure 1 Flash flooding in Emerald Central Queensland (Oncirculation, 2011) ................ 2
Figure 2 Flow chart showing statistical techniques/ methods adopted in this thesis........... 48
Figure 3 Example of ROI techniques applied in this study ................................................. 69
Figure 4 Use of multivariate normal distribution to develop confidence limits by Monte
Carlo simulation........................................................................................................... 75
Figure 5 Plot of the selected study area (i.e. NSW, VIC, QLD and TAS) .......................... 87
Figure 6 Plot of rating ratios (RR) for station 222202......................................................... 92
Figure 7 Rating curve extension error ................................................................................. 94
Figure 8 (a) Time series plot showing significant trends after 1995 and (b) CUSUM test
plot showing significant trends after 1995. Here Vk is the CUSUM test statistic defined in
McGilchrist and Woodyer (1975).................................. 96
Figure 9 Histogram of rating ratios of annual maximum flood data in Victoria (stations
with record lengths > 25 years).................................................................................... 97
Figure 10 Distributions of streamflow record lengths of the selected 131 stations from
Victoria ........................................................................................................................ 98
Figure 11 Distributions of catchment areas of the 131 catchments from Victoria .............. 99
Figure 12 Histogram of rating ratios for 106 stations from NSW ..................................... 100
Figure 13 Distributions of streamflow record lengths of the selected 96 stations from NSW
.................................................................................................................................... 101
Figure 14 Distributions of catchment areas of the 96 catchments from NSW .................. 101
Figure 15 (a) Distribution of annual maximum flood record lengths of 682 stations from all
over Australia (b) Distribution of catchment areas of 682 stations from all over
Australia..................................................................................................................... 106
Figure 16 Geographical distributions of the selected 682 stations from all over Australia107
Figure 17 Selection of predictor variables for the BGLSR model for Q10 (QRT, fixed region
Tasmania), MEV = model error variance, AVPO = average variance of prediction
(old), AVPN = average variance of prediction (new), AIC = Akaike information
criterion, BIC = Bayesian information criterion (note that R²GLSR uses the right-hand
axis) ..... 118
Figure 18 Selection of predictor variables for the BGLSR model for skew...................... 119
Figure 19 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, fixed region, Tasmania) ............................................................................. 123
Figure 20 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, ROI, Tasmania).......................................................................................... 123
Figure 21 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, fixed region, Tasmania).................................................................................... 124
Figure 22 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, ROI, Tasmania) ................................................................................................ 124
Figure 23 Cook’s distance (Di) for locating outlier sites for skew model based on variable
combination 4............................................................................................................. 125
Figure 24 Spatial variations of the grouped minimum model error variances for Tasmania
(a) mean flood model and (b) skew model ................................................................ 128
Figure 25 Selection of predictor variables for the BGLSR model for the skew (note that
R²GLSR uses the right-hand axis)................................................... 133
Figure 26 Selection of predictor variables for the BGLSR model for Q10 model (note that
R²GLSR uses the right-hand axis), (QRT, fixed region NSW), MEV = model error variance,
AVPO = average variance of prediction (old), AVPN = average variance of prediction
(new) AIC = Akaike information criteria, BIC = Bayesian information criteria....... 135
Figure 27 Plots of the standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, fixed region and ROI, NSW) ..................................................................... 144
Figure 28 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, fixed region, ROI, NSW).................................................................................. 146
Figure 29 Plots of the standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, ROI and PRT-ROI with weighted average standard deviation and skew,
NSW) ......................................................................................................................... 147
Figure 30 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, ROI, and PRT ROI with weighted average standard deviation and skew, NSW)
.................................................................................................................................... 148
Figure 31 Spatial variations of the grouped minimum model error variances for (a) mean
flood model and (b) number of sites which produced the lowest predictive variance for
the mean flood model................................................................................................. 152
Figure 32 Boxplots of Qpred/Qobs ratios for NSW for QRT and PRT, with fixed and ROI
regions........................................................................................................................ 155
Figure 33 Design flood quantile estimation and confidence limits curves for ARIs of 2 to
100 years .................................................................................................................... 159
Figure 34 The mean squared error of prediction (MSEP) associated with LOO and MCCV
for OLSR and GLSR simulations .............................................................................. 167
Figure 35 Prediction error plot for Q10 results (models selected by OLSR and GLSR LOO
and models selected by OLSR and GLSR MCCV) ................................................... 172
Figure 36 Prediction error plot for Q100 results (models selected by OLSR and GLSR LOO
and models selected by OLSR and GLSR MCCV) ................................................... 174
Figure 37 Occurrences of the highest floods – data from NSW, QLD, VIC and TAS are
combined (only the highest value from each station’s AMFS data is taken to form the
LFRM data series)...................................................................................................... 180
Figure 38 Cross-correlation between two nearby Victorian stations 221201 and
221207 (considering all concurrent AMFS data over the period of records – only 21
data points are concurrent for the pair of stations) .................................................... 180
Figure 39 Relationship between the cross-correlations among AMFS data and distance
between pairs of stations in Victoria.......................................................................... 182
Figure 40 Geographical distribution of the 28 validation catchments for the LFRM ....... 183
Figure 41 L-moment ratio diagrams of annual maximum flood data for NSW and QLD 186
Figure 42 Visual inspection of distributional fit for GEV, GPA and P3 distributions for WA
and TAS ..................................................................................................................... 189
Figure 43 Scatter of Qmax/mean data in the (CV(Q), Qmax/mean) plane and non-linear
interpolation function................................................................................. 195
Figure 44 Scattering of Ymax data in the (CV(Q), Ymax) plane and linear interpolation
function for the pooling of 1 (1 max) and 5 (5 max) top maxima ............................. 197
Figure 45 Frequency distribution of the standardised Ymax values.................................... 200
Figure 46 Average concurrent record lengths for different network sizes ........................ 204
Figure 47 Example plot of regional maximum and typical growth curves and the effective
number of independent stations on a Gumbel plot for a random network of 2 and 4
gauging sites in Tasmania.......................................................................................... 209
Figure 48 Example plot of generated data with different constant correlation coefficients
for the state of Tasmania............................................................................................ 213
Figure 49 Variation of Ne with different network methods and experiment number for
NSW+QLD+VIC region (top panel for real data and bottom panel for simulated data)
.................................................................................................................................... 221
Figure 50 Frequency of Ne with different network methods for NSW+QLD+VIC region
(top panel for real data and bottom panel for simulated data) ................................... 221
Figure 51 Regression results of the N = 2 network combining the lnNe/lnN ratio values for
all the Australian states/regions and experiments...................................................... 225
Figure 52 Regression results of the N = 4 network combining the lnNe/lnN ratio values for
all the Australian states/regions and experiments...................................................... 226
Figure 53 Regression results of the N = 8 network combining the lnNe/lnN ratio values for
all the Australian states/regions and experiments...................................................... 227
Figure 54 Comparison of directly computed Ne from the AMFS data and Ne by the constant
Ne model..................................................................................................................... 229
Figure 55 Variation with number of sites: effective record lengths estimated using real and
simulated Ne models as a function of average correlation coefficient ....................... 231
Figure 56 Frequency distribution of standardised Ymax values using N and Ne stations..... 234
Figure 57 Various Qmax/mean quantiles derived from the LFRM_Ne model and PM (World)
model.......................................................................................................................... 236
Figure 58 Empirical frequency distributions of Q/mean quantiles derived from the
LFRM_N and LFRM_Ne for different ranges of CV ................................................ 239
Figure 59 Relationship between CV and catchment area .................................................. 243
Figure 60 Selection of predictor variables for the BGLSR model for CV ........................ 246
Figure 61 Selection of predictor variables for the BGLSR model for CV using AVPO,
AVPN, AIC and BIC ................................................................................................. 246
Figure 62 Selection of predictor variables for the BGLSR model for the mean flood...... 247
Figure 63 Selection of predictor variables for the BGLSR model for the mean flood using
AVPO, AVPN, AIC and BIC .................................................................................... 248
Figure 64 Prior and posterior pdf's for the model error variance for CV (right) and the mean
flood (left) models for NSW state.............................................................................. 251
Figure 65 Confidence interval plot of BIASr values with the LFRM_Ne and PM (world)
models for the 28 test catchments.............................................................................. 256
Figure 66 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, fixed region, VIC)...................................................................................... 291
Figure 67 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, ROI, VIC) .................................................................................................. 291
Figure 68 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, fixed region, VIC)............................................................................................. 292
Figure 69 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, ROI, VIC) ......................................................................................................... 292
Figure 70 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, fixed region, QLD) .................................................................................... 293
Figure 71 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, ROI, QLD) ................................................................................................. 293
Figure 72 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, fixed region, QLD) ........................................................................................... 294
Figure 73 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, ROI, QLD)........................................................................................................ 294
Figure 74 L-moment ratio diagram of annual maximum flood series data for VIC.......... 295
Figure 75 L-moment ratio diagram of annual maximum flood series data for WA .......... 295
Figure 76 L-moment ratio diagram of annual maximum flood series data for SA............ 296
Figure 77 L-moment ratio diagram of annual maximum flood series data for TAS ......... 296
Figure 78 L-moment ratio diagram of annual maximum flood series data for NT ........... 297
Figure 79 Visual inspection of distributional fit for GEV, GPA and P3 distributions for
NSW........................................................................................................................... 297
Figure 80 Visual inspection of distributional fit for GEV, GPA and P3 distributions for VIC
.................................................................................................................................... 298
Figure 81 Variation of Ne with different network methods and experiment number for TAS
region (top panel for real data and bottom panel for simulated data)........................ 298
Figure 82 Frequency of Ne with different network methods for TAS region (top panel for
real data and bottom panel for simulated data).......................................................... 299
Figure 83 Variation of Ne with different network methods and experiment number for NT
region (top panel for real data and bottom panel for simulated data)........................ 299
Figure 84 Frequency of Ne with different network methods for NT region (top panel for real
data and bottom panel for simulated data)................................................................. 300
Figure 85 Variation of Ne with different network methods and experiment number for WA
region (top panel for real data and bottom panel for simulated data)........................ 300
Figure 86 Frequency of Ne with different network methods for WA region (top panel for
real data and bottom panel for simulated data).......................................................... 301
Figure 87 Variation of Ne with different network methods and experiment number for SA
region (top panel for real data and bottom panel for simulated data)........................ 301
Figure 88 Selection of predictor variables for the BGLSR model for CV - WA .............. 302
Figure 89 Selection of predictor variables for the BGLSR model for CV using AVPO,
AVPN, AIC and BIC - WA ....................................................................................... 302
Figure 90 Selection of predictor variables for the BGLSR model for the mean flood – WA
.................................................................................................................................... 303
Figure 91 Selection of predictor variables for the BGLSR model for the mean flood using
AVPO, AVPN, AIC and BIC - WA .......................................................................... 303
Figure 92 Selection of predictor variables for the BGLSR model for CV – TAS............. 304
Figure 93 Selection of predictor variables for the BGLSR model for CV using AVPO,
AVPN, AIC and BIC - TAS ...................................................................................... 304
Figure 94 Selection of predictor variables for the BGLSR model for the mean flood – TAS
.................................................................................................................................... 305
Figure 95 Selection of predictor variables for the BGLSR model for the mean flood using
AVPO, AVPN, AIC and BIC - TAS.......................................................................... 305
LIST OF TABLES
Table 1 Flood quantile estimates and associated errors using ARR FLIKE with and without
consideration of rating curve error............................................................................. 103
Table 2 Summary of selected stations Australia wide ....................................................... 107
Table 3 Catchment characteristics variables used in the study.......................................... 109
Table 4 Different combinations of predictor variables considered for the QRT models and
the parameters of the LP3 distribution (QRT and PRT fixed region Tasmania) ....... 117
Table 5 Pseudo ANOVA table for Q20 model for Tasmania (QRT, fixed region and ROI)
.................................................................................................................................... 120
Table 6 Pseudo ANOVA table for Q100 model for Tasmania (QRT, fixed region and ROI)
.................................................................................................................................... 121
Table 7 Pseudo ANOVA table for the mean flood model for Tasmania (PRT, fixed region
and ROI)..................................................................................................................... 121
Table 8 Pseudo ANOVA table for the standard deviation model for Tasmania (PRT, fixed
region and ROI) ......................................................................................................... 121
Table 9 Pseudo ANOVA table for the skew model for Tasmania (PRT, fixed region and
ROI) ........................................................................................................................... 122
Table 10 Regression diagnostics for fixed region and ROI for Tasmania......................... 126
Table 11 Model error variances associated with fixed region and ROI for Tasmania (n =
number of sites in the region) .................................................................................... 127
Table 12 Evaluation statistics (RMSEr and REr) from leave-one-out (LOO) validation for
Tasmania .................................................................................................................... 129
Table 13 Summary of counts/percentages based on the rr values for QRT and PRT for
Tasmania (fixed region). “U” = gross underestimation, “D” = desirable range and “O”
= gross overestimation ............................................................................................... 129
Table 14 Summary of counts/percentages based on the rr values for QRT and PRT for
Tasmania (ROI). “U” = gross underestimation, “D” = desirable range and “O” = gross
overestimation............................................................................................................ 129
Table 15 Summary of the final BGLSR results for NSW ................................................. 131
Table 16 Summary of the catchment characteristics and statistical measures used in the
stepwise regression for the parameters of the LP3 distribution for NSW ................. 136
Table 17 Summary of the catchment characteristics and statistical measures used in the
forward stepwise regression for the flood quantiles of the LP3 distribution (ARIs = 2,
10 and 100 years) for NSW ....................................................................................... 137
Table 18 Pseudo ANOVA table for the mean flood model (PRT, fixed region and ROI,
NSW, VIC and QLD states) (Here n = number of sites in the region, k = number of
predictors in the regression equation, EVR = error variance ratio, σδ0² = model error
variance when no predictor variable is used in the regression model, σδ² = model error
variance when predictor variables are used in the regression model and tr[Σ(ŷ)] = sum of
the diagonals of the sampling covariance matrix) ..................................................... 141
Table 19 Pseudo ANOVA table for the skew model (PRT, fixed region and ROI, NSW,
VIC and QLD states) (variables are explained in Table 18 caption)......................... 142
Table 20 Pseudo ANOVA table for Q20 model (QRT, fixed region and ROI for NSW, VIC
and QLD states) (variables are explained in Table 18 caption)................................. 143
Table 21 Regression diagnostics for the fixed region and ROI for NSW, VIC and QLD. 149
Table 22 Model error variances associated with the fixed region and ROI for NSW, VIC
and QLD (n = number of sites needed for the LP3 parameters and flood quantiles) 151
Table 23 Evaluation statistics (RMSEr and REr) from LOO validation for NSW (Results
NSW for PRT using the weighted regional average standard deviation and skew
models, i.e. no predictor variables given in brackets), VIC and QLD ...................... 153
Table 24 Summary of predictor variables (here log10 is used) .......................................... 162
Table 25 Correlation between the log10 predictor variables used in the analysis .............. 163
Table 26 Results from simulated data, OLSR when σ² = 1............................................... 166
Table 27 Results from simulated data, OLSR when σ² = 0.04.......................................... 168
Table 28 Results from simulated data, GLSR when σδ² = 0.903 and ρ̂(ŷi, ŷj) = 0.30 ....... 168
Table 29 Results from simulated data, GLSR when σδ² = 0.063 and ρ̂(ŷi, ŷj) = 0.70 ....... 168
Table 30 OLSR analysis, MSEP values for calibration and validation data set (observed
data from NSW). Here log10 is used .......................................................................... 171
Table 31 GLSR analysis, MSEP values for calibration and validation data set (observed
data from NSW). Here log10 is used .......................................................................... 171
Table 32 OLSR and GLSR analysis for LOO and MCCV for Q10, optimal models shown
along with summary statistics.................................................................................... 171
Table 33 MSEP for ARI = 100-year .................................................................................. 173
Table 34 OLSR and GLSR analysis for LOO and MCCV for Q100, optimal models shown
along with summary statistics.................................................................................... 173
Table 35 Summary of goodness-of-fit tests for determining parent distribution............... 187
Table 36 Summary of MRE associated with the GEV and P3 distributions ..................... 188
Table 37 Summary of heterogeneity measures for the Australia states............................. 191
Table 38 Coefficients of non-linear interpolation from Figure 43..................................... 194
Table 39 Coefficients and R2 values of Ymax polynomial interpolation from Figure 45 ... 199
Table 40 Comparison of the parameters of the parent distribution and the distribution for
the generated data (distribution: F(x) = exp{-[1 - κ(x - ξ)/α]^(1/κ)}) and correlation
coefficient, ρ. ............................................................................................................. 214
Table 41 Experimental values of Ne for different networks and regions using the real data
(average Ne over the experiment reported) ................................................................ 218
Table 42 Experimental values of Ne for different networks and regions using the simulated
data (average Ne over the experiment reported)......................................................... 219
Table 43 Experimental results in which Ne exceeds N at a particular ARI for different
regions using the real data set .................................................................................... 222
Table 44 Properties of the Constant Ne Spatial dependence model ................................... 228
Table 45 ρ for each pair of sites for the different states/regions ....................... 232
Table 46 Total record length (L) and effective record length (Le) for the all Australian
dataset ........................................................................................................................ 232
Table 47 Coefficients and R2 values of Ymax polynomial interpolation from Figure 56 for N
and Ne sites................................................................................................................. 233
Table 48 CV values for study catchments in Australia...................................................... 237
Table 49 Summary of the finally selected BGLSR models for all the Australian states used
in the validation of LFRM ......................................................................................... 244
Table 50 Regression diagnostics for the ROI approach for the various Australian states and
test catchments ........................................................................................................... 249
Table 51 Summary of error statistics obtained from independent testing associated with the
LFRM model.............................................................................................................. 255
Table 52 Summary of the final BGLSR results for VIC ................................................... 289
Table 53 Summary of the final BGLSR results for QLD .................................................. 290
Table 54 Values of YT corresponding to ARI.................................................................... 311
COMMON NOTATIONS
A catchment area
a, b constants
CS coefficient of skewness
CV coefficient of variation
e error term in regression analysis
G regional mean skewness coefficient
H heterogeneity measure
IN identity matrix
k number of parameters in regression model
L total record lengths of all sites in a group
Le record lengths after correcting for spatial dependence
l sample L-moment
LSK L coefficient of skewness
LCV L coefficient of variation
LKT L coefficient of kurtosis
N total number of streamflow records used in flood frequency analysis and also used
to define number of simulations undertaken
Ne number of independent stations after correcting for spatial dependence
n total number of datasets in regression analysis
na average number of stations in a region
Nsim number of simulated regions for homogeneity testing
p probability
QT flood quantile having return period of T years
R2 coefficient of determination used in OLSR
T return period
tc time of concentration
X n×k matrix of basin characteristics
β k×1 vector of regression coefficients
y vector of dependent variables in the regression model
x0 row vector of basin characteristics at site 0
σδ² model error variance
σ̂δ² sample estimate of model error variance
Λ covariance matrix of regression errors
Λ̂ data-based estimate of Λ
σ² residual variance from OLSR
σδ0² prior mean of the model error variance used in BGLSR
μQ mean annual flood
μi population mean at site i
σi population standard deviation at site i
q̄i sample mean of logs of annual maxima at station i
θ parameter of probability distribution
KT frequency factor for return period T in log Pearson type 3 flood frequency analysis
ρ correlation between sites used in large flood model regionalisation
δi residual error term associated with regression of the mean
Σ sampling error covariance matrix
ρij correlation/distance relationship between stations
ρ̂ij estimated correlation/distance relationship between stations
R²GLSR pseudo coefficient of determination used in GLS regression
ABBREVIATIONS
‘AD’ Anderson Darling Statistic for Homogeneity
ARR-FLIKE Australian Rainfall and Runoff flood frequency analysis software
AUSIFD Australian Intensity Frequency Duration software
AEP Annual Exceedance Probability
AIC Akaike Information Criteria
AMFS Annual Maximum Flood Series data
ARR Australian Rainfall and Runoff
ARI Average Recurrence Interval
ASPE Average Squared Prediction Error
AVPO Average Variance of Prediction for old site
AVPN Average Variance of Prediction for new site
BGLSR Bayesian Generalised Least Squares Regression
BIC Bayesian Information Criteria
BOM Bureau of Meteorology
BPV Bayesian Plausibility Value
Area Catchment area (km2)
CD Compact disk
CMCCV Corrected Monte Carlo Cross validation
FFA Flood Frequency Analysis
forest Fraction of basin covered by medium to dense forest
qsa Fraction quaternary sediment area
GEV Generalised Extreme Value distribution
GLSR Generalised Least Squares Regression
GPA Generalised Pareto distribution
IFM Index Flood Method
ID Instantaneous Discharge
IFD Intensity Frequency Duration
IM Monthly Instantaneous Maximum Data
MMD Monthly Maximum Mean Daily Data
LFRM Large Flood Regionalisation Model
LOO Leave-One-Out Validation
LP3 log Pearson type 3 distribution
MCCV Monte Carlo Cross validation
MCMC Markov Chain Monte Carlo
MEV Model Error Variance
MOM Method of Moments Estimator
ML Maximum Likelihood Estimator
MSEP Mean Squared Error of Prediction
MVN Multivariate Normal Distribution
evap Mean annual evapotranspiration (mm)
rain Mean annual rainfall (mm)
NERC Natural Environment Research Council (UK)
NCWE National Committee on Water Engineering
NSW New South Wales
OLSR Ordinary Least Squares Regression
P3 Pearson type 3 distribution
PM Probabilistic Model
PRM Probabilistic Rational Method
PRT Parameter Regression Technique
PD, POT Peaks over threshold
QLD Queensland
QRT Quantile Regression technique
RR Rating Ratio
TID Rainfall intensity of D-hour duration and T-year average recurrence interval
(mm/hr)
RFFA Regional Flood Frequency Analysis
ROI Region of Influence
C Runoff Coefficient used in rational method
Sden Stream Density (km/ km2)
Slope Slope of central 75% of mainstream S1085 (m/km)
SEP Standard error of prediction
USGS United States Geological Survey
VIC Victoria
WLSR Weighted Least Squares Regression
CHAPTER 1: INTRODUCTION
1.1 GENERAL
This thesis focuses on the design flood estimation problem in ungauged catchments in the
range of frequent to rare average recurrence intervals (ARIs) (2 to 2000 years) using
regional flood frequency analysis (RFFA) approaches. The RFFA attempts to transfer flood
characteristics information from gauged to ungauged catchments using the concept of
‘homogeneous regions’. This thesis, in particular, investigates the research question of how
flood quantile estimation in ungauged catchments can be enhanced by adopting an
ensemble of advanced statistical techniques. The RFFA approaches developed in this thesis
attempt to minimise errors in design flood estimation through a stringent data preparation
scheme, use of sophisticated statistical techniques and an in-depth validation of the
techniques applied. This chapter provides the background, need and objectives of this
research and an overview of the thesis.
1.2 BACKGROUND
The flood phenomenon is a part of the natural disturbance regime, and an intrinsic
component of the natural climate system. It can also be one of the most destructive hydro-
meteorological phenomena in terms of its impacts on human well-being and socioeconomic
activities.
The considerable damage caused by the flooding in Queensland, Australia's north-eastern
state, in 2010, 2011 and 2012 underlines the significance of this issue. The death toll was
estimated at 30 to 35, about 2.1 million people were affected, and the damage was estimated
at around $20 billion. Further losses include the disruption to trade, in particular to
agriculture and mining, both important sources of revenue for Queensland and Australia.
Figure 1 shows the damage caused by flash flooding in Emerald, Central Queensland, in
January 2011.
Figure 1 Flash flooding in Emerald Central Queensland (Oncirculation, 2011)
To estimate the frequency and magnitude of floods for design purposes, the availability of
streamflow data is a fundamental requirement. Flood frequency analysis is often used by
practitioners to support the design of river engineering works, flood mitigation measures
and civil protection strategies. It is generally carried out by fitting peak flow observations
to a suitable probability distribution (Baratti et al., 2012). The estimation of probability of
exceedance for frequent to rare floods is essentially an extrapolation exercise based on
limited observed flood data. Thus the larger the database the more accurate the estimate
should be. From a statistical point of view, estimation from a small sample is likely to give
unreasonable or physically unrealistic parameter sets, especially for the probability
distributions with a large number of parameters (i.e. three or more). In practice, however,
recorded flood data may be quite limited. In many cases, these data may be completely
absent (i.e. ungauged catchment cases). In such situations, RFFA is adopted.
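The at-site fitting step described above can be sketched as follows. This illustrative code fits the LP3 distribution to the log10 flows of an annual maximum series by the method of moments and evaluates a quantile via a frequency factor; the Wilson–Hilferty approximation used here is an assumption of this sketch (the thesis itself uses a Bayesian fitting procedure).

```python
import numpy as np
from statistics import NormalDist

def lp3_quantile(ams, ari):
    """Fit LP3 to annual maxima by the method of moments on log10 flows
    and return the ARI-year flood quantile via a frequency factor K_T."""
    x = np.log10(np.asarray(ams, dtype=float))
    n = len(x)
    mean, std = x.mean(), x.std(ddof=1)
    # sample skew with the usual small-sample correction n/((n-1)(n-2))
    skew = (n / ((n - 1) * (n - 2))) * np.sum(((x - mean) / std) ** 3)
    z = NormalDist().inv_cdf(1.0 - 1.0 / ari)  # standard normal quantile
    if abs(skew) < 1e-6:
        kt = z
    else:
        # Wilson-Hilferty approximation to the Pearson III frequency factor
        kt = (2.0 / skew) * ((1.0 + skew * z / 6.0 - skew ** 2 / 36.0) ** 3 - 1.0)
    return 10.0 ** (mean + kt * std)
```

With short records the skew estimate is highly uncertain, which is one motivation for the regional and Bayesian methods discussed in this thesis.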
The RFFA serves two purposes. For sites where streamflow data are not available, the
analysis is based on regional data (Cunnane, 1989). For sites with available data, the joint
use of data measured at a site, called at-site data, and regional data from a number of
stations in the region provides sufficient information to enable a probability distribution to
be used with greater confidence (Dawdy et al., 2012). This type of analysis represents a
substitution of space for time where data from different locations in a region are used to
compensate for short records at a single site (Stedinger et al., 1993).
RFFA consists of three major steps: (a) identification of homogeneous regions; (b)
development of estimation models; and (c) validation of the estimation models. To form
the homogeneous regions, traditional approaches such as geographical and administrative
regions have often been adopted (I. E. Aust., 1987; Acreman and Sinclair, 1986; Acreman,
1987; Tasker et al., 1996 and Eng et al., 2005); however, these regions often lack in
hydrological similarity (Burn, 1990a and 1990b; Hosking and Wallis, 1993; Merz and
Blöschl, 2005 and Chebana and Ouarda, 2008). Regions based on climatic and physical
catchment characteristics have been proposed (Tasker et al. 1996; Bates et al., 1998 and
Rahman et al., 1999a). Moreover, to avoid problems associated with fixed boundaries, the
region of influence (ROI) approach has been adopted (Burn, 1990a and 1990b; Tasker et
al., 1996; Zrinji and Burn, 1996; Merz and Blöschl, 2005; Eng et al., 2007a, b and Gaál et al., 2008). One critical issue here is how to assign an ungauged catchment to the
appropriate region when there is more than one possible region (Bates et al., 1998).
In relation to the estimation model, several approaches have been proposed. They include
the probabilistic rational method (PRM), the index flood method (IFM) and the quantile
regression technique (QRT). In south–east Australia, the PRM is recommended for general
use in Australian Rainfall and Runoff (ARR), mainly due to its simplicity (I. E. Aust.,
1987). The essential component of this method is a dimensionless runoff coefficient, which
ARR assumes to vary smoothly over geographical space. This assumption may not be
satisfied in many cases, because two nearby catchments can exhibit quite different physical
features. Also, values associated with these runoff coefficients are estimated using
conventional moment estimates with flow records of limited length (some sites had only 10
years of record in the analysis with the ARR1987 RFFA methods). This means that these
runoff coefficient values are affected by severe sampling variability, which can then
introduce significant bias and uncertainty into the final design flood estimates. Criticism
has also been linked to the way the runoff coefficients are mapped; this can be attributed to
the assumption of geographical contiguity as a surrogate to hydrological similarity, an
assumption open to wide criticism. It is also worth mentioning the lack of independent
validation with the PRM in ARR1987.
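The PRM described above reduces to the rational formula Q_T = 0.278 · C_T · I(tc, T) · A, where the 0.278 unit-conversion constant applies for area in km², intensity in mm/h and discharge in m³/s. A minimal sketch follows; the placeholder inputs stand in for the mapped ARR1987 runoff coefficients and IFD design intensities.

```python
def prm_flood(area_km2, intensity_mm_hr, runoff_coeff_t):
    """Probabilistic rational method: Q_T = 0.278 * C_T * I(tc, T) * A.
    C_T is the T-year probabilistic runoff coefficient (mapped regionally
    in ARR1987) and I(tc, T) is the design rainfall intensity (mm/h) for
    the catchment's time of concentration tc. Returns discharge in m^3/s."""
    return 0.278 * runoff_coeff_t * intensity_mm_hr * area_km2

# e.g. a 10 km^2 catchment, 50 mm/h design intensity, C_100 = 0.4
q100 = prm_flood(10.0, 50.0, 0.4)
```

The sampling variability in the mapped C_T values, discussed above, feeds directly and linearly into Q_T, which is why errors in the runoff coefficient matter so much.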
The IFM or index frequency approach (being applicable to both flood and rainfall
estimation) (e.g. Fill and Stedinger, 1998; Madsen et al., 2002; Bocchiola, et al., 2003;
DiBaldassarre et al., 2006 and Lim and Voeller, 2009) has been a popular approach for
estimating flood quantiles since 1960 (Dalrymple, 1960). ARR (I. E. Aust., 1987) did not
favour the IFM as a design flood estimation technique for Australia. The IFM had been
criticised on the grounds that the coefficient of variation of the flood series may vary
approximately inversely with catchment area, thus resulting in flatter flood frequency
curves for larger catchments. In the United Kingdom (UK), an index flood method is
currently recommended in the Flood Estimation Handbook (FEH) where the index flood is
taken as the median annual maximum flood. The growth curve for any site is estimated
using a pooling group, which is formed using catchments considered to be hydrologically
similar to the site of interest. The FEH recommends the generalised logistic (GLO)
distribution combined with the method of L-moments for growth curve estimation. The
FEH RFFA approach was later updated by Kjeldsen et al. (2008), as documented in a
series of papers (e.g. Kjeldsen and Jones, 2009a, 2009b and 2010).
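A minimal sketch of the index flood idea follows, with the median annual maximum as the index flood (as in the FEH). The empirical quantile of the pooled, standardised data used here is a simplification standing in for the GLO/L-moment growth-curve fitting that the FEH actually prescribes.

```python
import numpy as np

def ifm_quantile(pool_ams, index_flood, ari):
    """Index flood method sketch: Q_T (site) = index flood (site) x regional
    growth factor. `pool_ams` is a list of annual maximum series from the
    pooling group; the growth factor is taken as the empirical (1 - 1/ARI)
    quantile of the pooled, median-standardised annual maxima."""
    pooled = np.concatenate([np.asarray(a) / np.median(a) for a in pool_ams])
    growth = np.quantile(pooled, 1.0 - 1.0 / ari)
    return index_flood * growth
```

The criticism quoted above amounts to saying that a single regional growth curve is too steep for large catchments if the coefficient of variation decreases with area.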
The United States Geological Survey (USGS) proposed a QRT where a large number of
gauged catchments are selected from a region and flood quantiles are estimated from
recorded streamflow data, which are then regressed against catchment variables that are
most likely to govern the flood generation process (Benson, 1962). It has been noted that
the method can give design flood estimates that do not vary smoothly with ARI; however,
in such situations hydrological judgement can be exercised to adjust the flood frequency
curves so that they increase smoothly with ARI.
As an alternative to the QRT, the parameters of a probability distribution can be regressed
against the explanatory variables (Tasker and Stedinger, 1989; Madsen et al., 2002; Reis et
al., 2005; Overeem et al., 2009). In the case of the LP3 distribution, regression equations
can be developed for the first three moments i.e. the mean, standard deviation and
skewness for logarithms of the annual maximum flood series. This method here is referred
to as the ‘parameter regression technique’ (PRT). There has been no detailed comparison of
the QRT and PRT for ungauged catchments.
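The contrast between the QRT and PRT can be sketched with ordinary least squares on synthetic data (the analyses in this thesis use BGLSR; all data, descriptors and coefficients below are hypothetical, chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 40
log_area = rng.uniform(1.0, 3.0, n)      # log10 catchment area (km^2)
log_rain = rng.uniform(2.7, 3.2, n)      # log10 design rainfall intensity
X = np.column_stack([np.ones(n), log_area, log_rain])

def ols(X, y):
    """Ordinary least squares coefficients for y ~ X b."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# QRT: regress the at-site log Q100 estimates directly on the descriptors.
log_q100 = 0.5 + 0.8 * log_area + 0.3 * log_rain + rng.normal(0, 0.1, n)
b_qrt = ols(X, log_q100)

# PRT: regress each LP3 parameter instead (only the mean of the log flows
# is shown here); any quantile is then reassembled from the predicted
# parameters, which guarantees quantiles that increase smoothly with ARI.
mean_logq = 0.2 + 0.7 * log_area + 0.2 * log_rain + rng.normal(0, 0.05, n)
b_prt_mean = ols(X, mean_logq)

# Prediction at an ungauged site with known descriptors:
x0 = np.array([1.0, 2.0, 3.0])
q100_ungauged = 10.0 ** (x0 @ b_qrt)
```

The smooth variation of PRT quantiles with ARI is one practical attraction of the parameter-based approach, whereas the QRT offers no such guarantee.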
The ordinary least squares regression (OLSR) estimator has traditionally been used by
hydrologists to estimate the regression coefficients (β) in regional hydrological models (for
both the QRT and PRT). But for the OLSR model to be statistically efficient and
robust, the annual maximum flood series in the region must be uncorrelated, all the sites in
the region should have equal record lengths, and all estimates of the ARI-year events should have equal
variance. Since the annual maximum flow data in a region do not generally satisfy these
criteria, the assumption that the model residual errors in OLSR are homoscedastic is
violated.
Stedinger and Tasker (1985, 1986) developed a generalised least squares regression
(GLSR) model for regional hydrologic regression. The important difference between the GLSR
and OLSR models lies in the development and partitioning of the covariance matrix of
the errors. The GLSR model of Stedinger and Tasker (1985) assumes that the total error
results from two sources: model errors and sampling errors (Tasker and Stedinger, 1989;
Pandey and Nguyen, 1999; Griffis and Stedinger, 2007; Gruber and Stedinger, 2008 and
Micevski and Kuczera, 2009). This is due to the fact that record lengths vary significantly
from site to site and that the flood data are cross correlated spatially. The GLSR procedure
can result in notable improvements in the precision with which the coefficients of regional
hydrologic regression models can be estimated.
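A numerical sketch of this error partitioning follows: the total error covariance is modelled as Λ = σδ²I + Σ, with Σ the (known) sampling error covariance of the at-site estimates, and the model error variance σδ² is chosen so that the weighted residual sum of squares equals its degrees of freedom (a method-of-moments condition). The bisection search is an implementation convenience for this sketch, not the algorithm of Stedinger and Tasker.

```python
import numpy as np

def glsr_fit(X, y, Sigma, max_iter=200, tol=1e-10):
    """GLSR sketch: Lambda(s2) = s2 * I + Sigma; beta solves the weighted
    normal equations; the model error variance s2 satisfies the moment
    condition r' Lambda^-1 r = n - k. Returns (beta, s2)."""
    n, k = X.shape

    def solve(s2):
        Li = np.linalg.inv(s2 * np.eye(n) + Sigma)
        b = np.linalg.solve(X.T @ Li @ X, X.T @ Li @ y)
        r = y - X @ b
        return float(r @ Li @ r), b

    wrss0, b0 = solve(0.0)
    if wrss0 <= n - k:          # sampling error alone explains the spread
        return b0, 0.0
    lo, hi = 0.0, 10.0 * float(np.var(y)) + 1.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        wrss, b = solve(mid)
        if wrss > n - k:        # weighted residuals still too large: raise s2
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return b, 0.5 * (lo + hi)
```

Sites with long records (small sampling variance in Σ) thereby receive more weight than short-record sites, which is exactly the behaviour OLSR cannot provide.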
Furthermore, Reis et al. (2003 and 2005) introduced a Bayesian approach to estimating the
coefficients of the GLSR regional regression model (BGLSR) developed by
Stedinger and Tasker (1985) for hydrological analysis. The results presented in Reis et al.
(2005) show that for cases in which the model error variance is small compared to
sampling error of the at–site estimates, which is often the case for regionalisation of a
shape parameter, the Bayesian estimator provides more reasonable and generally less
biased estimates of the model error variance than the method of moments and maximum
likelihood estimators. The Bayesian approach can also provide a realistic description of the
possible values of the model error variance (Reis et al., 2005; Micevski and Kuczera, 2009;
Haddad et al., 2012 and Haddad and Rahman, 2012). The Bayesian approach also provides a
full posterior distribution of the quantity of interest (a flood statistic, e.g. the mean flood
or a flood quantile), whereas classical methods usually give only a point estimate of the
quantity of interest (Congdon, 2001).
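The quasi-analytic idea can be sketched as follows: with a diffuse prior on β, the regression coefficients integrate out analytically, leaving a one-dimensional posterior for the model error variance that can be evaluated on a grid. The flat grid prior assumed below is an illustrative simplification of the priors discussed by Reis et al. (2005).

```python
import numpy as np

def bglsr_mev_posterior(X, y, Sigma, s2_grid):
    """Posterior weights for the model error variance s2 on a grid, after
    analytically integrating out the regression coefficients under a
    diffuse prior: p(s2 | y) is proportional to
    |Lambda|^-1/2 |X' Li X|^-1/2 exp(-0.5 r' Li r),
    with Lambda = s2 I + Sigma, Li its inverse and r the GLS residual."""
    n, k = X.shape
    logp = []
    for s2 in s2_grid:
        Lam = s2 * np.eye(n) + Sigma
        Li = np.linalg.inv(Lam)
        XtLiX = X.T @ Li @ X
        b = np.linalg.solve(XtLiX, X.T @ Li @ y)
        r = y - X @ b
        logp.append(-0.5 * (np.linalg.slogdet(Lam)[1]
                            + np.linalg.slogdet(XtLiX)[1]
                            + float(r @ Li @ r)))
    w = np.exp(np.array(logp) - max(logp))
    return w / w.sum()   # normalised posterior weights over s2_grid
```

The full posterior over the grid, rather than a single point estimate, is what supports the uncertainty statements referred to above; a posterior mean is simply `(s2_grid * w).sum()`.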
Given the above advantages, the BGLSR is applied in this thesis using both the QRT and
PRT estimation techniques. This is carried out in a ROI framework.
Validation is generally used to assess a model’s performance in hydrologic regression
analyses (Sun et al., 2011 and Tsakiris et al., 2011). The validation procedure has some
appealing and important properties, for instance, it assists in the selection of an appropriate
model according to its prediction ability for the gauged sites, while at the same time it
evaluates the prediction ability of the model for possible ungauged catchments. In the most
commonly adopted validation approach, a fixed percentage of the data (e.g. 10% or 20%) is
left out while building the model, and then the developed model is tested on the left-out
data (Stone, 1974; Michaelsen, 1987 and Xu et al., 2005). This type of ‘split sample’
validation approach has limitations, as it can provide an inadequate assessment when the full
data set is not used in a random and unbiased fashion. To make use
of all the available sites in the validation in a more efficient and random manner, two
validation approaches are tested in this thesis, which are the leave-one-out (LOO) and
Monte Carlo Cross validation (MCCV) techniques (Xu and Liang, 2001; Xu et al., 2005
and Sun et al., 2011). Both the LOO and MCCV validation techniques are used with real
and simulated flood datasets in the frameworks of the commonly applied OLSR approach
and the more powerful GLSR approach.
Large to rare flood frequency analysis is a remarkably challenging task. One often needs to
estimate floods with an annual exceedance probability (AEP) much smaller than 1%, while
the streamflow record lengths are usually much shorter, being between 20 and 100 years in
most places in Australia. For example, the average record length in the Australian RFFA
database is around 33 years, as reported in Rahman et al. (2009). Hence, it is generally the
case that the required flood magnitudes in the rare range have hardly been recorded,
meaning significant extrapolations from the available flood data are needed. It is therefore
not surprising that a suitable statistical approach is required to estimate large to rare floods
with a reasonable degree of consistency. This thesis therefore proposes a new large flood
regionalisation model, which also takes into account inter-site dependence (i.e. the number
of independent sites (Ne), which reduces the net information available for regional
analysis).
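The information loss caused by inter-site dependence can be illustrated with the classical effective-sample-size result for an equicorrelated network; the thesis develops a more general correlation–distance model, so the formula below is only a first-order illustration.

```python
def effective_sites(n_sites, mean_rho):
    """Effective number of independent sites Ne for n_sites stations whose
    annual maxima share an average pairwise correlation mean_rho
    (equicorrelation assumption): Ne = n / (1 + (n - 1) * rho)."""
    return n_sites / (1.0 + (n_sites - 1.0) * mean_rho)

# e.g. 30 stations with an average inter-site correlation of 0.2 carry
# roughly the information of only about 4.4 independent stations.
ne = effective_sites(30, 0.2)
```

In a station-year analysis this shrinks the effective pooled record length, and hence the largest ARI that can credibly be assigned to the pooled maxima.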
1.3 THE NEED FOR THIS RESEARCH
Australia is a large continent with many streams, many of which are ungauged or have little
recorded flood data. For example, out of the 12 drainage divisions in Australia, seven do
not have a single stream with 20 or more years of recorded flood data (Vogel et al., 1993).
Therefore, RFFA techniques are quite important for Australia, as they can provide
reasonably accurate design flood estimation in these ungauged or poorly gauged
catchments. The sizing of minor hydraulic structures such as culverts, farm dams and
embankments in small ungauged catchments is a common task faced by practising
engineers. The average amount spent on these projects per year was estimated at
approximately $250 million as at 1985 (Flavell, 1985; Pilgrim, 1986); this is equivalent to
about $750 million per annum in 2012 (based on long term CPI series for Australian capital
cities, ABS, 2012).
Australian Rainfall Runoff, Book 4 (I. E. Aust., 1987) states that almost 50% of Australia’s
annual expenditure on projects requiring design flood estimation is on small to medium
sized ungauged catchments. These small catchments typically have an upper limit of 25
km2, while medium sized catchments have an upper limit of 1000 km2 (I. E. Aust., 1987).
Given the economic significance, the design flood estimates in small to medium ungauged
catchments need to be as accurate as possible, since under- and over-estimation are associated
with higher flood damage costs and increased construction costs, respectively. Both of
these situations are undesirable.
Many RFFA techniques have been proposed and used over the
years. It is well understood amongst researchers in hydrology that some of these
approaches (such as the PRM in ARR 1987) are not based on a hydrologically and statistically
meaningful rationale. Most of these methods are likely to introduce significant error in
flood quantile estimates. As there are flaws in these empirical approaches, further research
is needed to develop more reliable alternatives that can provide more accurate design flood
estimates along with estimation uncertainty.
Currently within Australia there is no one universally accepted RFFA method that can be
applied with confidence; instead, there are many local approaches, which have hardly been
rigorously validated. For example, ARR (I. E. Aust., 1987) has made recommendations
that vary from state to state, these being the PRM, IFM and a variety of other empirical
approaches such as the synthetic unit hydrograph and Main Roads methods for the
State of Queensland. Since ARR came out in 1987, there have been some notable
advancements in at–site and RFFA (see Chapter 2 for more details) in Australia and
internationally (e.g. Stedinger and Tasker, 1985 and 1986, Tasker and Stedinger, 1989;
Kuczera, 1999a; Bates et al., 1998; Rahman, 2005 and Micevski and Kuczera, 2009). There
have been a number of studies that have dealt with different forms of RFFA and ways of
reducing errors in quantile estimates (e.g. Stedinger and Tasker, 1985 and 1986; Tasker et
al., 1996; Burn, 1990a and 1990b; Zrinji and Burn, 1996; Reis et al., 2005 and Griffis and
Stedinger, 2007). These methods essentially deal with ways to increase the sample size (by
pooling hydrological data) and to reduce heterogeneity and sampling errors. In Australia, there is
also the extra benefit of having over 20 years of additional streamflow data (since the
publication of ARR1987) at many gauged sites that can be incorporated into the new RFFA
techniques. This is likely to reduce uncertainty in design flood estimates for the ungauged
catchments.
It can therefore be stated that there is a need for the development and testing of new RFFA
methods using the most updated flood data for Australia. This thesis embarks on these
tasks, which are likely to form the scientific basis for recommending new RFFA methods in
the upcoming revision of the ARR.
1.4 RESEARCH QUESTIONS
This thesis, in particular, examines the following research questions in the context of
RFFA:
1. How to reduce the uncertainty in flood data to be used in RFFA modelling by
employing rigorous data preparation and checking techniques?
2. How to deal with a high level of regional heterogeneity in RFFA (found by other
researchers in Australia)?
3. How to form acceptable regions in Australia where the degree of heterogeneity has
been found to be quite high?
4. Whether regression-based approaches can be adopted in Australia to develop
statistically sound regional flood estimation models?
5. Whether the use of sophisticated statistical techniques such as BGLSR combined
with ROI can help to reduce the uncertainty in design flood estimates and thereby to
form the basis of uncertainty estimation in RFFA?
6. How a more rigorous validation approach (LOO or MCCV) can be applied with
RFFA methods?
7. How a new RFFA method can be developed for design flood estimation in the large
to rare flood ranges that explicitly accounts for spatial dependence in the annual
maximum flood series data and that can also be applied relatively easily in practice?
1.5 MAJOR TASKS
The research questions presented in Section 1.4 are answered/investigated in this thesis by
undertaking the following major tasks:
(i) Prepare a critical literature review to ascertain the current state of knowledge in
RFFA techniques with a focus to identify gaps and limitations in the current
research and thereby to formulate research questions to be investigated in this
thesis.
(ii) Prepare an Australian national flood and catchment database that can be used in
the proposed research which mostly satisfies the principal assumptions in
RFFA.
(iii) Develop the regression based RFFA techniques such as BGLSR-QRT and
BGLSR-PRT for the design flood estimation in the small to frequent flood
range (ARIs of 2 to 100 years). Also, compare the BGLSR-QRT and BGLSR-
PRT methods in the fixed region and ROI frameworks.
(iv) Compare two validation approaches, LOO and MCCV, thereby assessing their
applicability to RFFA.
(v) Develop the LFRM for regional flood estimation in the large to rare flood
ranges (ARIs of 100 to 2000 years) using a comprehensive Australian dataset.
(vi) Develop a generalised spatial dependence model to account for the inter-site
dependence (also known as spatial dependence) of annual maximum flood
series data in the application of the LFRM. Benchmark the developed LFRM
using a split-sample validation and by comparing it with the results from
alternative RFFA methods.
1.6 CONTRIBUTIONS OF THIS RESEARCH TO THE UNDERSTANDING
OF THE RFFA PROBLEM
This thesis attempts to make best use of the available streamflow data by developing
efficient regional data pooling methods and high-end statistical techniques. This focuses on
building a quality-controlled database as well as development of appropriate model
validation techniques. It develops a Bayesian GLS regression procedure with ROI approach
to tackle the excessive regional heterogeneity and to deliver more efficient parameter
estimation techniques of the adopted regional prediction equations. This thesis covers the
frequent and rare flood estimation problems ranging from 2 to 1000 years ARI, which can
possibly be extended to 2000 years ARI.
The thesis has made a notable contribution in the regional flood frequency analysis
research field, as evidenced by the publication of nine refereed journal papers (two in former
ERA-ranked A* journals, two in A and five in B category journals). These published papers are listed in
Appendix A.
1.7 OUTLINE OF THE THESIS AND CHAPTER INTRODUCTIONS
The investigations carried out in this research are presented across 9 chapters, as described
below.
Chapter 1 gives a brief introduction to the overall study, highlighting the background and
need for this research. The research questions to be investigated and major tasks to be
undertaken to answer the research questions are also identified.
Chapter 2 provides a critical literature review on the various aspects of RFFA. At the onset
of this chapter, the basic issues related to assumptions in flood frequency analysis,
distributional choices, regional homogeneity and spatial dependence are discussed.
Different RFFA methods which include the IFM, Station-year approach, Bayesian and
Monte Carlo methods and the PRM are presented. The QRT and PRT are discussed in
more detail with more emphasis on GLSR and BGLSR. The ROI approach is also critically
reviewed in relation to its use in previous applications and its relevance to this study. The
second part of this chapter discusses model validation in hydrological regression analysis.
A brief history of model validation is presented from a wide range of statistical applications
along with previous applications in the hydrological field. Finally, a brief history of large
flood estimation is given with examples of some of the methods currently used in Australia
and internationally. Overall, this chapter gives a summary of the merits and disadvantages
of each approach, thereby laying the foundation for the proposed research.
Chapter 3 describes the statistical techniques adopted in this thesis for the estimation of
design floods in the small to medium (frequent) ARI range and for the validation of
regional hydrological regression models. At the onset of this chapter a flow chart is
provided which illustrates the statistical techniques used in the thesis. Estimation of at-site
flood frequency is outlined using the LP3 distribution in a Bayesian framework. The
classical formulation of the GLSR problem found in the Econometrics field is presented to
provide an overview of the method. The chapter then goes on to provide the formulations
of the GLSR model by Stedinger and Tasker (1985 and 1986) for use in hydrological
regression analysis. The Bayesian methodology is outlined in greater detail for use with the
GLSR approach, hence the classical Bayesian formulation is also summarised.
The quasi-analytic Bayesian approach as outlined by Reis et al. (2005) for the
regionalisation of shape parameters is expanded on by developing a BGLSR model for the
regionalisation of quantiles and parameters of the LP3 distribution (QRT and PRT,
respectively). This methodology includes formulation of the likelihood function, the prior
distributions of the β coefficients and the model error variance of the regression model for
the QRT and PRT. Setting up the error covariance matrices, which are vital for the solution
of the BGLSR equations, are also presented. The steps and formulation involved in
selecting the best predictor variables for use with the BGLSR are outlined. The ROI
framework is then described in the light of its application for regionalising the parameters
and quantiles of the LP3 distribution. All the statistical diagnostics and formulation
regarding the residual analysis are also outlined in sufficient detail, along with the
statistical measures of model performance. Also, a step by step framework for regional
uncertainty analysis is presented for obtaining confidence limits with regional flood
estimates.
In the second part of Chapter 3 the mathematical and statistical techniques related to model
validation for use with regional hydrological regression is outlined. Firstly the hydrological
regression problem is defined. The formulations regarding the LOO and MCCV validation
techniques are derived. Finally, the details regarding the statistical techniques for
generating the simulated data for testing with the LOO and MCCV are discussed in detail.
The assembly of streamflow data is an important step in any RFFA study. Chapter 4
describes various aspects of streamflow data collation such as selection of the study
catchments, filling of gaps in the annual maximum flood data series, testing the data for
any suspected trends (as one of the assumptions of flood frequency analysis is that the data
must exhibit stationarity and be homogeneous), exploring rating curve errors associated with
the annual maximum flood data (flood data often has notable error associated with it, hence
identification of this is important) and checking for outliers (both low and high outliers
may be present in annual maximum flood data, these should be identified and treated
accordingly). This chapter also presents the final set of catchments to be used in this thesis.
Chapter 4 also covers the selection of the climatic and physical catchment characteristics
variables that govern flood generation process and can be used in RFFA models.
Chapter 5 integrates the techniques provided in Chapter 3 into a practical BGLSR regional
hydrologic regression framework, which is able to address the issues relevant to the
estimation of flood quantiles and statistics in an efficient manner. Chapter 5 also presents
the results associated with the RFFA for small to medium range ARIs looking at the
differences between fixed region and ROI frameworks for both the BGLSR-QRT and PRT
methods. The results are illustrated for the states of Tasmania, New South Wales (NSW),
Victoria and Queensland. The advantages of the BGLSR-ROI are outlined in sufficient
detail.
Chapter 6 presents the results of the comparison of two model validation techniques, the
LOO and MCCV in a hydrological regression framework for the state of NSW. Both the
OLSR and GLSR are applied to simulated and real datasets. This chapter also illustrates
through detailed examples the overall advantages and disadvantages of the proposed
methods for model selection and validation in RFFA.
Chapter 7 presents the estimation of floods in the large to rare flood range. The
methodology, detailed investigation and results associated with the LFRM are discussed in
detail. Chapter 7 begins with a brief discussion of the LFRM concept, which is based on the
Station-year approach. The issue of inter-site dependence in general is discussed in the
light of the application of the LFRM. The chapter also discusses the comprehensive
Australian annual maximum dataset used for the analysis. The issues of identification of a
probability distribution and homogeneity in the context of LFRM are investigated and
discussed. The theory and development of the LFRM is outlined assuming spatial
independence initially. This chapter also outlines a methodology for deriving simulated
data which is used for estimating the effective number of independent sites, as it was
recognised that the observed data had limitations relating to sampling variability and
homogeneity issues.
Chapter 8 illustrates how the effect of inter-site dependence is tackled by introducing the
‘effective number of sites (Ne) concept’. The steps and formulation needed for determining
the typical degree of spatial dependence in a network or region is discussed in detail. The
estimation of Ne is then derived assuming a generalised extreme value (GEV) distribution
with a simple model that ignores possible variation with ARI.
The results for Ne are then discussed and compared in detail for both the real and simulated
data sets; these results helped to establish the behaviour of Ne in a network and
region. The procedure for generalising the spatial dependence is provided
along with the comprehensive results from this investigation. The LFRM was then
revisited using the newly developed spatial dependence model and applied to ungauged
catchments by developing prediction equations using BGLSR for the mean flood and
coefficient of variation of annual maximum floods.
A summary, conclusion and recommendations for further research are presented in Chapter
9.
There are four appendices, as follows. Appendix A presents the refereed journal papers
that have been published or that are under review based on the research presented in this
thesis. Appendix B presents additional results associated with Chapter 5. Appendix C
presents additional results associated with Chapters 7 and 8 while Appendix D provides
some extra details on the homogeneity tests used in Chapter 7.
CHAPTER 2: REVIEW OF REGIONAL FLOOD FREQUENCY
ANALYSIS TECHNIQUES, MODEL VALIDATION AND LARGE
FLOODS
2.1 GENERAL
The aim of this chapter is to review previous studies on regional flood frequency analysis
(RFFA) techniques with a particular emphasis on the estimation of flood quantiles in the
range of average recurrence intervals (ARIs) of 2 – 100 years in relation to the quantile and
parameter regression techniques. The concepts of fixed region and region of influence
(ROI) approaches are discussed and past applications are presented. This chapter also
reviews previous studies on the validation of regression models, with particular emphasis
on hydrological regression. Finally, this chapter reviews past studies in the area of large to
rare flood estimation. Both the advantages and limitations of the methods presented are
outlined.
At the beginning, the basic issues on RFFA such as regional homogeneity, inter-site
dependence, and distributional choices are reviewed. A brief discussion is then presented
on identifying homogenous regions based on annual maximum flood series. The review of
RFFA methods as outlined above is then presented. A summary of the findings from this
review is given at the end of the chapter.
2.2 BASIC ISSUES
2.2.1 REGIONAL FLOOD FREQUENCY ANALYSIS
The availability of streamflow data is an important aspect in any flood frequency analysis.
The estimation of the probability of occurrence of floods in the credible limit range (ARIs
of 2 – 100 years) and beyond the credible limit (large to rare floods) is an extrapolation based on
limited recorded flood data. Thus, the larger the recorded data set, the more accurate the
estimates will be. From a statistical view point, estimation from a small sample may give
unreasonable or physically unrealistic parameter estimates, especially for distributions with
a large number of parameters (three or more). Large variations associated with small
sample sizes cause the estimates to be uncertain and biased. In practice, however, data may
be limited or in some cases may not be available for a site. In such situations, RFFA is
most useful.
RFFA is a technique of transferring information from gauged sites to ungauged sites.
RFFA serves two purposes. For sites where data are not available, the analysis is based on
regional data (Cunnane, 1989). For sites with limited data, the joint use of data recorded at
a site, called at-site data, and regional data from a number of stations in a region provides
sufficient information to enable a probability distribution to be used with greater reliability.
This type of analysis represents a substitution of space for time where data from different
locations in a region are used to compensate for short records at a single site (National
Research Council, 1988; Stedinger et al., 1993).
2.2.2 REGIONAL HOMOGENEITY
RFFA is based on the concept of regional homogeneity which assumes that annual
maximum flood populations at several sites in a region are similar in statistical
characteristics and are not dependent on catchment size (Cunnane, 1989). Although this
assumption may not be strictly valid, it is convenient and effective in most applications.
One of the simplest RFFA procedures that has been used for a long time is the index flood
method (IFM). The key assumption in the IFM is that the distribution of floods at different
sites within a region is the same except for a site-specific scale or index flood factor.
Homogeneity in regards to the index flood relies on the concept that the standardised
regional flood peaks have a common probability distribution with identical parameter
values. The identification of homogenous regions is an elementary step in RFFA (Bates et
al., 1998). The application typically involves the allocation of an ungauged catchment to an
appropriate homogenous group and the prediction of flood quantiles using developed
models based on catchment characteristics (Bates et al., 1998). That is, the RFFA based on
homogenous regions can transfer the information from similar gauged catchments to
ungauged catchments to allow for flood prediction.
There have been many techniques developed which attempt to establish homogenous
regions. For example, the probabilistic rational method (PRM) uses geographical contiguity
as an indication of homogeneity; that is, catchments near each other are assumed to have
similar runoff coefficients (I. E. Aust., 1987).
Looking at homogeneity from a theoretical point of view, two catchments’ annual
maximum flood series may be treated as homogenous with respect to flood behaviour if
they both satisfy two criteria: the inputs (such as rainfall) to the hydrological systems are
identical, and the climatic and physical characteristics changing the input to flood peak are
the same. No two catchments can satisfy these criteria perfectly based on the fact that each
catchment has unique physical characteristics and that each catchment has different
climatic inputs. In the search for practical homogeneity, one has to make decisions on the
degree of similarity or dissimilarity that is acceptable to identify a cut-off point where a
region is acceptably homogenous or heterogeneous, in consideration of the practical
applications of the RFFA techniques.
In defining homogenous regions for use in RFFA, a balance has to be made between
including more sites for increased information and maintaining an acceptable level of
homogeneity. In most situations when more sites are added to a region, certainly more
information is gained about the flood regime; however, sites that are hydrologically
dissimilar can increase the heterogeneity in the region.
2.2.3 INTER – SITE DEPENDENCE
Some RFFA methods make use of inter–site dependence (see also section 2.4.1) while
others do not. As reported by Cunnane (1988), inter-site dependence means that streamflow
records across a region tend to show similar behaviour within any given timeframe. This
means that:
1) In some years the annual maximum flows at all sites are due to a single widespread
meteorological event.
2) In relatively dry years, peak flows are generally low over the entire region, in which
case all annual maxima will be low.
To be able to counteract these trends in RFFA, previous studies have indicated that a
concurrent record of sufficient length should be adopted (Stedinger, 1983).
Inter-site dependence can be viewed as disadvantageous, as it reduces the value of
additional information for regional analysis, i.e. inter-site dependence limits the increase of
information from an increase in the number of stations in a region. On the other hand, it is
beneficial to the derivation of flood quantiles for ungauged sites, as it allows transfer of
information from gauged to ungauged sites. The effects of inter-site dependence on large
flood estimation are discussed in more detail in Chapter 7.
2.2.4 DISTRIBUTIONAL CHOICES
Selection of an appropriate probability distribution to be used in flood frequency analysis is
of prime importance in at-site and RFFA. It has also been a topic of interest for a long time
and one that is filled with controversies (Bobée et al., 1993). Selecting a probability
distribution has received widespread attention by many researchers. The recent literature in
this field is wide and varied and has been characterised by a proliferation of mathematical
models that often lack theoretical justification but are applied in a simplistic manner to
estimate flood flows. Benson (1968) and NERC (1975) devote considerable attention to
this problem. Cunnane (1989) summarised the distributions commonly used in hydrology,
mentioning 14 different distributions. Kidson and Richards (2005) present an informative
summary on the assumptions and alternatives for distributional choices. They cover aspects
such as data choice, model choice and alternatives and the inclusion of historical and
paleoflood data (see Stedinger and Cohn, 1986; Jin and Stedinger, 1989; Pilon and
Adamowski, 1993; Salas et al., 1994; Cohn et al., 1997; Kuczera, 1999; Martins and
Stedinger, 2001; O’Connell et al., 2002; and Reis and Stedinger, 2005). These studies
generally show that the use of historical information can be of great value in the reduction
of the uncertainty in flood quantile estimates.
In some countries, a common distribution has been recommended to achieve uniformity
between different design agencies. The U.S.A. Interagency Advisory Committee on Water
Data (IACWD, 1982) and the Institution of Engineers Australia (I. E. Aust., 1987)
recommend the log Pearson type 3 (LP3) distribution for use in the United States and
Australia, respectively. Other distributions that have received considerable attention
include the extreme value types 1, 2, 3 (EV1, 2 or 3), generalised extreme value (GEV)
(NERC, 1975), wakeby (Houghton, 1978), generalised pareto (GPA) (Smith, 1987), two-
component extreme value (Rossi et al., 1984) and the log-logistic distribution (Ahmad et
al., 1988).
The use of a standard distribution has been criticised by Wallis and Wood (1985) and
Potter and Lettenmaier (1990). They argue that a reassessment of the use of the LP3
distribution for practical flood design is overdue. Vogel et al. (1993) studied the suitability
of a number of distributions (including the LP3) for Australia. They found that the GEV
and wakeby distributions provide the best approximation to flood flow data in the regions
of Australia that are dominated by rainfall during the winter months; for the remainder of
the continent, the GPA and wakeby distributions provide better approximations. For the
same data set, the LP3 performed satisfactorily, but not as well as either the GEV or GPA
distribution. The distributions that have attracted the most interest as possible alternatives
to the LP3 are the GEV and wakeby (Bates, 1994). Studies by Rahman et al. (1999b) and
Haddad and Rahman (2008) showed that the GEV distribution fitted by LH moments
provides better results than the LP3 distribution in south-east Australia, in particular for New South Wales (NSW)
and Victoria.
Laio et al. (2009) presented a procedure to identify suitable probability distributions for
hydrological extremes. The objective of this study was to verify the most appropriate
distribution using various goodness-of-fit tests. This study used real (data from the United
Kingdom) and simulated data. It was found that no single distribution gave the best fit in
every case; however, the model selection tests were a step forward in identifying the most suitable probability
distribution. More recent studies by Haddad and Rahman (2011) – (Journal paper can be
found in Appendix A) compared seven probability distributions (EV1, log normal (LN),
normal (NORM), GEV, Pearson type 3 (P3), LP3 and EV2) for the state of Tasmania.
Using the model selection based on the Akaike information criterion (AIC), Bayesian
information criterion (BIC) and the modified Anderson Darling test (AD) as outlined by
Laio et al. (2009), they showed that the LN distribution with the Bayesian parameter fitting
procedure provided more reliable results in terms of bias and standard error than the
competing models for Tasmania.
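The AIC-based selection described above can be sketched as follows. This is a minimal illustration (not the thesis code) using maximum likelihood fitting from scipy; the candidate distributions and the synthetic annual maximum series are purely illustrative.

```python
import numpy as np
from scipy import stats

def aic_rank(annual_max, candidates=("gumbel_r", "lognorm", "genextreme")):
    """Fit each candidate distribution by maximum likelihood and rank by AIC."""
    scores = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(annual_max)                  # ML parameter estimates
        loglik = np.sum(dist.logpdf(annual_max, *params))
        scores[name] = 2 * len(params) - 2 * loglik    # AIC; lower is better
    return sorted(scores.items(), key=lambda kv: kv[1])

# Synthetic annual maximum series (illustrative only).
flows = stats.genextreme.rvs(-0.1, loc=100, scale=30, size=60, random_state=1)
ranking = aic_rank(flows)
best = ranking[0][0]  # name of the best-fitting candidate by AIC
```

BIC differs only in replacing the 2k penalty with k·ln(n), so the same loop applies.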
2.3 METHODS FOR IDENTIFICATION OF HOMOGENEOUS REGIONS
The methods for obtaining homogenous regions are based on geographical contiguity,
flood characteristics alone, or catchment characteristics alone. The theoretical aspects,
limitations and associated problems with identification of homogenous regions based on
flood data (annual maximum series) are discussed below.
In the approach based on flood data, the degree of homogeneity of a proposed group is judged on the basis of a
dimensionless coefficient of the annual maximum flood series, such as the coefficient of
variation (CV), coefficient of skewness (CS) or similar measures. Examples are given by
Dalrymple (1960), Wiltshire (1986a), Acreman and Sinclair (1986), Vogel and Kroll
(1989), Chowdhury et al. (1991), Pilon and Adamowski (1992), Lu and Stedinger (1992),
Hosking and Wallis (1993) and Fill and Stedinger (1995a, b).
Dalrymple (1960) proposed a homogeneity test based on the sampling distribution of the
standardised 10 year annual maximum flow, assuming an EV1 distribution. Wiltshire
(1986a, b) presented a test based on the sampling distribution of CV to judge the degree of
homogeneity in a region. He tested the efficiency of the proposed test on simulated data
and concluded that “it is clear that the test in its present form is unsuitable for use in
assessing regional homogeneity”. Acreman and Sinclair (1986) used a likelihood ratio test
based on the assumption of an underlying GEV distribution.
Hosking and Wallis (1991, 1993) proposed a heterogeneity measure based on the L
moment ratios: the L coefficient of variation (LCV), L coefficient of skewness (LSK) and
L coefficient of kurtosis (LKT). The advantage of this test is that, being based on L
moments, it is not distribution-specific like those mentioned above. This test has received considerable
attention since its inception (e.g. Pearson, 1991; Thomas and Olsen, 1992; Alila et al.,
1992; Guttman, 1993; Zrinji and Burn, 1996; Bates et al.,1998; Rahman et al.,1999b,
Kjeldsen and Rosbjerg, 2002; Madsen et al., 2002; DiBaldassarre et al., 2006; Castellarin et
al., 2007; Chebana and Ouarda, 2008 and Gaume et al., 2010).
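The at-site L moment ratios that feed the Hosking-Wallis measure can be computed directly from ordered data. Below is a minimal sketch (illustrative, not the thesis implementation) using the standard unbiased sample estimators of the probability-weighted moments; the site data are hypothetical.

```python
import numpy as np

def lmoment_ratios(sample):
    """Return (LCV, LSK): the sample L coefficient of variation and L skewness."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    # Unbiased probability-weighted moments b0, b1, b2
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    # First three L moments
    l1, l2, l3 = b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0
    return l2 / l1, l3 / l2

# At-site ratios for one hypothetical station; the spread of LCV across the
# stations of a region, relative to its simulated sampling variability, is what
# drives the Hosking-Wallis heterogeneity statistic.
site = [120.0, 95.0, 210.0, 160.0, 88.0, 300.0, 140.0, 175.0, 110.0, 250.0]
lcv, lsk = lmoment_ratios(site)
```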
Cunnane (1988) mentioned that identification of a homogeneous region is necessarily
based on statistical tests of hypothesis, the associated power of which, with currently
available amounts of hydrological data, is low. Thus it is not possible to divide, with great
assurance, a large number of catchments into homogeneous subgroups using flow records
with limited lengths. Indeed from an Australian perspective homogeneity cannot always be
satisfied (e.g. Haddad, 2008; Haddad and Rahman, 2012; Ishak et al., 2011 and Rahman,
1997). With the existence of large predictive uncertainty, short record lengths and the
heterogeneity that plagues Australian catchments, flood estimation methods that can deal
with heterogeneity and predictive uncertainty in an efficient manner are needed.
2.4 REGIONAL FLOOD FREQUENCY ANALYSIS METHODS –
DIFFERENT APPROACHES
There are a number of RFFA methods based on streamflow data that have been reported.
Some of the most commonly used methods are discussed below.
2.4.1 INDEX FLOOD METHOD
The index flood method (IFM) is a regional frequency approach for transferring flood or
rainfall characteristics information from a group of gauged sites to an ungauged site of
interest (Dalrymple, 1960; Madsen et al., 2002 and DiBaldassarre et al., 2006). The
estimation of a flood quantile by the IFM can be expressed by:
QT = μZT (2.1)

where μ is the scaling factor, called the index flood, and ZT is a dimensionless
growth factor (or growth curve). In many cases the index flood is taken to be the mean of
the annual maximum flood series, which is a site-specific value, while the growth
factor is assumed to be constant for the entire homogenous region under consideration.
In the IFM, the dimensionless regional growth curve is used to estimate ZT. The flood
quantile having an ARI of T year is then obtained from Equation 2.1. In the case of a
gauged site, the at-site mean flood is used in Equation 2.1; for an ungauged site, μ is
estimated using regional information. Equation 2.1 is based on the following variables:
QT is the flood quantile at a site, with an ARI of T years;
ZT is the regional growth factor, which defines the frequency distribution common to all
the sites in a homogenous region; and
μ is known as the index flood, which is typically represented (in gauged catchments) by
the mean of the at–site annual maximum flood series. Being used as a scale parameter, it is
recognised as the term which dictates the difference in quantiles between individual sites
within the homogenous region.
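Equation 2.1 amounts to a simple scaling of a regional growth curve by the at-site index flood. A minimal sketch, with purely illustrative growth-factor values rather than a fitted regional curve:

```python
def index_flood_quantile(index_flood, growth_factor):
    """Q_T = mu * Z_T (Equation 2.1)."""
    return index_flood * growth_factor

# Hypothetical regional growth factors Z_T (dimensionless) for a homogeneous region.
growth_curve = {2: 0.9, 10: 1.6, 50: 2.4, 100: 2.9}
mean_annual_flood = 150.0  # at-site mean (m^3/s): the index flood for a gauged site
q100 = index_flood_quantile(mean_annual_flood, growth_curve[100])
```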
When the IFM is to be applied to the ungauged catchment case, where there is no data
available, the difficulty in estimating μ becomes evident. Such estimation is
typically performed via multiple regression between the mean annual flood, now denoted
by Q̄, and catchment and climatic characteristics (catchment characteristics) within the
region (e.g. Fill and Stedinger, 1998). The general form of this regression equation can be
expressed as:
Q̄ = aB^b C^c D^d … (2.2)
where B, C, D, … are catchment characteristics and a, b, c, d, … are parameters of the
regression equation estimated by either ordinary or generalised least squares regression
(OLSR and GLSR); the GLSR method is discussed in more detail in section 2.5.
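Because taking logarithms linearises Equation 2.2 (log Q̄ = log a + b log B + c log C + …), its coefficients can be fitted by ordinary least squares in log space. A minimal sketch with synthetic (assumed) catchment data constructed to obey the model exactly:

```python
import numpy as np

def fit_log_linear(qbar, predictors):
    """OLS fit of Equation 2.2 in log space; returns (log a, b, c, ...)."""
    X = np.column_stack([np.ones(len(qbar))] + [np.log(p) for p in predictors])
    y = np.log(qbar)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical catchments: area (km^2) and mean annual rainfall (mm).
area = np.array([50.0, 120.0, 300.0, 800.0, 45.0, 500.0])
rain = np.array([900.0, 1100.0, 1300.0, 1000.0, 1500.0, 950.0])
qbar = 0.05 * area**0.7 * rain**0.9  # synthetic mean floods obeying the model
coef = fit_log_linear(qbar, [area, rain])
# With noise-free synthetic data the fit recovers b = 0.7 and c = 0.9 exactly.
```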
The IFM or index frequency approach (being applicable to both flood and rainfall
estimation) (e.g. Madsen et al., 2002 and DiBaldassarre et al., 2006) has been a popular
approach for estimating flood quantiles since 1960 (Dalrymple, 1960). The assumption is
made that the distribution of floods at different sites within a homogeneous region is the
same except for a site-specific scale or index flood factor. Homogeneity with regards to the
index flood relies on the concept that the standardised flood peaks from individual sites in
the region follow a common probability distribution with identical parameter values. Of
all the methods discussed in this thesis, the IFM involves the strongest assumptions
on homogeneity.
Australian Rainfall and Runoff (ARR) (I. E Aust., 1987) did not favour the IFM as a design
flood estimation technique for Australia. The IFM had been criticised on the grounds that
the coefficient of variation of the flood series may vary approximately inversely with
catchment area, thus resulting in flatter flood frequency curves for larger catchments. This
had particularly been noticed in the case of humid catchments that differed greatly in size
(Dawdy, 1961; Benson, 1962; Riggs, 1973; Smith, 1992).
The IFM as further developed in the late 1980s is a vast improvement over past
methodologies; it uses regional average values of LCV and LSK with the at-site mean
to fit a GEV or an alternative distribution (Hosking and Wallis, 1997). Hosking and Wallis
(1993) demonstrate that this approach is efficient when the region is relatively
homogeneous and record lengths are relatively short. Alternatively, a regional GEV shape
parameter can be adopted based upon a regional average (Stedinger and Lu, 1995; Hosking
and Wallis, 1997 and Fill and Stedinger, 1998). This approach is more attractive than the
typical index frequency method when record lengths and regional heterogeneity increase,
and at-site data are sufficient to define the at-site LCV but not long enough to resolve the
shape parameter (LSK). The efficiency of using either the regional value or the at-site
estimator clearly depends on the sample size. An obvious and natural solution is to
combine the at-site and the regional estimators based on the precision of each estimator.
This approach has been proposed before; for instance, Bulletin 17B (IACWD, 1982)
recommends that a regional estimate of the shape parameter of the LP3 distribution be
combined with the at-site estimate, to obtain a more precise estimator (see for example,
Griffis and Stedinger, 2004). Similarly, Fill and Stedinger (1998) have proposed such an
extension to the original IFM.
More recent studies in Australia, (Bates et al., 1998; Rahman et al., 1999a), assigned
ungauged catchments to a particular homogenous group identified (through the use of L
moments; Hosking and Wallis, 1993) on the basis of catchment and climatic
characteristics as opposed to geographical proximity. However the deficiencies in this
approach were evident in that it needed 12 catchment/climatic descriptors to be used.
Therefore its practical use is somewhat limited by its complexity and the time needed to
gather the relevant data. On an international scale, Fill and Stedinger (1998) demonstrated
that the IFM can provide improved quantile estimation when different sources of error are
reduced by explicitly accounting for the varying sampling errors and inter-site correlation
from site to site in a region.
The use of the IFM in Australia, however, is undermined by the great heterogeneity among
Australian catchments, and any results obtained would be subject to substantial error.
Therefore a method is needed where the assumption of homogeneity may be reduced by
capturing the spatial variability from site to site within a region. This provides ground and
motivation to explore the quantile regression technique (QRT) for design flood estimation
in Australian conditions.
2.4.2 STATION YEAR METHOD
In the station-year method, the standardised Q values of all the sites in the region are treated as if they form a single
random sample of size n from a common parent population. The pooled standardised data
are then fitted to a suitable distribution, and ZT values are calculated. Since this method
ignores inter-site dependence, it may lead to greater uncertainty and bias, especially at large
return periods (Cunnane, 1988 and Nandakumar et al., 1997). The issue of inter-site
dependence (see section 2.2.3) or spatial dependence is an issue that has been receiving a
lot of attention in the field of flood and rainfall estimation. The main issues being
researched are ways of (i) estimating spatial dependence based on the theory of max-stable
spatial processes (e.g. Cooley et al., 2006, 2010; Vannitsem and Naveau, 2007 and Vrac et
al., 2007 and Reich et al., 2012) and (ii) incorporating spatial dependence to estimate the
number of independent sites (Ne) in a region (e.g. Buishand, 1984; Hosking and Wallis,
1988; Dales and Reed, 1989; Nandakumar et al., 1997; Stewart et al., 1999; Nandakumar et
al., 2000; Guse et al., 2009 and Svensson and Jones, 2010).
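The pooling step of the station-year approach can be sketched as follows. The data are illustrative, an empirical plotting-position quantile stands in for a fitted distribution, and inter-site dependence is ignored, which is exactly the limitation noted above.

```python
import numpy as np

def pooled_growth_factor(sites, ari):
    """Empirical Z_T from pooled standardised annual maxima (station-year pooling)."""
    # Standardise each site's annual maxima by its at-site mean, then pool.
    pooled = np.concatenate([np.asarray(s, dtype=float) / np.mean(s) for s in sites])
    # Non-exceedance probability for an ARI of T years is 1 - 1/T.
    return np.quantile(np.sort(pooled), 1.0 - 1.0 / ari)

# Three hypothetical sites with five years of record each (15 station-years).
sites = [[80, 120, 95, 200, 150], [30, 55, 42, 60, 90], [300, 410, 280, 500, 350]]
z10 = pooled_growth_factor(sites, ari=10)
```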
2.4.3 BAYESIAN ANALYSIS AND MONTE CARLO METHODS
Bayesian inference is another alternative to classical estimation methods such as the
method of moments and maximum likelihood. In Bayesian inference, the parameters of the
likelihood are treated as random variables whose uncertainty is described by a probability
density function (Reis and Stedinger, 2005). The information in the data can be
represented by the entire likelihood function and also prior knowledge such as a numerical
estimate of the degree of belief or a researcher’s experience in a hypothesis before evidence
has been observed. The method then calculates a numerical estimate of the degree of belief
in the hypothesis after evidence has been observed. In flood frequency analysis, parameter
estimation is made through the posterior distribution, which is calculated using Bayes’
theorem: the probability that a frequency function P has parameters θ, given that
we have observed the realisations d (defined by our data, any historical information, and
limits to be placed on the analysis and threshold exceedances).
Bayes' theorem is given by Equation 2.3:
P(θ|d) = P(d|θ)P(θ) / P(d) (2.3)
where P(θ|d) is the conditional probability of θ given d (it is also called the posterior
probability because it is derived from, or depends upon, the specified value of d) and is the
result we are interested in. P(θ) is the prior probability or marginal probability of θ (‘prior’
in the sense that it does not take into account any information about d). P(d|θ) is the
conditional probability of d given θ, and is defined by choosing a distribution and
depending on the availability of historical data. P(d) is the marginal probability of d, and
acts as a normalising constant. Since complex models cannot be processed in closed form
in a Bayesian analysis, namely because of the extreme difficulty in computing the
normalisation factor P(d), simulation-based techniques such as the Markov chain Monte
Carlo (MCMC) approach, which includes the Metropolis-Hastings algorithm, are used in this analysis. More
details about the Metropolis-Hastings algorithm can be found in Geman and Geman (1984),
Casella and George (1992), Metropolis et al., (1953) and Hastings (1970).
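A minimal sketch of the Metropolis-Hastings idea: because the normalising constant P(d) cancels in the acceptance ratio, only the unnormalised posterior is needed. The likelihood (normal with unit variance), the flat prior, the proposal scale and the data below are illustrative, not those used in this thesis.

```python
import numpy as np

def metropolis(log_post, start, steps=5000, scale=0.5, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalised log posterior."""
    rng = np.random.default_rng(seed)
    chain = [start]
    lp = log_post(start)
    for _ in range(steps):
        prop = chain[-1] + rng.normal(0.0, scale)  # symmetric proposal
        lp_prop = log_post(prop)
        # Accept with probability min(1, P(prop|d)/P(current|d)); P(d) cancels.
        if np.log(rng.uniform()) < lp_prop - lp:
            chain.append(prop)
            lp = lp_prop
        else:
            chain.append(chain[-1])
    return np.array(chain)

data = np.array([4.2, 5.1, 4.8, 5.5, 4.9])              # e.g. log annual maxima
log_post = lambda mu: -0.5 * np.sum((data - mu) ** 2)   # flat prior, sigma = 1
chain = metropolis(log_post, start=0.0)
posterior_mean = chain[2000:].mean()                    # discard burn-in
```

With a flat prior and unit-variance normal likelihood the posterior mean should sit near the sample mean of the data.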
The use of Bayes’ theorem for combining prior and sample flood information was
introduced by Bernier (1967). Pericchi and Rodriguez-Iturbe (1983) discussed some of the
problems associated with Bayesian model choices in hydrology. They also discussed the
use of prior information and tentative alternatives for improvements in Bayesian
hydrological analysis. Ashkanasy (1985) advocated that the use of Bayesian methods
would result in more reliable and credible flood frequency estimates. Bayesian methods in
flood frequency analysis have since then been adopted by many researchers (e.g. Wood and
Rodriguez-Iturbe 1975; Kuczera, 1982a, 1983a, b, 1999; Fortin et al. 1997; Kuczera and
Parent, 1998; Reis and Stedinger, 2005; Reis et al., 2005; Micevski and Kuczera, 2009;
Haddad et al., 2010b, 2012 and Haddad and Rahman, 2012 – the last 3 papers are based on
the research in this thesis and can be seen in Appendix A).
2.4.4 PROBABILISTIC RATIONAL METHOD AS USED IN AUSTRALIA
The rational method was introduced by Mulvaney (1851) and has been widely regarded as
a deterministic representation of the flood generated from an individual storm. However,
the rational method recommended in Australian Rainfall and Runoff (ARR) (I. E. Aust.,
1987; Pilgrim and Cordery, 1993), is based on a probabilistic approach for use in
estimating design floods. This probabilistic rational method (PRM) is represented by:
QT = 0.278 CT Itc,T A (2.4)
where QT is the peak flow rate (m³/s) for an ARI of T years; CT is the runoff coefficient
(dimensionless) for an ARI of T years; Itc,T is the average rainfall intensity (mm/h) for a
design duration equal to the time of concentration tc (hours) and an ARI of T years; and A
is the catchment area (km²).
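A minimal worked example of Equation 2.4; the runoff coefficient, rainfall intensity and area values are illustrative, not ARR map or IFD readings.

```python
def prm_peak_flow(c_t, i_mm_per_h, area_km2):
    """Q_T = 0.278 * C_T * I_tc,T * A (Equation 2.4); returns Q_T in m^3/s."""
    return 0.278 * c_t * i_mm_per_h * area_km2

# Hypothetical 100-year design: C_100 = 0.6, I_tc,100 = 45 mm/h, A = 25 km^2.
q100 = prm_peak_flow(0.6, 45.0, 25.0)  # approximately 187.65 m^3/s
```

The factor 0.278 converts mm/h over km² into m³/s, which is why the equation is dimensionally consistent.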
The method may be regarded as a regional model, with design rainfall intensity Itc,T and
catchment area A as independent variables. The runoff coefficient CT is a factor which
lumps the effects of climatic and physical characteristics, other than catchment area and
rainfall intensity. It is noteworthy that in ARR 1987 the values of CT were estimated using
conventional moment estimates from flow records of limited lengths (e.g. some sites had
only 10 years of records). Since conventional moment estimates are largely affected by
sampling variability and extremes in the data, a higher degree of uncertainty in quantile
estimation is likely to arise due to CT reported in the ARR 1987. The mapping and use of
runoff coefficients are based on the assumption of geographical contiguity, an assumption
that is unlikely to be satisfied. Pegram (2002) and French (2002) also discussed the
strengths and weaknesses of the PRM.
Rahman and Hollerbach (2003) investigated the physical significance of runoff coefficients
and assessed the extent of uncertainty of design flood estimates obtained by the PRM. By
following the method of derivation in ARR, runoff coefficients were estimated for 104
gauged catchments in South east Australia. The mapping of these C10 coefficients onto a
suitable map of the area indicated that C10 coefficients show little spatial coherence. The C
coefficients are mapped according to the position of the gauging station and some
interpolation is then required for areas where there is no data so that the contours can be
developed. The error introduced into the contours is through the interpolation technique;
this is due to the fact that some regions will be exposed to greater spatial changes in
physical topography and other factors which directly affect the C10 coefficients. In a very
similar fashion, Rahman and Hollerbach (2003) also stated that while nearby catchments
show similar meteorological characteristics, they may possess quite dissimilar physical
characteristics, which clearly indicates that the method of simple linear interpolation over a
geographical space on the map of C10 in ARR (I. E Aust., 1987) has little validity.
More recently, Rahman et al. (2009, 2011a, b; 2011b is given in Appendix A) conducted a
study comparing the PRM to the GLSR based QRT using 107 catchments in the state of
NSW, Australia. The comparison was undertaken using a leave-one-out and split-sample
validation approach examining specific features of each RFFA method. The conclusions
that were drawn from this study were that the QRT-GLSR outperformed the PRM based on
a range of evaluation statistics. Importantly, it was found that neither the PRM nor the
QRT-GLSR performed poorly for the smaller catchments used in the study. Overall, the QRT-GLSR
was advantageous over the PRM in that no assumptions are needed regarding runoff
coefficients. The QRT-GLSR also explicitly differentiates between
sampling and model error thus allowing flexibility for further uncertainty analysis, whereas
the PRM lacks scope for further development.
2.5 QUANTILE AND PARAMETER REGRESSION TECHNIQUES
2.5.1 INTRODUCTION
The United States Geological Survey (USGS) proposed a QRT where a large number of
gauged catchments are selected from a region and flood quantiles are estimated from
recorded streamflow data, which are then regressed against catchment variables that are
most likely to govern the flood generation process. Benson (1962) suggested that T-year
flood peak discharges could be estimated directly using catchment characteristics data by
multiple regression analysis.
The QRT can be expressed as follows:
QT = aB^b C^c D^d … (2.5)

where B, C, D, … are catchment characteristic variables, QT is the flood magnitude
with a T year ARI (the flood quantile), and a, b, c, … are regression coefficients.
It has been noted that the method can give design flood estimates that do not vary smoothly
with ARI; however, hydrological judgement can be exercised in such situations and the
flood frequency curves adjusted to increase smoothly with T. There have been
various techniques and many applications of regression models that have been adopted for
hydrological regression. Most of these methods are derived from the methodology set out
by the USGS as described above.
As an alternative to the QRT, the parameters of a probability distribution can be regressed
against the explanatory variables (Tasker and Stedinger, 1989; Madsen et al., 2002). In the
case of the LP3 distribution, regression equations can be developed for the first three
moments i.e. the mean, standard deviation and skewness of the logarithms of annual
maximum flood series. For an ungauged catchment, these equations can then be used to
predict the mean, standard deviation and skewness to fit an LP3 distribution. This method
here is referred to as ‘parameter regression technique’ (PRT). However, there has been
little research on the applicability of the PRT as compared to the QRT in RFFA.
Regionalising the parameters of a probability distribution (which is referred to as PRT in
this study) also offers three significant advantages over the QRT:
1. It ensures flood quantiles increase smoothly with increasing ARI, an outcome that may
not always be achieved with the QRT. The flood quantiles obtained from the PRT may
also be used to determine whether the flood quantiles derived from the QRT provides
similar and consistent results.
2. It is straightforward to combine any at-site flood information with regional estimates
using the approach described by Micevski and Kuczera (2009) to produce more
accurate quantile estimates; and
3. It permits quantiles to be estimated for any ARI within the limits of the developed
RFFA method.
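As a sketch of advantage 1, suppose the PRT regression equations have returned the mean, standard deviation and skewness of the log10 annual maximum flows at an ungauged catchment; the LP3 quantile for any ARI then follows from a Pearson III frequency factor. The numerical values below are illustrative assumptions, and the Wilson-Hilferty approximation stands in for an exact Pearson III inverse CDF:

```python
from statistics import NormalDist

def lp3_quantile(mean_log, std_log, skew_log, T):
    """LP3 quantile for ARI T, using the Wilson-Hilferty approximation
    of the Pearson III frequency factor (log10 space)."""
    p = 1.0 - 1.0 / T                    # annual non-exceedance probability
    z = NormalDist().inv_cdf(p)          # standard normal quantile
    g = skew_log
    if abs(g) < 1e-6:
        k = z
    else:
        k = (2.0 / g) * ((1.0 + g * z / 6.0 - g**2 / 36.0) ** 3 - 1.0)
    return 10 ** (mean_log + k * std_log)

# Hypothetical PRT outputs for an ungauged catchment (log10 space)
m, s, g = 2.1, 0.35, -0.2
quantiles = [lp3_quantile(m, s, g, T) for T in (2, 5, 10, 20, 50, 100)]
print([round(q) for q in quantiles])
```

Because the frequency factor is monotone in the ARI, the quantiles necessarily increase smoothly with T, which is the first advantage listed above.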
Cunnane (1988) also reviewed methods that use regional information in the estimation of
hydrologic statistics. One versatile approach employs regional information to derive a
relationship between streamflow statistics and catchment characteristics using regional
regression analysis such as the QRT or PRT. Such regional regression methods have been
widely used to estimate hydrologic statistics at ungauged sites (Benson and Matalas, 1967;
Matalas and Gilroy, 1968; Thomas and Benson, 1970; Moss and Karlinger, 1974; Jennings
et al., 1993), and to increase the precision of the statistic of interest at sites with short
record lengths by adding regional information (Kuczera, 1982a; Stedinger, 1983; Madsen
and Rosbjerg, 1997; Fill and Stedinger, 1998; Martins and Stedinger, 2000; Reis and
Stedinger, 2003). Regional regression models such as the QRT or PRT aim to explain
spatial variability of the hydrologic statistic by relating it to catchment variables, such as
catchment area, mainstream slope, mean annual rainfall and percentage of forest cover.
The OLSR estimator has traditionally been used by hydrologists to estimate the regression
coefficients in regional hydrological models. However, for the OLSR model to be
statistically efficient and robust, the annual maximum flood series in the region must be
uncorrelated, all sites in the region must have equal record lengths, and all estimates of the
T-year events must have equal variance. Since the annual maximum flow data in a region do not
generally satisfy these criteria, the assumption that the model residual errors in OLSR are
homoscedastic is violated and the OLSR approach can provide very distorted estimates of
the model’s predictive precision (model error) and the precision with which the regression
model parameters are being estimated (Stedinger and Tasker, 1985).
To overcome the above problems in OLSR, Stedinger and Tasker (1985) proposed the
GLSR procedure which can result in remarkable improvements in the precision with which
the parameters of regional hydrologic regression models can be estimated, in particular
when the record length varies widely from site to site. In the GLSR model, the assumptions
of equal variance of the T-year events and zero cross-correlation for concurrent flows are
relaxed.
The GLSR procedure as described by Stedinger and Tasker (1985) and Tasker and
Stedinger (1989) requires an estimate of the covariance matrix of residual errors, Λ̂(Y), for
the hydrologic statistic of interest.
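Given an estimate Λ of this covariance matrix, the GLSR estimator is ordinary generalised least squares, β̂ = (XᵀΛ⁻¹X)⁻¹XᵀΛ⁻¹y. A minimal sketch with synthetic data (a diagonal Λ is assumed for brevity, which is effectively the WLSR special case; a full Λ with off-diagonal terms for cross-correlated concurrent flows is handled by exactly the same formula):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_pred = 30, 2

# Design matrix: intercept plus two (log) catchment characteristics
X = np.column_stack([np.ones(n_sites), rng.normal(size=(n_sites, n_pred))])
beta_true = np.array([1.0, 0.6, -0.3])

# Covariance of residual errors: a common model error variance plus
# site-specific sampling error variances (shorter records -> larger variance)
model_err_var = 0.05
sampling_var = rng.uniform(0.01, 0.3, n_sites)
Lam = np.diag(model_err_var + sampling_var)

y = X @ beta_true + rng.multivariate_normal(np.zeros(n_sites), Lam)

# GLS estimator: beta = (X' Lam^-1 X)^-1 X' Lam^-1 y
Lam_inv = np.linalg.inv(Lam)
beta_gls = np.linalg.solve(X.T @ Lam_inv @ X, X.T @ Lam_inv @ y)

# OLS for comparison (ignores the unequal variances)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_gls, beta_ols)
```

The GLS step down-weights sites with short records (large sampling variance), which is the source of the precision gains Stedinger and Tasker (1985) report.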
2.5.2 GENERALISED LEAST SQUARES AND WEIGHTED LEAST SQUARES
REGRESSION
As discussed above, the coefficients of regional regression models have generally been
estimated using the OLSR procedure. However, regionalisation using hydrological data
violates the assumption that the residual errors associated with the individual observations
are homoscedastic and independently distributed (Stedinger and Tasker, 1985). In the case
of hydrological data, variations in streamflow record length and cross–correlation among
concurrent flows result in estimates of the T-year events which vary in precision. Matalas
and Benson (1961), Matalas and Gilroy (1968), Hardison (1971), Moss and Karlinger
(1974) and Tasker and Moss (1979) have examined the statistical properties of the OLSR
procedures.
As shown in the studies cited above, OLSR estimates of the standard error of prediction
and the estimated parameters are generally biased under many situations. Weighted and
GLSR techniques were developed to deal with situations like those encountered in
hydrology where a regression model's residuals are heteroscedastic and perhaps cross-correlated
(Draper and Smith, 1981; Johnston, 1972). Tasker (1980) used a weighted least
squares regression (WLSR) procedure to account for unequal record lengths. Marin (1983)
and Kuczera (1982a, b, 1983a) developed an empirical Bayesian methodology, which can
deal with these issues as well.
An obstacle to the use of WLSR and GLSR procedures with hydrological data is the need
to provide an estimate of the covariance matrix of residual errors; this covariance matrix is
a function of the precision with which the true model can predict values of the streamflow
statistics of concern as well as the sampling error in the available estimates of that statistic.
The discussions and examples in the works by Tasker (1980) and Kuczera (1983b)
illustrate the difficulties associated with the estimation of this matrix.
Stedinger and Tasker (1985), in a Monte Carlo simulation with synthetically generated
flow sequences, presented a comparison of the performance of the OLSR procedure with
that of the GLSR one. In situations where the available streamflow records at gauged sites
are of widely varying lengths and concurrent flows at different sites are cross-correlated,
the GLSR procedure provided more accurate parameter estimates, better estimates of the
accuracy with which the regression model coefficients are estimated, and almost unbiased
estimates of the variance of the underlying regression model residuals. A simpler WLSR
procedure neglects the cross-correlations among concurrent flows; the WLSR algorithm
has been shown to do as well as the GLSR procedure when the cross-correlations among
concurrent flows are relatively modest.
2.5.3 PREVIOUS APPLICATION OF GENERALISED LEAST SQUARES AND
BAYESIAN GENERALISED LEAST SQUARES REGRESSION
The GLSR procedure introduced by Stedinger and Tasker (1985, 1986) has been
extensively used nationally and internationally to estimate the coefficients of regional
regression models of hydrologic statistics (WMO, 1994; Robson and Reed, 1999). Tasker
et al. (1986, 1996), Tasker and Stedinger (1987), Rosbjerg and Madsen (1994), Pandey and
Nguyen (1999), Kjeldsen and Rosbjerg (2002), Feaster and Tasker (2002), Law and Tasker
(2003), Griffis and Stedinger (2007), Rosbjerg (2007), Kjeldsen and Jones (2009) and
Haddad et al. (2011a) have all applied GLSR for the regionalisation of flood quantiles. Madsen and
Rosbjerg (1997) employed the GLSR procedure to obtain regional estimates of the
parameters (i.e. index and LCV) of a GPA distribution employed as a prior distribution in
an empirical Bayesian procedure for flood frequency analysis in New Zealand. Tasker and
Driver (1988) developed regression equations using GLSR to predict mean loads for many
chemical constituents at ungauged sites. GLSR has also been used as the basis of
hydrologic network design (Moss and Tasker, 1991).
Griffis and Stedinger (2007) examined the GLSR method in more detail. Previous studies
by the US Geological Survey using the LP3 distribution had neglected the impact of
uncertainty in the weighted skew on quantile estimation. The needed relationship was
developed in their paper and its use was illustrated in a regional flood study with 162
sites from South Carolina; the results were both accurate and hydrologically reasonable.
Their paper also introduced new statistical diagnostic metrics, such as a condition number to
check for multicollinearity, a new pseudo R2 appropriate for use with GLSR, and two error
variance ratios. Micevski and Kuczera (2009) presented a general Bayesian approach for
inferring the GLSR regional regression model and for pooling with any available site
information to obtain more accurate flood quantiles for a particular site in NSW, Australia.
Tasker (1989), Vogel and Kroll (1990), Ludwig and Tasker (1993), Kroll and Stedinger
(1999) and Hewa et al. (2003) have used GLSR for regionalisation of low-flow statistics.
Madsen et al. (1995), Madsen et al. (2002 and 2009) employed the GLSR procedure in the
regional analysis of extreme rainfall in Denmark, while Overeem et al. (2009) used a
GLSR procedure to establish the correlation structure and infer uncertainty between
parameters of the GEV distribution for extreme rainfalls in the Netherlands. Haddad et al.
(2011a) presented a GLSR procedure that regionalises the parameters of the GEV
distribution for design rainfall estimation in Australia.
Further examples are given below that address regional models of the log-space skewness
coefficient, standard deviation and mean. The current methodology for flood frequency
analysis in Australia and the United States consists of fitting a LP3 distribution to the
gauged data by estimating the mean, standard deviation, and skew of the logarithms of the
annual maximum flows. The problem is that the at-site skewness (shape parameter)
estimator is highly variable with typical record lengths often found in Australian data
(average record length of 33 years, Rahman et al., 2011). In order to improve the precision
of the estimator and to reduce uncertainty, Bulletin 17B recommends combining the at-site
estimator with a regional estimate of the skew coefficient (IACWD, 1982; McCuen, 1979;
Tung and Mays, 1981a, b; and McCuen and Hromadka, 1988). Tasker and Stedinger
(1986) applied WLSR to derive a generalized skewness estimator for the Illinois River
basin. They were unable to use GLSR because they did not know how to describe the
correlations among skewness estimators. Martins and Stedinger (2002a) developed
simple equations for the cross-correlation among estimators of skewness (and of the shape
parameter κ of the GEV and GPA distributions) as a function of the cross-correlation of the flood flows
themselves. Martins and Stedinger (2002a) employed those equations to implement a
GLSR model for regional skew estimation.
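The Bulletin 17B weighting referred to above combines the at-site and regional skew estimates in inverse proportion to their mean square errors (IACWD, 1982). A small sketch (the numerical values are illustrative assumptions):

```python
def weighted_skew(g_at, mse_at, g_reg, mse_reg):
    """Bulletin 17B style MSE-weighted combination of the at-site and
    regional log-space skew estimates (IACWD, 1982)."""
    return (mse_reg * g_at + mse_at * g_reg) / (mse_at + mse_reg)

# Hypothetical values: a short record gives an uncertain at-site skew,
# so the regional estimate receives the larger weight
g_w = weighted_skew(g_at=0.25, mse_at=0.30, g_reg=-0.10, mse_reg=0.15)
print(round(g_w, 3))
```

The weighted estimate always lies between the two inputs, pulled toward whichever has the smaller mean square error.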
Reis et al. (2003 and 2005) introduced a Bayesian approach to parameter estimation for the
GLSR regional regression model developed by Stedinger and Tasker (1985) for
hydrological analysis. The results presented in Reis et al. (2005) show that for cases in
which the model error variance is small compared to sampling error of the at–site
estimates, which is often the case for regionalisation of a shape parameter, the Bayesian
estimator provides a more reasonable estimate of the model error variance than the method
of moments (MOM) and maximum likelihood (ML) estimators. This paper also presented
regression statistics for WLSR and GLSR models including pseudo analysis of variance, a
pseudo R2, error variance ratio (EVR) and variance inflation ratio (VIR), leverage and
influence. The regression procedure was illustrated with two examples of regionalisation.
Results obtained from OLSR, WLSR and GLSR procedures using the Bayesian and MOM
model error variance estimators were compared. Gruber et al. (2007) and Gruber and
Stedinger (2008) further developed the Bayesian GLSR (BGLSR) framework first presented
by Reis et al. (2005). This operational regression methodology is used in the estimation of
regional shape parameters as well as flood quantiles. The focus of their study was also to
implement the BGLSR framework in conjunction with the diagnostic statistics
presented by Tasker and Stedinger (1989), Reis et al. (2005), Reis (2005) and Griffis and
Stedinger (2007). The new diagnostic statistics for use with BGLSR provided a
comprehensive examination of the developed regression models. More recently, there have
also been further developments in the BGLSR area for log-space skew estimation for
the non-desert regions of California (Parrett et al., 2010). Lamontagne et al. (2011) and
Veilleux et al. (2011) also used BGLSR for the estimation of log-space skews for annual
maximum rainfall flood volumes in the Central Valley and surrounding areas of
California.
2.6 FIXED REGIONS AND THE REGION OF INFLUENCE IN REGIONAL
FLOOD FREQUENCY ANALYSIS
2.6.1 FORMATION OF REGIONS
In regional flood frequency analysis, regions have often been defined based on
state/political boundaries. In ARR 1987, regional flood estimation methods were developed
for various Australian states based on fixed regions. The problem with this type of fixed
regions is that at state/regional boundaries, two different methods can provide quite
different flood estimates. To avoid this problem, regions have also been identified in
catchment characteristics data space using cluster analysis (Acreman and Sinclair, 1986),
Andrews curves (Nathan and McMahon, 1990) and various other multivariate statistical
techniques. One limitation with this type of region is that a correct method of assigning an
ungauged catchment to a ‘homogeneous’ region needs to be formulated, which is often
problematic. If the ungauged catchment is assigned to the wrong region/group, the resulting
flood estimation is associated with a high degree of error.
2.6.2 REGION OF INFLUENCE VS FLEXIBLE REGION
Since hydrological characteristics do not change abruptly across state boundaries, it is
desirable to avoid fixed boundaries. Regionalisation without fixed regions was performed
by Acreman and Wiltshire (1987) and Acreman (1987); based on their work, the region
of influence (ROI) approach was introduced by Burn (1990a, 1990b), in which each site of
interest (i.e. the catchment where flood quantiles are to be estimated) has its own region.
way the defined regions may overlap and gauged sites can be part of more than one ROI for
different sites of interest. The great advantage of the ROI approach is that it is not bounded
by geographic regions often based on political boundaries such as state lines, and it thus
avoids discontinuities at the boundaries of regions.
The ROI for the site of interest is formed out of stations in close proximity, with proximity
measured using a weighted Euclidean distance in an M-dimensional attribute space. The
distance metric is defined by:
Di,j = [ Σ_{m=1}^{M} Wm (Xm,i − Xm,j)^2 ]^{1/2}          (2.6)
with Di,j as the weighted Euclidean distance between site i and j, M is the number of
attributes included in the distance measure, and the X terms denote standardised values for
attribute m at site i and site j, and Wm is a weight applied to attribute m reflecting the
relative importance of the attribute. Standardisation of attributes removes units and avoids
introduction of bias due to scaling differences of the attributes. In a range of studies (Burn,
1990a; Zrinji and Burn, 1996; Tasker et al., 1996; Merz and Blöschl, 2005; Eng et al., 2005
and Eng et al., 2007a) the attributes were standardised by the standard deviation over the
entire dataset of attribute m. Attributes can arise from two sources, either based on physical
features, such as catchment area, stream length, channel slope, stream density, or soil type,
or statistical measures of climate and flow data, such as the coefficient of variation.
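Equation (2.6) is straightforward to implement: standardise each attribute by its standard deviation over the whole dataset, then rank sites by their weighted distance from the site of interest. A minimal sketch with synthetic attributes (the attribute scales, weights and ROI size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sites, M = 20, 3

# Synthetic attributes, e.g. log(area), mean annual rainfall, channel slope
attrs = rng.normal(size=(n_sites, M)) * np.array([1.5, 300.0, 0.02])

# Standardise by the standard deviation of each attribute over the dataset
X = attrs / attrs.std(axis=0)

W = np.array([1.0, 1.0, 0.5])  # relative importance weights (assumed)

def roi_distances(i):
    """Weighted Euclidean distance (Eq. 2.6) from site i to every site."""
    diff = X - X[i]
    return np.sqrt((W * diff**2).sum(axis=1))

d = roi_distances(0)
region = np.argsort(d)[:6]   # the 6 nearest sites form the ROI for site 0
print(region)
```

With a fixed ROI size n this reproduces the Tasker et al. (1996) style selection; the site of interest itself always appears at distance zero.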
Since the inception of the ROI procedure, it has been found that the ROI can result in improved
flood quantile estimates in terms of root mean square error and that it offers the
flexibility of variable regions (Zrinji and Burn, 1996). Zrinji and Burn (1996) went on to
refine the initial ROI approach into a hierarchical ROI approach, which was found to
perform better for the estimation of higher-order moments (i.e. skewness), the case where
more sites are needed to form a region; the hierarchical approach also improved flood
estimates in the extreme range. Tasker et al.
(1996) compared five different methods for developing regional regression models to
estimate flood quantiles at ungauged sites in Arkansas, United States. The methods looked
at traditional flood estimation regression approaches, multivariate techniques of cluster and
discriminant analysis and a ROI approach based on geographical and catchment attribute
space where the n gauging sites with the smallest distance made up the ROI for site i. The
study concluded that the ROI approach (based on catchment attributes space) outperformed
the other methods based on the lowest root mean square error.
Eng et al. (2005) used different ROI approaches for estimating the 50-year ARI flood
quantile at ungauged sites in a case study for the Gulf Atlantic Rolling Plains of the
southeastern United States. OLSR was used to regress flood statistics against catchment
characteristics for each ungauged site based on data from ROI containing the n closest
gauging sites in both geographical (GROI) and catchment attributes space (CROI). Model
performance was based on the prediction errors from independent testing. From this
testing, it was shown that, of the two ROI approaches using the n closest gauging sites,
the one based on geographical distance was better than the one using a distance measure
in catchment attributes space; that is, GROI produced lower errors than CROI.
Merz and Blöschl (2005) examined the predictive performance of several flood
regionalisation methods. They performed the assessment using a jackknife comparison of
at-site estimated regionalised flood quantiles for 575 Austrian catchments. The ROI
methods that used only catchment attributes performed relatively poorly compared with the
methods that used geographical proximity. The ROI used in this study was then combined with
multiple regression. Merz and Blöschl (2005) were able to demonstrate that when spatial
dependency was incorporated, the ROI showed smaller random errors.
Eng et al. (2007a) proposed a hybrid ROI (HROI) which combined the GROI and CROI in
a GLSR framework. They applied this method to 1,091 catchments in the southeastern part
of the United States to estimate the 50-year ARI flood quantile. Their study was able to
show that the HROI yielded smaller root mean square estimation errors while also
producing fewer extreme errors often found in either GROI or CROI. From this study it
was concluded that for the 50-year ARI flood quantile, the similarity with respect to
catchment attributes was important, however it was incomplete and that the consideration
of the geographical proximity of the sites provided a useful surrogate for characteristics
that were not included in the analysis. Eng et al. (2007b) went on to also present an
enhanced GLSR and ROI framework that is based on a leverage-guided ROI. This
procedure used two newly defined ROI leverage and influence metrics. They applied their
method to 996 catchments in the southeastern part of the United States. This new leverage-
guided ROI regression provided improvements in terms of lower root mean square errors
while also eliminating all the influential observations.
Gaál et al. (2008) also presented a number of different regional approaches to regional
frequency analysis utilising L-moments and the GEV distribution with the main focus on
the ROI approach for modelling heavy rainfall amounts in Slovakia. This study used
various pooling schemes using different alternatives of site similarity (pooling groups
defined according to climatological characteristics and geographical proximity of sites,
respectively) and pooled weighting factors. The performance of the ROI methods relative
to at-site and other conventional regional methods was assessed through Monte Carlo
simulation studies for annual maximum rainfall series of 1- and 5-day durations. The
results showed that all the frequency models based on the ROI produced growth curves that
were superior to at-site and conventional regional estimates for most of the sites studied.

The National Committee on Water Engineering intends to test the applicability of the
Bayesian GLSR method for Australian catchments which may form the basis of the
revision of the regional flood frequency methods in ARR (Project 5 Regional Flood
Methods). While both the ROI and GLSR have been applied before in a QRT framework
(see Eng et al., 2007a,b), there has been no comprehensive comparison between ROI and
fixed regions in a BGLSR framework. Moreover, there has been no solid comparison
between the estimation of quantiles and the parameters of probability distributions in a ROI
framework.
This thesis, as stated above, uses the Bayesian approach to the analysis of a GLSR model for
hydrologic statistics (Reis et al., 2005). This relatively new approach is expanded on to
allow computation of the posterior distributions of the parameters and quantiles of the LP3
distribution and the model error variance using a quasi-analytic approach. The Bayesian
approach (Reis et al., 2005) provides both a measure of the precision of the model error
variance that the traditional GLSR lacks and a more reasonable description of the possible
values of the model error variance in cases where the model error variance is smaller
compared to the sampling errors.
The ROI method used in this thesis improves on current ROI approaches (e.g. Tasker et
al., 1996) in that it seeks to minimise the regression model's predictive error variance
rather than selecting or assuming a fixed number of sites that minimise a distance metric.
More details regarding the application of this method are provided in Chapter 3.
2.7 MODEL VALIDATION IN HYDROLOGICAL REGRESSION
ANALYSIS
In multiple linear regression analysis, one must resolve which set of predictor variables is
best suited for inclusion in the regression equation without overfitting the model, and
which of the many candidate models is the most parsimonious for making the most reliable
predictions in the ungauged catchment case; the addition of unnecessary predictor
variables often leads to weaker models (e.g. producing greater uncertainty).
Validation is generally used to assess a model’s performance in hydrologic regression
analysis. In the validation approach, a fixed percentage of the data (e.g. 10%, 20%) is left
out while building the model, and then the developed model is tested on the left out data,
which is not used in the model building (i.e. validation data set). The validation procedure
has some appealing and important properties, e.g. it assists to select an appropriate model
according to its prediction ability, while at the same time evaluating the prediction ability
of the model for ungauged catchments.
2.7.1 HISTORY OF MODEL VALIDATION
During the last twenty years, different validation methods have been widely used in
fields of science such as chemometrics (Faber and Kowalski, 1997; Song Xu et al., 2005)
and econometrics (Racine, 2000); examples include the selection of a model in both
univariate and multivariate calibrations using real and simulated data sets. Song Xu and
Zeng Liang (2001) and Song Xu et al. (2005) provided a
detailed study of leave-one-out (LOO) vs. Monte Carlo cross validation (MCCV) in
multivariate calibration and quantitative structure-activity relationship research. The history
of validation methods was summarised by Stone (1974) and Michaelsen (1987). Mosteller
and Tukey (1977) also presented a good introduction to validation methods. Efron (1983)
and Bunke and Droge (1984) described the statistical behaviour of different validation
methods. In classical statistical literature, validation is most often referred to as LOO. In
LOO, one data point is left out while building a regression model (or other form of model)
and then the model is tested on the previously left out data point. The procedure is repeated
until all the data points are independently tested. Efron (1986) showed that LOO is not very
efficient in estimating prediction error. Marter and Martern (2001) pointed out that LOO
often results in overfitting and underestimates the true prediction error of the model.
An asymptotically consistent method selects the best prediction model with
probability one as the sample size tends to infinity (n → ∞). With this definition, LOO has a
smaller chance of selecting the right model; that is, the probability becomes much smaller
than one (see Shao, 1993). In hydrologic regression, often a large number of predictor
variables (e.g. catchment area, mean annual rainfall, design rainfall intensity, fraction
forest, soil indices, elevation and slope) are available; here LOO is likely to include
unnecessary predictor variables in the model (Shao, 1993). In such situations, the selected
model tends to perform well in calibration but quite poorly during prediction.
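For a regression model, the LOO procedure described above amounts to refitting the model n times, each time withholding one site and predicting it. A minimal sketch for an OLS model on synthetic data (not the thesis dataset):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.4]) + rng.normal(0.0, 0.3, n)

def loo_sse(X, y):
    """Leave-one-out: refit OLS without site i, predict site i, and
    accumulate the squared prediction errors over all n sites."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append(y[i] - X[i] @ beta)
    return np.sum(np.square(errs))

print(loo_sse(X, y) / n)  # LOO estimate of the mean square prediction error
```

Every site is used for both fitting and testing, which is why LOO is attractive for the short records typical of hydrology, despite the model-selection weaknesses noted above.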
MCCV, a form of model validation, was first introduced by Picard and Cook (1984). Shao
(1993) proved that this method is asymptotically consistent and has a greater chance than
LOO of selecting the best model with more accurate prediction ability. MCCV leaves
out a substantial portion of the sample at a time during model building and validation and repeats
the procedure many times. When compared with ordinary methods for selecting the
best predictor variables (i.e. stepwise regression employing statistics such as Mallows'
Cp or p-value hypothesis tests), MCCV may be more desirable as it evaluates the different
models according to their predictive ability using many different combinations of
validation data sets. Interestingly, MCCV has not been tested in hydrologic regression
analysis where one often deals with a very limited and scarce observed data set.
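MCCV can be sketched as repeated random splits: a substantial fraction of the sites (for example 30%) is withheld at random, the model is fitted on the remainder and tested on the withheld set, and candidate models are compared on the average validation error over many repeats. All settings below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 40
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
# Only the first two predictors matter; the last two are noise variables
y = X_full @ np.array([1.0, 0.5, -0.4, 0.0, 0.0]) + rng.normal(0.0, 0.3, n)

def mccv_mse(X, y, p=0.3, repeats=200):
    """Monte Carlo cross validation: average validation MSE over many
    random splits, each leaving out a fraction p of the sites."""
    n = len(y)
    n_val = int(round(p * n))
    errs = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        val, fit = idx[:n_val], idx[n_val:]
        beta, *_ = np.linalg.lstsq(X[fit], y[fit], rcond=None)
        errs.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return np.mean(errs)

mse_small = mccv_mse(X_full[:, :3], y)  # parsimonious model
mse_big = mccv_mse(X_full, y)           # model with two noise predictors
print(mse_small, mse_big)
```

Because a large part of the sample is withheld each time, models carrying unnecessary predictors tend to be penalised more heavily than under LOO, which reflects the consistency property proved by Shao (1993).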
2.7.2 PREVIOUS APPLICATIONS OF LEAVE-ONE-OUT VALIDATION IN
HYDROLOGY
LOO has often been adopted in hydrology mainly because of the limited sample sizes of
hydrological data: it makes the best use of the available data. There is an abundance of
literature on LOO in hydrological applications; a few examples are presented below. LOO
has found popularity in the estimation of rainfall statistics, low-flow indices and quantiles, and
flood quantile estimation. For example, Brath et al. (2003) investigated the statistical
properties of the rainfall extremes in northern central Italy; the reliability of the estimates at
ungauged sites was assessed using LOO validation. Di Baldassarre et al. (2006) used
Monte Carlo experiments and LOO validation in the estimation of uncertainty for design
rainfalls for ungauged sites in northern central Italy as well. Sun et al. (2011) used a LOO
validation for model evaluation in predicting monthly rainfall in the Daqing Mountains in
northern China.
The regionalisation of low flows has gained popularity recently and is of great importance
in hydrological studies; it is also a critical issue for the PUB initiative (i.e. Prediction in
Ungauged Basins of the International Association of Hydrological Sciences – IAHS; e.g.,
Sivapalan et al., 2003). Castiglioni et al. (2009) presented the estimation of low flow
indices in ungauged catchments in Italy by applying deterministic and geostatistical
techniques for interpolating low flow indices in physiographical space. A LOO validation
procedure was adopted to quantify the accuracy of each technique when it is applied to
ungauged catchments. Through the LOO validation the conclusion was drawn that the
geostatistical techniques outperformed the deterministic ones. In Austria, Laaha and
Blöschl (2007) presented a national low flow estimation procedure for the whole country
for both gauged and ungauged catchments. In each step of the estimation procedure, many
alternative methods were tested by LOO validation. This led to the identification of the best
performing method for low flow estimation in Austria. Canonical correlation analysis
(CCA) was used in the estimation of low-flows in Greece as shown by Tsakiris et al.
(2011). Tsakiris et al. (2011) also used LOO validation to conclude whether CCA could
reliably assist in catchment classification into sub regions and also whether partitioning the
region into two sub regions offered improvements of low flow quantile estimates through
multiple linear regression.
Many studies in the literature can also be found regarding the use of LOO for validation of
RFFA models (for example see Sankarasubramanian and Lall, 2003; Merz and Blöschl,
2005; Juraj and Ouarda, 2007; Chowdhury and Sharma, 2009; Kjeldsen, 2010; Iacobellis et
al., 2011). Merz and Blöschl (2005) examined the predictive performance of several flood
regionalisation methods for 575 Austrian catchments using a LOO comparison. Regional
flood-rainfall duration-frequency modelling at small ungauged sites was undertaken in
south-western Ontario, Canada by Juraj and Ouarda (2007). Model performance was
evaluated by a LOO procedure using evaluation statistics such as average bias and relative
root-mean square error. Chowdhury and Sharma (2009) applied a similar validation
technique which essentially resembled the LOO to predict and forecast arid river flows in
Australia. In the United Kingdom (UK), Kjeldsen (2010) used LOO in modelling the
impacts of urbanisation on flood frequency relationships. The LOO showed that the
developed adjustment factors were generally better at predicting the effects of urbanisation
on the flood frequency curve than the adjustment factors currently used in the United
Kingdom.
It can be seen from the literature review that little attention has been paid to date to the
application of MCCV in RFFA and hydrological applications, or to the examination of the
possible benefits that could be gained from this procedure. Hence, this thesis looks at three
main issues as part of its broad objective: (1) demonstrating the application of the MCCV
method in hydrological regression analysis using both OLSR and GLSR; (2) comparing
MCCV with the most commonly applied LOO validation for selecting the most
parsimonious regression model to be applied to ungauged catchments; and (3)
demonstrating the best use of the limited datasets often encountered in hydrology, which
can hinder the detailed validation of hydrological regression models.
2.8 REGIONAL FLOOD FREQUENCY FOR LARGE TO RARE FLOODS
2.8.1 BRIEF REVIEW OF LARGE FLOOD ESTIMATION AND PREVIOUS
APPLICATIONS
Estimation of floods with large to rare and even extreme return periods is of great
importance for hydrological design and risk assessment for large infrastructure. The term
'large' floods refers to floods with ARIs of 50 to 100 years (Nathan and Weinmann, 2001). Floods in the
range from 100 years ARI to the ‘credible limit of extrapolation’ (ARI in the order of 2000
years) are referred to as ‘rare’ floods, while floods from the credible limit of extrapolation
to the PMF are termed ‘extreme’ floods. Due to knowledge and data limitation and the
uncertainty involved in extrapolating beyond available data, the errors in final estimates
can be quite high. The average record length of flood data for Australian small to
medium sized catchments is about 33 years (Rahman et al., 2009). To make better use of
this information and to transfer it to ungauged catchments, regional estimation methods
are again used, as described in Section 2.4. Some studies, both in the
past and present, on an international scale have looked at the advantages and disadvantages
of different regional models for large, rare and extreme floods (Ferrari et al., 1993;
Kundzewicz et al., 1993; Katz et al., 2002; Castellarin, 2007; Castellarin et al., 2007; Vogel
et al., 2007; Van Gelder et al., 2007; Moisello, 2007; Majone et al., 2007; El Aldouni, 2008;
Laio et al., 2009; Castellarin, 2009; Calenda et al., 2009 and Gaume et al., 2010).
In Australia, the issue of large to extreme flood estimation in the past has been addressed
by some researchers (e.g. Pilgrim, 1986; Rowbottom et al., 1986; Pilgrim and Rowbottom,
1987; Stedinger et al., 1993; Nathan and Weinmann, 2001 and Haddad et al., 2010). Book
VI of Australian Rainfall and Runoff (ARR) was upgraded in 1999 with guidance for
estimation of large to probable maximum floods (PMF). The procedures outlined in
ARR1999 include flood frequency analysis and various rainfall-based methods. For flood
frequency estimates in the range of ‘rare’ floods, the use of regional information plus
paleohydrological information was suggested, and for rainfall-based methods, an annual
exceedance probability (AEP) neutral approach was recommended (Nathan and
Weinmann, 2001).
The statistics of extremes have played an important role in engineering practice for water
resources design and management. There have been recent developments in statistical
theory of extreme values that can be applied to improve the rigour of these flood estimates
and to make the estimates more physically meaningful. The development of more rigorous
statistical methodology for regional analysis of large to rare floods as well as the extensions
in Bayesian methods can help to improve and quantify uncertainty in the estimation
procedure. Although the fundamental probabilistic theory of extreme values has been well
developed for a long time (Leadbetter et al., 1983; Coles, 2001; Cooley et al., 2006; Cooley
et al., 2010), the statistical modelling of large to rare floods remains a subject of active
research. Probability weighted moments (PWM) or L-moments are more popular than the
maximum likelihood (ML) approach in the hydrology of large and more extreme events
(Katz et al., 2002). L-moments offer computational simplicity and perform very well in
small samples (Hosking, 1990; Hosking et al., 1985).
Regional analysis is another way of making use of the available information, and it
originated with the estimation of large to rare floods in mind (Dalrymple, 1960; Hosking
et al., 1985; Jothityangkoon and Sivapalan, 2003; Castellarin et al., 2005; Castellarin et al.,
2007; Douglas and Vogel, 2006; Vogel et al., 2007; Gaume et al., 2010). The basic idea is
that if a region is relatively homogeneous then the estimation of large to rare flood
quantiles at a given site may be improved by also using the larger observations at other
sites (i.e. a trade-off between space and time). Castellarin et al.
(2005) introduced an estimator of the exceedance probability associated with a regional
envelope curve (REC) for extreme flood estimation, which accounts for the impact of inter-
site correlation of annual floods. Douglas and Vogel (2006) provided a probabilistic
interpretation of the behaviour of floods of record in the United States (U.S.) for use with
the REC. Castellarin (2007) and Castellarin et al. (2007) applied the probabilistic and
multivariate probabilistic RECs to real data in Italy for extreme flood estimation. They
documented that the multivariate extension outperforms the ordinary REC and provides
flood quantile estimation at ungauged sites that is nearly as reliable as index flood
quantiles. Vogel et al.
(2007) went on to enhance the method presented by Castellarin et al. (2005) and
Castellarin et al. (2007) by introducing a general expression for the exceedance probability
of an envelope curve. A case study was implemented using historical flood series from 226
sites located across the U.S. The results overall indicated that the approach introduced by
Vogel et al. (2007) offers significant promise for the estimation of large to extreme floods
with envelope curves for heterogeneous regions. More recently, Gaume et al. (2010)
proposed an approach based on standard index regionalisation methods for extreme floods
in Slovakia and the south of France. They created larger data samples by using historical,
paleoflood or extreme flood data from ungauged catchments to reduce the uncertainties in
high return period quantile estimates for a region.
It is well known that regionalisation models based on the “index flood” method assume
some form of homogeneity in their application. In particular, it is assumed that the
probability distribution of the standardised variable, obtained by normalising the annual
maximum flows by the population mean, is the same at all catchment sites inside the
homogeneous region. In fact, the sample values of the relevant moments (the coefficient of
variation (CV) and the coefficient of skewness, if the analysis is limited to second and
third order moments) can vary over a very wide range. Hence, given the high variability
associated with these parameters of a probability distribution, the error in the derived
quantile estimates could be very large.
Recently, a new probabilistic model (PM) has been introduced (Majone and Tomirotti,
2004; Majone et al., 2007; Haddad et al., 2011b) specifically for this sort of analysis.
Majone and Tomirotti (2004) originally calibrated the PM for Italian rivers, and later
extended the method using 7300 historical series of annual maximum flows observed at
gauging stations in different geographical areas around the world. Majone et al. (2007)
applied the PM to flood data from 8,500 gauging stations across the world and found that
the method can provide quite reasonable design flood estimates for ARIs in the range of
4,000 to 9,000 years. In a study conducted by Haddad et al. (2011b) on a data set
containing 227 gauging sites from Australia (Victoria and NSW), it was found that the PM,
when coupled with GLSR, performs very well when applied to ungauged catchments and
can estimate floods of 200 to 400 years ARI with reasonable accuracy.
The PM (Majone and Tomirotti, 2004; Majone et al., 2007; Haddad et al., 2011b) is based
on the assumption that the standardised maximum values (Qmax) of the annual maximum
flood series from a large number of individual sites in a region can be pooled (after
accounting for the across-site variations in the mean and CV of annual maximum floods).
The concept is similar to the Cooperative Research Centre for Catchment Hydrology
Focussed Rainfall Growth Estimation (CRC-FORGE) method (Nandakumar et al., 1997),
where extreme design rainfall estimates are based on rainfall data pooled from a large
region of up to several hundred gauges (the concept of an expanding region). The
particular advantage of the PM is that it does not assume a constant CV across sites, as the
index flood approach does; this feature, in particular, allows the PM to pool data more
efficiently over a very large region.
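The pooling idea can be sketched in a few lines of code. The standardisation shown here, in which each site's record maximum is scaled by that site's own mean and CV of annual maxima, is one plausible choice used purely for illustration; it is not the exact LFRM formulation, which is developed later in the thesis.

```python
import numpy as np

def pooled_standardised_maxima(site_series):
    """Pool the record maxima of many sites into one regional sample.
    Each site's maximum is standardised by that site's own mean and CV
    of annual maximum floods, so sites of very different scale and
    variability can be combined. A sketch only, not the thesis' final
    LFRM standardisation."""
    pooled = []
    for q in site_series:
        q = np.asarray(q, dtype=float)
        mean = q.mean()
        cv = q.std(ddof=1) / mean
        # standardised record maximum: (Qmax/mean - 1) / CV
        pooled.append((q.max() / mean - 1.0) / cv)
    return np.array(pooled)
```

Because the standardisation is scale-free, multiplying a site's flows by a constant leaves its contribution unchanged, which is what permits pooling across catchments of very different size.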
The PM, termed the “large flood regionalisation model” (LFRM) for large to rare floods as
described by Majone et al. (2007) and Haddad et al. (2011b), is chosen for further study as
another objective of this thesis. This method, as discussed above, is an empirical approach
that makes use of pooled data from various sites while taking into account the differences
in the mean and the varying CV from site to site. This form of standardisation allows the
pooling of more data from many stations. Compared to standard methods, the application
of the LFRM can overcome many of the difficulties, limitations and assumptions of large
to rare flood frequency analysis. The main focus of the LFRM in this thesis is expanded to
couple it with a spatial dependence model (such as the CRC-FORGE constant spatial
dependence model, e.g. Buishand, 1984; Hosking and Wallis, 1988; Dales and Reed, 1989;
Nandakumar et al., 1997; Stewart et al., 1999; Nandakumar et al., 2000; Castellarin et al.,
2007; Vannitsem and Naveau, 2007; Guse et al., 2009) that reflects the reduction in the net
information available for regional analysis using spatially dependent data (see also
sections 2.2.3 and 2.4.2). The LFRM is also to be extended to the ungauged catchment
case by coupling it with the BGLSR method and the ROI to estimate the mean and CV of
annual floods at sites where there is little or no data. The advantages of the BGLSR and
ROI methods have already been discussed in sections 2.5 and 2.6.
2.9 IMPACT OF CLIMATE CHANGE ON FLOOD FREQUENCY
ANALYSIS
In the literature, it has been noted that the frequency and magnitude of extreme flood
events are likely to rise in the near future due to climate change (IPCC, 2007; BOM, 2012).
This may have implications for typical flood frequency analysis, which assumes that the
‘stationarity assumption’ is valid (Khaliq et al., 2006). Researchers in non-stationary flood
frequency analysis in different parts of the world have questioned the validity of the
traditional assumption of stationarity in flood risk estimation (e.g. Franks and Kuczera,
2002; Cunderlik and Burn, 2003; Prudhomme et al., 2003; Micevski et al., 2006; Leclerc
and Ouarda, 2007; Pui et al., 2011 and Pall et al., 2011).
There have been a number of studies on the identification of trends in flood data. For
example, Olsen et al. (1999) reported positive trends in flood risk over time for gauged
sites within the Mississippi, Missouri, and Illinois River basins. Douglas et al. (2000)
found no evidence of trends in flood flows, but they did find evidence of upward trends in
low flows at a larger scale in the Midwest and at a smaller scale in Ohio, the north central
and the upper Midwest regions of the USA. Negative trends in total streamflow were most
common for the analysed Pennsylvanian streamflow time series from 1971 to 2001 due to
climate variability (Zhu and Day, 2005). Novotny and Stefan (2007) investigated the
streamflow records from 36 gauging stations in five major river basins of Minnesota,
USA, for trends and correlations using the Mann-Kendall (MK) test and the moving
averages method. They found that trends differ significantly from one river basin to
another, and became more prominent for shorter time windows. Pasquini and Depetris
(2007) presented an overview of flood discharge trends of South American rivers draining
the southern Atlantic seaboard. Juckem et al. (2008) found a decrease in annual flood peaks
for stream gauging stations in the Driftless Area of Wisconsin. Ishak et al. (2013) found
that about 15% of Australian stream gauging stations showed a trend, mainly downward in
eastern, south-eastern and south-western parts of Australia and upward in northern
Australia. It should be noted that these stations (about 15% of the total) were excluded
from this thesis, as it focuses on stationary regional flood frequency analysis. However, the
outcome of the study by Ishak et al. (2013), part of the PhD thesis of Elias Ishak (another
UWS PhD student), will be used to develop an adjustment factor to correct for non-
stationarity in regional flood frequency analysis.
2.10 SUMMARY
The estimation of flood behaviour (both within and beyond the credible limit of
extrapolation) at ungauged catchments is a common problem in hydrology. Regional flood
frequency analysis (RFFA) is commonly used to “transfer” flood characteristics
information from gauged catchments to ungauged ones. In this chapter, the literature
review has covered a range of currently applied RFFA techniques, with particular
emphasis on the quantile and parameter regression techniques (QRT and PRT).
The index flood method (IFM) has been discussed, which assumes that the probability
distribution of floods at sites within a homogeneous region is identical except for a site-
specific scaling factor. Recent studies have shown positive results for the L-moments-based
IFM in South-east Australia. However, due to the large heterogeneity of Australian
catchments, a method that does not strictly require homogeneous regions is more suitable
for Australia.
The probabilistic rational method (PRM) is currently recommended in South-east Australia
for design flood estimation in small to medium sized ungauged catchments. Though
considered a regional method and easy to apply, it has been criticised by researchers
because of the assumption of geographical contiguity in the mapping and application of the
runoff coefficients.
The QRT and PRT are multiple regression techniques which relate flood quantiles or the
parameters of a probability distribution (i.e. location, scale and shape, which are related to
the mean, standard deviation and skewness of the flood data) to catchment characteristics,
assuming linearity. The advantage of both the QRT and PRT is that no assumptions are
made about runoff coefficients, geographical contiguity or strictly homogeneous regions, as
with the PRM and IFM, respectively.
The preferred methodology for the QRT and PRT is to use generalised least squares
regression (GLSR), and in particular the Bayesian GLSR (BGLSR) approach, as further
improvements can be made with this method, such as accounting for sampling variability
and cross-correlated data and, more importantly, distinguishing between model error and
sampling error in the regional model. Furthermore, the Bayesian approach provides both a
measure of the precision of the model error variance, which the traditional GLSR lacks,
and a more reasonable description of the possible values of the model error variance in
cases where the model error variance is small compared to the sampling errors.
The concept of fixed regions and the region of influence (ROI) approach in RFFA has also
been discussed, and the advantages and disadvantages of each have been outlined. The past
studies presented have all shown improvements of the ROI over a fixed region approach.
Keeping this in mind, along with the high heterogeneity of Australian catchments, it makes
sense to combine and compare the QRT and PRT methods under a BGLSR framework
with the ROI approach.
Model validation is a very important part of RFFA especially in the area of hydrological
regression. The concept of model validation using split-sample, leave-one-out validation
(LOO) and Monte Carlo cross validation (MCCV) has been discussed. Past studies of the
application of LOO in hydrology have been presented, while studies relating to the use of
MCCV in other fields of science have also been discussed. Given the lack of application of
MCCV in hydrological regression, this thesis will compare both LOO and MCCV for
RFFA model validation in the state of New South Wales in Australia.
Finally, the estimation of large to rare and even extreme floods is of great importance for
hydrological design and risk assessment for large infrastructure. The statistical modelling
of large to rare floods remains a subject of active research. A brief history of large flood
estimation has been given in this chapter, along with recent studies and applications in this
field. This thesis will present a new large flood regionalisation model (LFRM) that pools
data more efficiently over a very large region. The LFRM will be combined with a newly
developed spatial dependence model that reflects the reduction in the net information
available for regional analysis using spatially dependent data.
CHAPTER 3: ADOPTED STATISTICAL TECHNIQUES FOR
REGIONAL FLOOD FREQUENCY ANALYSIS AND MODEL
VALIDATION
3.1 GENERAL
This chapter provides an overall description of the statistical techniques adopted in this
study for (i) regional flood frequency analysis (RFFA) in the range of 2 – 100 years
average recurrence intervals (ARIs) and (ii) validation of regional hydrological regression
models using leave-one-out (LOO) and Monte Carlo cross validation (MCCV) techniques.
At the outset, a flow chart (Figure 2) is provided which summarises the statistical
procedures and methodologies adopted in this thesis.
First, the log Pearson type 3 (LP3) distribution is described, which is fitted to the
observed annual maximum flood series data using a Bayesian parameter estimation
procedure. A discussion is then presented on the quantile and parameter regression
techniques (QRT and PRT); while the basic theory has been introduced in Chapter 2,
further emphasis is given here to the generalised least squares regression (GLSR), in
particular the Bayesian GLSR (BGLSR). The region of influence (ROI) approach is then
discussed in the light of its application with the BGLSR. Here the application of the ROI is
based on the minimisation of the predictive variance, which is applied with both the QRT
and PRT regression techniques. The methodology outlined here is intended to highlight the
assumptions involved and to give an overview of how to deal with the various uncertainties
associated with the data to obtain reliable flood estimates.
The next part of this chapter discusses the mathematical formulations used in the model
validation. The theory behind LOO and MCCV is presented (as outlined in Song Xu et al.,
2005), with an emphasis on the hydrologic regression analysis using ordinary least squares
regression (OLSR) and GLSR-based QRT.
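The two validation schemes can be sketched for an ordinary least squares model as follows. This is a minimal NumPy illustration of the ideas only; the function names, split fraction and number of splits are illustrative, not those used in the thesis.

```python
import numpy as np

def loo_errors(X, y):
    """Leave-one-out (LOO) validation: withhold each site in turn,
    refit the regression on the remaining sites, and predict the
    withheld observation."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[i] = y[i] - X[i] @ beta
    return errs

def mccv_errors(X, y, n_splits=200, test_frac=0.3, seed=1):
    """Monte Carlo cross validation (MCCV): repeatedly draw a random
    test subset, fit on the remainder, and collect the test errors."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(test_frac * n))
    errs = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(y[test] - X[test] @ beta)
    return np.concatenate(errs)
```

Unlike LOO, which yields exactly n prediction errors, MCCV averages over many random splits, which is the property that makes it attractive for model selection.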
Figure 2 Flow chart showing statistical techniques/methods adopted in this thesis.
[The flow chart sets out the major steps in the research: data collation (streamflow data –
filling missing data, trend analysis, rating curve error analysis, outlier testing – and
catchment and climatic characteristics data); at-site flood frequency analysis (LP3
distribution); regional flood frequency analysis for ARIs of 2 – 100 years (comparison of
Bayesian GLSR using QRT and PRT in a fixed region framework and in a region of
influence framework, with comparison and validation of methods using data from NSW,
VIC, QLD and TAS); application of LOO and MCCV for the validation of regional
regression models (case study for NSW using OLSR and GLSR); large flood
regionalisation model (LFRM) development using data from all of Australia (collation of
streamflow data, homogeneity testing, finding an appropriate distribution, at-site flood
frequency analysis with the GEV distribution, LFRM and spatial dependence model
development, and ungauged catchment application); and conclusions and
recommendations.]
3.2 AT-SITE FLOOD FREQUENCY ANALYSIS
3.2.1 BASICS OF AT-SITE FLOOD FREQUENCY ANALYSIS
At-site flood frequency analysis is an elementary step in any RFFA study. The primary
objective of flood frequency analysis is to relate the magnitude of extreme events to their
frequency of occurrence through the use of probability distributions (Chow et al., 1988).
Data observed over an extended period of time in a river system are analysed using
frequency analysis techniques. The data for flood frequency analysis are assumed to be
independent and identically distributed. The flood data are considered to be stochastic and
space and time independent. Furthermore, it is assumed that the flood data have not been
affected by natural or manmade changes in the hydrological regime or by climate change
(the stationarity assumption).
In flood frequency analysis, a unique relationship between a flood magnitude Q and the
corresponding ARI T is sought. The task is to extract information from a flow record to
estimate the relationship between Q and T. Three different models may be
considered for this purpose (Cunnane, 1989). These models are (1) annual maximum series,
(2) partial duration series or peaks over threshold series, and (3) time series model. For this
study, annual maximum series flood data is adopted.
Australian Rainfall and Runoff (ARR) (I. E. Aust, 1987) recommends the LP3 distribution
fitted with the method of moments (MOM) for use in at-site flood frequency analysis.
However, research has shown that a reassessment of the LP3 distribution/MOM estimation
approach is overdue (Wallis and Wood, 1985; Vogel et al., 1993). The recommendations
currently being prepared by the National Committee on Water Engineering include a
change from the current MOM to Bayesian fitting procedures to estimate the parameters of
the probability distributions used in at-site flood frequency analysis (Kuczera and Franks,
2005). Hence, the Bayesian method is adopted in this study to estimate the at-site flood quantiles.
The LP3 Bayesian procedure has shown satisfactory results in the study area as
demonstrated by Haddad and Rahman (2008) and Rahman et al. (2011).
3.2.2 FLIKE SOFTWARE FOR AT-SITE FFA
The at-site flood quantiles are estimated by FLIKE, which is a computer program
developed by Professor George Kuczera of the University of Newcastle. The FLIKE
program facilitates Bayesian analysis and the method of L-moments for parameter
estimation. The following section briefly describes the LP3 probability distribution.
Kuczera (1999b) presents how FLIKE obtains initial parameter values when searching for
the most probable values.
3.2.3 LOG PEARSON TYPE 3 (LP3) DISTRIBUTION
The LP3 probability model has the following probability distribution function (pdf):
f(log_e x | α, β, k) = (|k| / Γ(α)) [k(log_e x − β)]^(α−1) exp[−k(log_e x − β)]   (3.1)

for β ≤ log_e x when k > 0, or log_e x ≤ β when k < 0, and α > 0,

with Γ( ) being the gamma function.
The LP3 model has been widely accepted in practice because it consistently fits flood data
as well as, if not better than, other probability models. When the skew of log_e x is zero, the
model simplifies to the log normal. The model, however, is not well-behaved from an
inference perspective. Direct inference of the shape parameter α, the scale parameter k and
the location parameter β causes numerical problems. For example, when the skew of log_e x
is close to zero, the shape parameter tends to infinity. Experience indicates that it is
preferable to fit the first three moments (μ, σ and γ) of log_e x rather than α, β and k
(Kuczera, 1999b).
This parameterisation, based on the mean (μ), standard deviation (σ) and skewness (γ) of
log_e x, is often used to calculate the T-year event quantile:
log_e Q_T = μ + K_T(γ) σ   (3.2)

where K_T(γ) is the frequency factor, which is the T-year quantile of the Pearson type 3 (P3)
distribution with mean zero, standard deviation of one and skewness γ. The frequency
factor K_T can be approximated with sufficient accuracy by the Wilson-Hilferty
transformation (Kirby, 1972 and Rao and Hamed, 2000) for |γ| < 2:

K_T = (2/γ) { [1 + (γ z)/6 − γ²/36]³ − 1 }   (3.3)
where z is the T-year quantile of the standard normal distribution.
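Equations (3.2) and (3.3) can be sketched in code as follows. This is a minimal illustration that assumes AEP = 1/ARI for the annual maximum series; the function names are illustrative.

```python
import math
from scipy.stats import norm

def frequency_factor(ari_years, skew):
    """K_T(gamma) via the Wilson-Hilferty transformation
    (Equation 3.3), adequate for |skew| < 2."""
    z = norm.ppf(1.0 - 1.0 / ari_years)   # T-year standard normal quantile
    if abs(skew) < 1e-10:                 # zero skew: P3 reduces to the normal
        return z
    g6 = skew / 6.0
    return (2.0 / skew) * ((1.0 + g6 * z - g6 ** 2) ** 3 - 1.0)

def lp3_quantile(mu, sigma, gamma, ari_years):
    """T-year LP3 flood quantile (Equation 3.2):
    log_e(Q_T) = mu + K_T(gamma) * sigma."""
    return math.exp(mu + frequency_factor(ari_years, gamma) * sigma)
```

With zero skewness the frequency factor reduces to the standard normal quantile z, so the quantile collapses to the log normal case, mirroring the remark above about the LP3 simplifying to the log normal.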
Problems may arise, however, when the skew of log_e x is negative, in which case the
upper bound on flows can cause difficulties. FLIKE avoids this problem by starting the
search for the most probable parameters using log normal MOM parameters fitted to the
flood data. This strategy is quite robust because when the skew of log_e x is zero, the flow
bounds are pushed all the way to infinity. As a result, the search starts in a region of
parameter space well removed from the constraints imposed by the flow bounds.
Furthermore, a serious problem arises when the absolute value of the skew of log_e x
exceeds 2, that is, when α ≤ 1. When α > 1, the LP3 has a gamma-shaped density; however,
when α ≤ 1, the density changes to a J-shaped function. Indeed, when α = 1, the pdf
degenerates to that of an exponential distribution with scale parameter 1/k and location
parameter β. For α ≤ 1, the J-shaped density appears to be over-parameterised with three
parameters. The posterior density surface reveals extremely elongated contours which are
suggestive of an over-parameterised model. In such circumstances, it is pointless to use the
LP3 distribution, and it is suggested that either the generalised extreme value (GEV) or the
generalised Pareto (GPA) distribution be used as a substitute (Kuczera, 1999b). In this
study, no prior information was used in fitting the LP3 distribution. The parameters of the
LP3 distribution (i.e. the mean, standard deviation and skewness) were also extracted from
the FLIKE software for use in the RFFA.
3.3 THE CLASSICAL GLS REGRESSION PROBLEM
This section focuses on the basic generalised least squares regression (GLSR) model and
discusses the classical assumptions for this procedure. Subsequent sections recast the
analysis of the GLSR model in a Bayesian framework following Reis (2005) and Reis et al.
(2005). Streamflow data, be it annual maximum or partial duration series data sets, can be
used to derive an empirical relationship between catchment/climatic characteristics
variables and the hydrologic statistic of interest. For instance, catchment area and design
rainfall intensity may be used to estimate hydrologic characteristics at a site, such as the
mean annual flow, the 10-year or 100-year peak flow, or the shape parameter of a
theoretical probability distribution, such as the log-space skewness coefficient (γ) used to fit
an LP3 distribution, or the shape parameter (κ) of a GEV or GPA distribution.
The GLSR model assumes that the quantity of interest yi at a given site i can be described
by a linear function of catchment/climatic characteristics (or a transformation thereof) with
an additive error. In matrix notation, the model is represented by:

y = Xβ + ε   (3.4)

where X is an (n × k) matrix of catchment characteristics augmented by a column of ones, β
is the vector of regression parameters that must be estimated, and ε is an (n × 1) vector
of random errors for each of the n sites used in the regression, assumed to be normally
distributed with zero mean and a covariance matrix of the form:

E(εεᵀ) = σ²Ω   (3.5)
wherein σ² is the error variance and Ω is a positive definite symmetric matrix
(Johnston, 1972; Rencher, 2000; Koop, 2005). Different choices of the matrix Ω allow one
to make different assumptions regarding the nature of the model errors. If Ω is equal to the
identity matrix I, the problem is homoscedastic, and the GLSR model reduces to OLSR.
Uncorrelated errors with different variances at different sites can be described using a
matrix Ω with different variances on the diagonal and zeros off the diagonal. In this case,
the GLSR model in Equation (3.4) reduces to the weighted least squares regression
(WLSR) model. In the more general case, Ω is defined in such a way that it reflects both
heteroscedasticity and correlation among the residuals.
According to the Gauss-Markov-Aitken theorem, when Ω is known, the minimum
variance unbiased estimator for β does not depend on σ² and is given by (Rao and
Toutenburg, 1999 and Koop, 2005):

β̂_GLS = (XᵀΩ⁻¹X)⁻¹ XᵀΩ⁻¹y   (3.6)
The equation above defines the GLSR estimator, denoted by β̂_GLS. Note that the
subscript GLS is sometimes omitted for brevity. The unbiased estimate of σ² is given by:

σ̂²_GLS = (y − Xβ̂_GLS)ᵀ Ω⁻¹ (y − Xβ̂_GLS) / (n − (k + 1))   (3.7)

with sampling covariance matrix:

var(β̂_GLS) = σ²(XᵀΩ⁻¹X)⁻¹   (3.8)

The GLSR estimator is also the best linear unbiased estimator (BLUE) in the class of
linear estimators. Since the matrix Ω is unknown in practice, an estimator, usually denoted
by Ω̂, needs to be used.
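The classical GLSR estimator of Equations (3.6) – (3.8) can be sketched with NumPy as follows. A minimal illustration, assuming X carries the column of ones so that the degrees of freedom are n minus the number of columns.

```python
import numpy as np

def gls_fit(X, y, Omega):
    """Classical GLS estimator (Equations 3.6 - 3.8):
    beta = (X' Omega^-1 X)^-1 X' Omega^-1 y.
    With Omega = I this reduces to OLSR; a diagonal Omega gives WLSR."""
    Oi = np.linalg.inv(Omega)
    XtOiX = X.T @ Oi @ X
    beta = np.linalg.solve(XtOiX, X.T @ Oi @ y)   # Equation (3.6)
    resid = y - X @ beta
    n, p = X.shape                                # p = k + 1 with intercept
    sigma2 = (resid @ Oi @ resid) / (n - p)       # Equation (3.7)
    cov_beta = sigma2 * np.linalg.inv(XtOiX)      # Equation (3.8)
    return beta, sigma2, cov_beta
```

Passing the identity matrix for Omega recovers the OLSR fit, which is a quick way to check an implementation against standard least squares output.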
3.3.1 GLSR, THE STEDINGER AND TASKER MODEL
Stedinger and Tasker (1985, 1986) developed a GLSR model for regional hydrologic
regression. The important difference from OLSR and the classical GLSR model of the
form given by Equation (3.4) lies in the development and partition of the covariance matrix
of the errors. The GLSR model of Stedinger and Tasker (1985) assumes that the total error
results from two sources: model errors δ_i, which are assumed to be independently
distributed with zero mean, E(δ_i) = 0, and common variance:

Cov(δ_i, δ_j) = δ²  for i = j;  Cov(δ_i, δ_j) = 0  for i ≠ j   (3.9)

and sampling errors η that arise due to the fact that the actual values of y_i are unknown and
only estimates of the quantities of interest are available.
Therefore, Equation (3.4) becomes (following Reis et al., 2005):

ŷ = y + η = Xβ + δ + η = Xβ + ε   (3.10)

where η is the sampling error in the sample estimators. Thus, the regression model errors ε_i
are a combination of: (i) the time-sampling error in the sample estimators ŷ_i of y_i and (ii)
the underlying model error δ_i (lack of fit). The total error ε has zero mean and covariance
matrix:

Λ = E[εεᵀ] = δ²I + Σ(ŷ)   (3.11)

where Σ(ŷ) is the covariance matrix of the sampling errors in the sample estimators (such
as the flood quantiles or the parameters of the LP3 distribution – see Equation (3.2)). Time-
sampling errors in the estimators of the y_i are usually correlated among sites because flows
at nearby sites are driven by similar hydrological mechanisms (e.g. meteorology).
Reasonably accurate estimation of the sampling covariance matrix in the GLSR is very
important and is vital to the solution of the GLSR equations. More details about the
construction of Σ(ŷ) for flood quantiles and statistics are given in section 3.5 and can be
read in Stedinger and Tasker (1985 and 1986).
In this regional framework, δ² can be viewed as a heterogeneity measure. Madsen et al.
(1997 and 2002) showed that the regional average GLSR estimator is a general extension
of the record-length-weighted average commonly applied in the index flood method;
however, the record-length-weighted average estimator neglects inter-site correlation and
regional heterogeneity (Stedinger et al., 1993 and Stedinger and Lu, 1995).
The GLSR estimator of β is given by:

β̂_GLS = [XᵀΛ(δ̂²)⁻¹X]⁻¹ XᵀΛ(δ̂²)⁻¹ŷ   (3.12)

The sampling covariance matrix thus becomes:

var(β̂_GLS) = [XᵀΛ(δ̂²)⁻¹X]⁻¹   (3.13)

The model error variance δ² arises from an imperfect model and is a measure of the
precision of the true regression model. Unfortunately, the model error variance is not
known and needs to be estimated. Stedinger and Tasker (1986) proposed a MOM estimator
where δ̂² can be obtained by iteratively solving Equation (3.12) along with the generalised
residual mean square error (MSE) equation given by Equation (3.14):
(ŷ − Xβ̂_GLS)ᵀ [δ̂²I + Σ(ŷ)]⁻¹ (ŷ − Xβ̂_GLS) = n − (k + 1)   (3.14)

In some situations, the sampling covariance matrix explains all the variability observed in
the data, which means that the left-hand side of Equation (3.14) will be less than n − (k + 1)
even if δ̂² is zero. In these circumstances, the MOM estimator of the model error variance
is generally taken to be zero (Stedinger and Tasker, 1985; 1986). Alternative methods for
estimating the model error variance by maximum likelihood (ML) estimation can be seen
in Kuczera (1983a) and Rencher (2000).
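A sketch of the MOM iteration follows: the model error variance is increased until the generalised residual mean square of Equation (3.14) falls to n − (k + 1), re-estimating β from Equation (3.12) at each trial value. The bisection search and the argument names are illustrative choices for this sketch, not the exact scheme of Stedinger and Tasker (1986); Sigma must be positive definite.

```python
import numpy as np

def mom_model_error_variance(X, y_hat, Sigma, tol=1e-8):
    """MOM estimate of the model error variance delta^2: find
    delta^2 >= 0 such that the generalised residual mean square
    (Equation 3.14) equals n - (k + 1), refitting beta via
    Equation (3.12) at each trial value."""
    n, p = X.shape                       # p = k + 1 with intercept column

    def gmse(d2):
        Li = np.linalg.inv(d2 * np.eye(n) + Sigma)
        beta = np.linalg.solve(X.T @ Li @ X, X.T @ Li @ y_hat)
        r = y_hat - X @ beta
        return r @ Li @ r

    target = n - p
    if gmse(0.0) <= target:              # sampling error explains everything
        return 0.0
    lo, hi = 0.0, 1.0
    while gmse(hi) > target:             # gmse decreases as d2 grows
        hi *= 2.0
    while hi - lo > tol:                 # bisection on the bracketed root
        mid = 0.5 * (lo + hi)
        if gmse(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The early return implements the convention noted above: when the sampling covariance matrix alone explains the observed variability, the MOM estimate is taken as zero.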
Based on Monte Carlo simulations, Stedinger and Tasker (1986) showed that the MOM
model error variance procedure provides faster and more robust results, since no
assumptions are made about the distribution of the residuals, and is less biased when the
true model error variance is moderate to large (usually the case for flood quantile and mean
flood estimation). Stedinger and Tasker (1986) also showed, from their simulation study
using various cross-correlations among concurrent flows (0, 0.3 and 0.6), that for small
δ̂² the ML estimates were much more accurate. In fact, the ML estimator of δ² always had a
smaller MSE than the MOM estimator. If the regional regression analysis exhibits a small
model error variance δ̂², which is the case when sampling errors dominate the regional
analysis (e.g. with the regionalisation of the shape parameters of probability distributions,
i.e. skewness estimators), the ML procedure should be preferred to the MOM estimator.
Bayesian analysis, which is based on the likelihood function, is also a good candidate in
these situations, and would address the bias concern because, on average over the prior,
Bayesian estimators are unbiased (Stedinger, 1983).
3.4 BAYESIAN METHODOLOGY
Reis et al. (2005) developed a Bayesian approach to estimate the regional model
parameters and showed that the Bayesian approach can provide a realistic description of the
possible values of the model error variance, especially in the case where the sampling error
tends to dominate the model error in the regional analysis (Madsen et al., 2002; Reis et al.,
2005 and Haddad et al., 2012). This thesis extends the work of Reis et al. (2005) and
applies the BGLSR to estimate the parameters and flood quantiles of the LP3 distribution
(see Equation (3.2)). The BGLSR is chosen as the desired framework because the current
GLSR model analysis methodology based on Tasker and Stedinger (1989) and Griffis and
Stedinger (2007) does not provide an estimate of the uncertainty in the estimated model
error variance of the flood quantiles and the first two moments of the LP3 distribution.
3.4.1 CLASSICAL BAYESIAN INFERENCE
In a Bayesian framework, the parameters of the model are considered to be random
variables whose pdf is to be estimated. The Bayesian approach combines the data with any
prior information (if available) about the parameters being estimated (see also section
2.4.3). This information is usually established from other data sets, previous studies or
specific knowledge about the behaviour of the system being analysed. Parameter estimation
is made through the posterior distribution, which is developed using Bayes’ rule (Zellner,
1991):

p(θ | I) = p(I | θ) ξ(θ) / ∫ p(I | θ) ξ(θ) dθ   (3.15)

Here, p(θ | I) is the posterior distribution of the parameter vector θ, p(I | θ) is the
likelihood function for the data I, and ξ(θ) is the prior distribution of θ. The denominator is
a normalising constant that ensures that the area under the posterior pdf equals one. Reis et
al. (2005) developed a Bayesian approach to estimate the regional model coefficient of the
log-space skewness and showed that the Bayesian approach can provide a realistic
description of the possible values of the model error variance. A particular advantage of the
Bayesian approach is that it provides a full posterior distribution of the parameters,
whereas classical methods usually give only a point estimate.
3.5 BAYESIAN GLS REGRESSION
3.5.1 APPROACH ADOPTED IN THIS STUDY FOR THE QUANTILE AND
PARAMETER REGRESSION TECHNIQUES
To regionalise the flood quantiles ($Q_T$), the sampling covariance matrix ($\Lambda$) of the LP3 distribution is required. Tasker and Stedinger (1989) and Griffis and Stedinger (2007, p. 84, Eq. 4) provide the approximate estimator of the components of the matrix $\Lambda$ of the LP3 distribution, which is given by:

$$\Lambda_{Q_T}(i,i) = \frac{\sigma_i^2}{n_i}\left[1 + \gamma_i K_i + \frac{K_i^2}{2}\left(1 + 0.75\,\gamma_i^2\right)\right] \quad \text{for } i = j$$

$$\Lambda_{Q_T}(i,j) = \frac{m_{ij}\,\rho_{ij}\,\sigma_i \sigma_j}{n_i n_j}\left[1 + \frac{\gamma_i K_i}{2} + \frac{\gamma_j K_j}{2} + \frac{K_i K_j}{2}\left(\rho_{ij} + 0.75\,\gamma_i \gamma_j\right)\right] \quad \text{for } i \neq j \qquad (3.16)$$
where $K$ is the standard LP3 frequency factor, $\gamma_i$ and $\gamma_j$ are the skews, $m_{ij}$ is the concurrent record length between sites $i$ and $j$, $\rho_{ij}$ is the lag-zero cross-correlation of flood peaks between sites $i$ and $j$, and $\sigma_i$ and $\sigma_j$ are the population standard deviations at sites $i$ and $j$, respectively. The skew and standard deviation in the matrix $\Lambda$ (Equation 3.16) are themselves subject to estimation uncertainty. In this study, to avoid correlation between the residuals and the fitted quantiles, the following methods are adopted:
(i) the inter site correlation between the concurrent annual maximum flood series
(ρij) is estimated as a function of the distance between sites i and j;
(ii) the standard deviations (of the logarithms of annual maximum flood series) σi
and σj are estimated using a separate OLSR and GLSR using the explanatory
variables used in the study (given in Chapter 4); and
(iii) the regional skew (of the logarithms of annual maximum flood series) is used in
place of the population skew as suggested by Tasker and Stedinger (1989).
The analysis above used the regional estimates of the standard deviation and skew obtained from the BGLSR. Detailed information on the covariance matrices associated with the standard deviation and skew can be found in Reis et al. (2005) and Griffis and Stedinger (2007); an overview is provided here.
It is necessary to carry out GLSR on the sample standard deviation and skew because both of these parameters have an associated estimation error; the approximation in Equation 3.16 should therefore be updated to reflect all the uncertainty associated with the sampling error in the quantile estimates. The needed estimator of the sampling covariance matrix for the standard deviation and skew is given in Equations 3.17 to 3.20. For the standard deviation:

$$\Lambda_{\sigma}(i,i) = 0.5\,\frac{\sigma_i^2}{n_i}\left(1 + 0.75\,\gamma_i^2\right) \quad \text{for } i = j$$

$$\Lambda_{\sigma}(i,j) = 0.5\,\frac{\rho_{ij}^2\, m_{ij}\,\sigma_i \sigma_j}{n_i n_j}\left(1 + 0.75\,\gamma_i \gamma_j\right) \quad \text{for } i \neq j \qquad (3.17)$$
The off-diagonal elements of the sampling covariance matrix for the skew coefficient
include the term Cov[gi, gj] which is the covariance between the two at-site skew
estimators gi and gj. This term is obtained from:
$$\mathrm{Cov}[g_i, g_j] = \rho_{g_i g_j}\sqrt{\mathrm{Var}[g_i]\,\mathrm{Var}[g_j]} \qquad (3.18)$$

where the cross-correlation $\rho_{g_i g_j}$ is estimated using the approximation developed by Martins and Stedinger (2002a):

$$\hat{\rho}_{g_i g_j} = \mathrm{Sign}\bigl(\hat{\rho}_{ij}\bigr)\, cf_{ij}\, \bigl|\hat{\rho}_{ij}\bigr|^{\kappa} \qquad (3.19)$$

wherein $cf_{ij} = m_{ij}/\sqrt{(m_{ij}+n_i)(m_{ij}+n_j)}$, $m_{ij}$ is the common record period and $n_i$, $n_j$ are the extra observation periods for stations $i$ and $j$, respectively. Values of $\kappa$ are tabulated by Martins and Stedinger (2002a). In addition, $\mathrm{Var}[g_i]$ and $\mathrm{Var}[g_j]$ are evaluated
using the following approximation derived by Griffis and Stedinger (2007):

$$\mathrm{Var}[g_i] = \left(\frac{6}{n_i} + a(n_i)\right)\left[1 + \left(\frac{9}{6} + b(n_i)\right)\gamma_i^2 + \left(\frac{15}{48} + c(n_i)\right)\gamma_i^4\right] \qquad (3.20)$$

wherein $a(n_i)$, $b(n_i)$ and $c(n_i)$ are corrections for small samples:

$$a(n_i) = -\frac{17.75}{n_i^{2}} + \frac{50.06}{n_i^{3}}$$

$$b(n_i) = \frac{3.92}{n_i^{0.3}} - \frac{31.41}{n_i^{0.6}} + \frac{34.86}{n_i^{0.9}}$$

$$c(n_i) = -\frac{7.31}{n_i^{0.59}} + \frac{45.90}{n_i^{1.18}} - \frac{86.50}{n_i^{1.77}} \qquad (3.21)$$
The regional skew $G_i$ (the mean of the at-site skews of Equation 3.22) is used in Equation 3.20 in place of the population skew $\gamma_i$ to avoid correlation between the residuals and the at-site estimates of the skew, wherein the at-site skew is

$$g_i = \frac{n_i}{(n_i-1)(n_i-2)}\sum_{t=1}^{n_i}\frac{(x_t - \bar{x})^3}{s^3} \qquad (3.22)$$
where xt is the logarithm of the annual maximum flows in the year t, and s is the sample
standard deviation of xt. Because the true values of skews at each site are unknown, the
regional mean of the skews is used in Equation 3.20.
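The small-sample machinery of Equations 3.19 to 3.21 is easy to sketch numerically. The following is an illustrative sketch only (numpy assumed; function names are hypothetical, and $\kappa$ must be supplied from the Martins and Stedinger (2002a) tables):

```python
import numpy as np

def skew_sampling_var(g, n):
    """Sampling variance of the log-space skew (Equations 3.20-3.21);
    g is the regional skew used in place of the population value and
    n the record length."""
    a = -17.75 / n**2 + 50.06 / n**3
    b = 3.92 / n**0.3 - 31.41 / n**0.6 + 34.86 / n**0.9
    c = -7.31 / n**0.59 + 45.90 / n**1.18 - 86.50 / n**1.77
    return (6.0 / n + a) * (1.0 + (9.0 / 6.0 + b) * g**2
                            + (15.0 / 48.0 + c) * g**4)

def skew_cross_corr(rho_ij, m_ij, n_i, n_j, kappa):
    """Cross-correlation of sample skews (Equation 3.19); kappa is the
    exponent tabulated by Martins and Stedinger (2002a)."""
    cf = m_ij / np.sqrt((m_ij + n_i) * (m_ij + n_j))
    return np.sign(rho_ij) * cf * abs(rho_ij) ** kappa
```

For a 50-year record and near-zero skew, the sampling variance is close to $6/n$, consistent with the leading term of Equation 3.20.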
For the parameter regression technique (PRT), GLSR is also adopted (Tasker and Stedinger, 1989; Griffis and Stedinger, 2007) using a Bayesian framework (Reis et al., 2005) to develop regression equations for the parameters of the LP3 distribution, i.e. the mean ($\mu$), standard deviation ($\sigma$) and skew ($\gamma$) of the logarithms of the annual maximum flood series. The regional values of standard deviation and skew were found based on Equations 3.18 to 3.22. The sampling covariance matrix for the mean flood ($\Lambda_{\mu}$) was obtained following Stedinger and Tasker (1986), which is given below:

$$\Lambda_{\mu}(i,i) = \frac{\sigma_i^2}{n_i} \quad \text{for } i = j$$

$$\Lambda_{\mu}(i,j) = \frac{\rho_{ij}\, m_{ij}\,\sigma_i \sigma_j}{n_i n_j} \quad \text{for } i \neq j \qquad (3.23)$$
3.5.2 ADOPTED BAYESIAN REGRESSION APPROACH – PRIOR FOR THE β
COEFFICIENTS
As discussed in section 3.4.1, in order to apply the Bayesian analysis to the regional regression problem in this study, one needs to define prior distributions for the $\beta$ coefficients and for the model error variance.
With the Bayesian approach, it is assumed here that there is no prior information on any of
the β coefficients; thus, a multivariate normal distribution with mean zero and a large
variance (e.g. greater than 100) is used as a prior for the regression coefficients as
suggested by Reis et al. (2005). This prior is considered to be almost non-informative,
which produces a pdf that is generally flat in the region of interest.
A multivariate normal distribution prior is given by:

$$\xi(\boldsymbol{\beta}) = (2\pi)^{-(k+1)/2}\,|P|^{1/2}\exp\!\left[-0.5\,(\boldsymbol{\beta}-p)^{T}P\,(\boldsymbol{\beta}-p)\right] \qquad (3.24)$$

wherein $\boldsymbol{\beta}$ has dimension $k+1$, and Equation 3.24 has mean vector $p$ and precision matrix $P$. Zellner (1971) notes that the precision can be represented by the reciprocal of the variance. Zellner (1971) and Congdon (2001) also suggest that a two-parameter gamma distribution can be used to represent the prior information.
The likelihood function for the data as suggested by Reis et al. (2005) is considered to be a
multivariate normal distribution, so that:
$$L(\boldsymbol{\beta}, \sigma_{\delta}^2 \mid \mathbf{y}) = (2\pi)^{-n/2}\,|\Lambda|^{-1/2}\exp\!\left[-0.5\,(\mathbf{y}-X\boldsymbol{\beta})^{T}\Lambda^{-1}(\mathbf{y}-X\boldsymbol{\beta})\right] \qquad (3.25)$$

where the covariance matrix $\Lambda$ is defined in Equation 3.11, $n$ is the number of sites in the region, $\mathbf{y}$ is the vector with the sample values of the hydrologic statistic of interest (i.e. mean flood, flood quantile etc.), and $X$ is the matrix of explanatory variables (catchment area, design rainfall intensity etc.).
3.5.3 ANALYTICAL SOLUTION TO BAYESIAN APPROACH FOR THE
POSTERIOR OF THE MODEL ERROR VARIANCE
To compute the normalising constant in Equation 3.15 it is often useful to use Markov
Chain Monte Carlo (MCMC) algorithms such as the Metropolis-Hastings or Gibbs sampler
algorithms (e.g. Kuczera and Parent, 1998; Micevski and Kuczera, 2009 and Reis et al.,
2003). These algorithms are usually adopted for computationally intense problems, which
really depend on the dimension or complexity of the model being analysed. Given that the
dimension of this problem is relatively straight forward, it can be solved more easily using
the quasi-analytical approximation of the marginal posterior of the model error variance as
discussed by Kitanidis (1986) and Reis et al. (2003 and 2005). Below a brief overview of
equations and steps involved are outlined.
In simpler cases, it is possible to integrate the joint posterior of $\sigma_{\delta}^2$ and $\boldsymbol{\beta}$ over the possible values of $\boldsymbol{\beta}$ to obtain numerically the marginal posterior of $\sigma_{\delta}^2$, except for the normalising constant, and hence:

$$f(\sigma_{\delta}^2 \mid I) \propto \int L(\sigma_{\delta}^2, \boldsymbol{\beta} \mid I)\,\xi(\sigma_{\delta}^2, \boldsymbol{\beta})\, d\boldsymbol{\beta} \qquad (3.26)$$

where
$L(\sigma_{\delta}^2, \boldsymbol{\beta} \mid I)$ is the likelihood function and $\xi(\sigma_{\delta}^2, \boldsymbol{\beta})$ is the joint prior for $\sigma_{\delta}^2$ and $\boldsymbol{\beta}$. The likelihood function is approximated by a multivariate normal distribution with the covariance matrix ($\Lambda$) given by Equation 3.11:

$$L(\sigma_{\delta}^2, \boldsymbol{\beta} \mid I) = (2\pi)^{-n/2}\,|\Lambda|^{-1/2}\exp\!\left[-0.5\,(\mathbf{y}-X\boldsymbol{\beta})^{T}\Lambda^{-1}(\mathbf{y}-X\boldsymbol{\beta})\right] \qquad (3.27)$$

When the exponential prior for $\sigma_{\delta}^2$ is used with a truly non-informative uniform prior for $\boldsymbol{\beta}$, the joint prior becomes:

$$\xi(\sigma_{\delta}^2, \boldsymbol{\beta}) \propto \lambda\, e^{-\lambda \sigma_{\delta}^2} \qquad (3.28)$$

The posterior distribution for $\sigma_{\delta}^2$ alone can then be found by evaluating:

$$f(\sigma_{\delta}^2 \mid I) \propto e^{-\lambda \sigma_{\delta}^2} \int |\Lambda|^{-1/2} \exp\!\left[-0.5\,(\mathbf{y}-X\boldsymbol{\beta})^{T}\Lambda^{-1}(\mathbf{y}-X\boldsymbol{\beta})\right] d\boldsymbol{\beta} \qquad (3.29a)$$

The expression above can be rewritten in terms of the GLS estimator $\hat{\boldsymbol{\beta}}$ as:

$$f(\sigma_{\delta}^2 \mid I) \propto e^{-\lambda \sigma_{\delta}^2} \int |\Lambda|^{-1/2} \exp\!\left\{-0.5\left[(\mathbf{y}-X\hat{\boldsymbol{\beta}})^{T}\Lambda^{-1}(\mathbf{y}-X\hat{\boldsymbol{\beta}}) + (\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})^{T} X^{T}\Lambda^{-1}X\,(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})\right]\right\} d\boldsymbol{\beta} \qquad (3.29b)$$

Now, the first term inside the brackets is not a function of $\boldsymbol{\beta}$ and hence can be taken outside the integral. The integration then becomes:

$$\int \exp\!\left[-0.5\,(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})^{T} X^{T}\Lambda^{-1}X\,(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})\right] d\boldsymbol{\beta} = (2\pi)^{(k+1)/2}\,\bigl|X^{T}\Lambda^{-1}X\bigr|^{-1/2} \qquad (3.30)$$

Therefore, the posterior distribution of $\sigma_{\delta}^2$ is proportional to:

$$f(\sigma_{\delta}^2 \mid I) \propto e^{-\lambda \sigma_{\delta}^2}\,|\Lambda|^{-1/2}\,\bigl|X^{T}\Lambda^{-1}X\bigr|^{-1/2} \exp\!\left[-0.5\,(\mathbf{y}-X\hat{\boldsymbol{\beta}})^{T}\Lambda^{-1}(\mathbf{y}-X\hat{\boldsymbol{\beta}})\right] \qquad (3.31)$$
Equation 3.31 can then be used to calculate numerically the posterior pdf of the model error variance ($\sigma_{\delta}^2$), and its mean and variance, without the need for more sophisticated methods based on Monte Carlo simulation. The pdf of the model error variance may also be used to calculate the posterior distribution of the $\boldsymbol{\beta}$ coefficients using:

$$f(\boldsymbol{\beta} \mid I) = \int f(\boldsymbol{\beta} \mid \sigma_{\delta}^2, I)\, f(\sigma_{\delta}^2 \mid I)\, d\sigma_{\delta}^2 \qquad (3.32)$$

where $f(\boldsymbol{\beta} \mid \sigma_{\delta}^2, I)$ is a multivariate normal distribution. This result turns out to be a simple extension of the GLSR procedure developed in Stedinger and Tasker (1985). With efficient numerical integration procedures, the integral in Equation 3.32, as well as the mean and variance of $\sigma_{\delta}^2$, are easily computed.
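The numerical evaluation described above can be sketched on a uniform grid of candidate model error variances. This is a minimal illustrative sketch (numpy assumed, function name hypothetical), not the implementation used in the study:

```python
import numpy as np

def model_error_posterior(y, X, Sigma, lam, grid):
    """Evaluate the unnormalised posterior of the model error variance
    (Equation 3.31) on a uniform grid, then normalise numerically.
    Sigma is the sampling covariance matrix and lam the exponential-prior
    parameter."""
    logf = np.empty_like(grid)
    for k, s2 in enumerate(grid):
        Lam = s2 * np.eye(len(y)) + Sigma          # covariance, Equation 3.11
        Lam_inv = np.linalg.inv(Lam)
        XtLX = X.T @ Lam_inv @ X
        beta_hat = np.linalg.solve(XtLX, X.T @ Lam_inv @ y)  # GLS estimate
        r = y - X @ beta_hat
        # log of Equation 3.31, kept in log space for numerical stability
        logf[k] = (-lam * s2
                   - 0.5 * np.linalg.slogdet(Lam)[1]
                   - 0.5 * np.linalg.slogdet(XtLX)[1]
                   - 0.5 * r @ Lam_inv @ r)
    dens = np.exp(logf - logf.max())
    dx = grid[1] - grid[0]
    dens /= dens.sum() * dx                        # normalise to unit area
    post_mean = (grid * dens).sum() * dx           # posterior mean of s2
    return dens, post_mean
```

The posterior mean returned here is the quantity used later as the model error variance term in the predictive error variance.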
3.5.4 PRIORS FOR THE PARAMETERS AND THE QUANTILES OF THE LP3
DISTRIBUTION
It is well known that no model is perfect; any model that approximates a phenomenon will have error associated with it, hence the model error variance ($\sigma_{\delta}^2$) should be strictly positive: a model error variance of zero is highly unlikely in practice. It is also suspected, based on previous studies, that the model error variance for the regional skew model will be modest; this is especially the case when sampling error dominates the model error (or when the true model error variance is small compared to the sampling errors) (Reis, 2005; Reis et al., 2005). For the mean flood, standard deviation and flood quantiles, the model error variance tends to dominate the regional analysis. In this case a zero or negative value for the model error variance is highly unlikely and a strict informative prior may not be required. However, it is known that the model error variance in this case may suffer bias if it is estimated by a MOM estimator (Equation 3.14) (Stedinger and Tasker, 1986). This
may introduce further uncertainty into the regional model; hence a Bayesian estimator may
be attractive also in this case. A Bayesian estimator of the model error variance (Equation
3.31) as discussed above may be used to safeguard against uncertain model error variances,
as adopted in this study. Further details can be found in Reis et al. (2005) and Micevski and
Kuczera (2009). In summary, the Bayesian estimator offers a better way of dealing with the
model error variance and quantifying associated uncertainty about it.
The inverse-gamma distribution has been used in the past as it is a conjugate prior for normal regression problems. However, for the GLSR model described by Stedinger and Tasker (1986) its use may not be attractive: the inverse-gamma is a heavy right-hand-tailed distribution, and as such it can assign reasonably large probabilities to big variances when compared to other distributions, such as those with exponential tails. To avoid these problems, an exponential distribution is used for the prior. The exponential distribution, because of its thin right-hand tail, is considered more consistent with what are believed to be the likely values of the model error variances for regional regression models. It also has a non-zero pdf at zero, which allows the data, represented by the likelihood function, to provide information about the error variance near zero. The exponential pdf is:
$$\xi(\sigma_{\delta}^2) = \lambda\, e^{-\lambda \sigma_{\delta}^2}, \quad \sigma_{\delta}^2 \geq 0 \qquad (3.33)$$
Reis et al. (2005) provide a detailed discussion of the choice of a prior for the model error variance when regionalising the skew. For the regionalisation of skew, we employed $\lambda = 6$ (i.e. a prior mean of the model error variance of 1/6) following Reis et al. (2005), hence:

$$\xi(\sigma_{\delta}^2) = 6\, e^{-6 \sigma_{\delta}^2}, \quad \sigma_{\delta}^2 \geq 0 \qquad (3.34)$$
To derive the prior distribution for the standard deviation, mean flood and flood quantiles of the LP3 distribution, we used an informative one-parameter exponential distribution where the residual error variance estimate taken from the OLSR is used as the prior mean of the model error variance. For example, if the residual error variance ($\sigma_{OLS}^2$) from the OLSR is 0.12, we take the inverse of this value, i.e. $1/0.12 = 8.33$. Hence, the prior distribution of $\sigma_{\delta}^2$ is an exponential distribution with mean equal to $1/8.33$, therefore $\lambda = 8.33$:

$$\xi(\sigma_{\delta}^2) = 8.33\, e^{-8.33 \sigma_{\delta}^2}, \quad \sigma_{\delta}^2 \geq 0 \qquad (3.35)$$

It should be made clear that the parameter $\lambda$ can have varying influences on the estimated coefficients of the regional regression model and on the estimated model error variances; as such, we choose $\lambda$ values that are likely to be close to the real values (i.e. taking the OLSR results for $\lambda$).
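The prior construction in Equation 3.35 can be checked numerically. A small sketch (numpy assumed; the value 0.12 is the worked example above):

```python
import numpy as np

# Equation 3.35: the OLSR residual error variance (0.12 in the example
# above) is taken as the prior mean of the model error variance, so the
# exponential-prior parameter is its reciprocal, lambda = 1/0.12 = 8.33.
sigma2_ols = 0.12
lam = 1.0 / sigma2_ols

def prior(s2):
    """Exponential prior pdf for the model error variance (Equation 3.33)."""
    return lam * np.exp(-lam * s2)

# numerical check: the prior mean recovered by integration is ~0.12
s2 = np.linspace(0.0, 3.0, 120001)
prior_mean = np.sum(s2 * prior(s2)) * (s2[1] - s2[0])
```

The recovered prior mean equals the OLSR residual error variance, confirming the intended calibration of $\lambda$.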
3.6 SELECTING PREDICTOR VARIABLES
This section describes the approach adopted for selecting the predictor variables that should
be included in the prediction equations (regression models). The approach for selecting
predictor variables used in this study provides improvements over current methods used to
justify model selection in the BGLSR framework. Provided below is a discussion on the
BGLSR statistics that guided the model selection.
We use a procedure similar to forward stepwise regression utilising all the sites for each
state (separate regression for each state) and initially adopting just a constant term in the
regression equation. The model error variance and its standard error are noted. We then add
predictor variables starting with area followed by different combinations of other variables.
In all, 16 different combinations of predictor variables were used for the mean, standard
deviation and skew models, while 25 combinations were trialled for the flood quantile
models. Further information regarding the preparation and extraction of the catchment
characteristics can be found in Chapter 4 of this thesis.
3.6.1 AVERAGE VARIANCE OF PREDICTION
In RFFA, the objective is to make prediction at both gauged and ungauged sites; hence a
statistic appropriate for evaluation of model selection is the variance of prediction, which in
many cases depends on the explanatory variables at both gauged and ungauged sites.
Hence, Tasker and Stedinger (1989) suggested the use of the average variance of prediction (AVP). Using a GLSR model, one can predict a hydrological statistic on average over a new region; this becomes the average variance of prediction for a new site, AVP_new, which is made up of the average model error and the average sampling error (Tasker and Stedinger, 1986). For a BGLSR analysis, according to Gruber et al. (2007):

$$AVP_{new} = E\bigl[\sigma_{\delta}^2 \mid \mathbf{y}\bigr] + \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i\, \mathrm{Var}\bigl[\hat{\boldsymbol{\beta}} \mid \mathbf{y}\bigr]\, \mathbf{x}_i^{T} \qquad (3.36)$$
Also, if the prediction is for a site that was used in the estimation of the regional regression model, the measure of prediction AVP_old requires an additional term:

$$AVP_{old} = E\bigl[\sigma_{\delta}^2 \mid \mathbf{y}\bigr] + \frac{1}{n}\sum_{i=1}^{n}\left[\mathbf{x}_i\, \mathrm{Var}\bigl[\hat{\boldsymbol{\beta}} \mid \mathbf{y}\bigr]\, \mathbf{x}_i^{T} - 2\,\mathbf{x}_i\bigl(X^{T}\Lambda^{-1}X\bigr)^{-1}X^{T}\Lambda^{-1}\Lambda\,\mathbf{e}_i\right] \qquad (3.37)$$
where ei is a unit column vector with 1 at the ith row and 0 otherwise.
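Equation 3.36 splits the prediction variance into a model error term and a parameter uncertainty term; a minimal sketch of that computation follows (numpy assumed; function names are illustrative, and the coefficient covariance is approximated here by the GLS form rather than the full posterior):

```python
import numpy as np

def beta_covariance(X, Sigma, s2):
    """Covariance of the GLS coefficient estimator, (X' Lam^-1 X)^-1,
    evaluated at a given model error variance s2 (an approximation to the
    posterior Var[beta|y])."""
    Lam = s2 * np.eye(X.shape[0]) + Sigma
    return np.linalg.inv(X.T @ np.linalg.inv(Lam) @ X)

def avp_new(X, s2_mean, var_beta):
    """Average variance of prediction for a new site (Equation 3.36):
    posterior mean model error variance plus the average of
    x_i Var(beta) x_i' over the n sites."""
    n = X.shape[0]
    param_term = np.mean([X[i] @ var_beta @ X[i] for i in range(n)])
    return s2_mean + param_term
```

Because the parameter uncertainty term is non-negative, AVP_new is always at least the posterior mean of the model error variance.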
3.6.2 BAYESIAN AND AKAIKE INFORMATION CRITERIA
In this study both the Akaike and Bayesian information criteria are used as statistics for
model selection. The Akaike information criterion (AIC) developed by Akaike (1974) is
given by Equation 3.38. It is calculated based on the definition given by Greene (2003), where SSTO is the total sum of squared deviations about the mean corrected for the sampling error, $n$ is the sample size for the regression, $k$ is the number of predictor variables in the fitted regression model, and $R_{GLS}^2$ is the pseudo coefficient of determination used in BGLSR (explained in section 3.6.4). The first term on the right-hand side of Equation 3.38 essentially measures the true lack of fit, while the second term measures model complexity, which is related to the number of predictor variables. AIC is given by:

$$AIC = \frac{SSTO}{n}\left(1 - R_{GLS}^2\right)\exp\!\left[\frac{2(k+1)}{n}\right] \qquad (3.38)$$

In practice, after the computation of the AIC for all of the competing models, one selects the model with the minimum AIC value, $AIC_{min}$.
The Bayesian information criterion (BIC) (Schwarz, 1978) is very similar to AIC, but is developed in a Bayesian framework and is calculated based on the definition given by Greene (2003):

$$BIC = \frac{SSTO}{n}\left(1 - R_{GLS}^2\right) n^{(k+1)/n} \qquad (3.39)$$

The BIC penalises models with higher values of $k$ more heavily than does AIC. Since SSTO and $R_{GLS}^2$ depend on the sample size, the competing models can be compared using AIC and BIC only if fitted using the same sample, as done in this study. As with the AIC, one selects the model with the minimum BIC value, $BIC_{min}$.
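The two criteria differ only in how they penalise extra predictors; a minimal sketch (numpy assumed, with illustrative SSTO and $R_{GLS}^2$ values) makes the comparison concrete:

```python
import numpy as np

def aic_bic(ssto, r2_gls, n, k):
    """AIC and BIC in the form of Equations 3.38-3.39 (after Greene, 2003);
    ssto is the corrected total sum of squares, r2_gls the pseudo R-squared,
    n the number of sites and k the number of predictors."""
    lack_of_fit = (ssto / n) * (1.0 - r2_gls)
    aic = lack_of_fit * np.exp(2.0 * (k + 1) / n)   # Equation 3.38
    bic = lack_of_fit * n ** ((k + 1) / n)          # Equation 3.39
    return aic, bic
```

For two candidate models with the same fit, the one with more predictors scores worse on both criteria, and (for $n > e^2$) the BIC penalty grows faster with $k$ than the AIC penalty, as stated above.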
3.6.3 BAYESIAN PLAUSIBILITY VALUE
The significance of the regression coefficient values ($\beta$) obtained was evaluated using the Bayesian plausibility value (BPV) developed by Reis et al. (2005) and Gruber et al. (2007); the mathematical derivations can be found in those references. The BPV allows one to perform the equivalent of a classical hypothesis P-value test within a Bayesian framework. The advantage of the BPV is that it uses the posterior distribution of each parameter, which also reflects the prior. The BPV in this study was carried out at the 5% significance level.
3.6.4 COEFFICIENT OF DETERMINATION
The traditional coefficient of determination (R2) measures the degree to which a model
explains the variability in the dependent variable. It uses the partitioning of the sum of
squared deviations and associated degrees of freedom to describe the variance of the signal
versus the model error. Traditionally for OLSR, the Total-Sum-of-Squared deviations
about the mean (SST) is divided into two separate terms, the Sum-of-Squared Errors
explained by the regression model (SSR) and the residual Sum-of-Squared Errors (SSE),
where SST = SSR + SSE.
Reis et al. (2005) proposed a pseudo coefficient of determination ($R_{GLS}^2$) appropriate for use with the GLSR. For the traditional $R^2$, both the SSE and SST include sampling and model error variances, and therefore this statistic can grossly misrepresent the true power of the GLSR model to explain the actual variation in the $y_i$. Hence, for the GLSR a more appropriate pseudo coefficient of determination is defined by:

$$R_{GLS}^2 = 1 - \frac{\hat{\sigma}_{\delta}^2(k)}{\hat{\sigma}_{\delta}^2(0)} = \frac{\hat{\sigma}_{\delta}^2(0) - \hat{\sigma}_{\delta}^2(k)}{\hat{\sigma}_{\delta}^2(0)} \qquad (3.40)$$

where $\hat{\sigma}_{\delta}^2(k)$ and $\hat{\sigma}_{\delta}^2(0)$ are the model error variances when $k$ and no explanatory variables are used, respectively. Here, $R_{GLS}^2$ measures the improvement of a GLSR model with $k$ explanatory variables against the estimated error variance for a model without any explanatory variables. If $\hat{\sigma}_{\delta}^2(k) = 0$, then $R_{GLS}^2 = 1$ as it should be, even though the model is not perfect, because the variance of the at-site estimator is still not zero (the sampling error variance remains greater than zero).
3.6.5 OTHER MODEL SELECTION CRITERIA
A predictor variable having an estimated coefficient (other than the constant) that was less
than two posterior standard deviations away from zero was rejected (this shows the relative
importance of the predictor) (Hackelbusch et al., 2009). In all the cases the simplest model
was preferred.
3.7 FORMATION OF REGIONS
The fixed region BGLSR analysis as above identifies the catchment characteristics that best
account for heterogeneity by minimising the model error variance. However, it is assumed
that some spatial structure may remain in the model error residuals. With this in mind, the model error variance within possible sub-regions of the fixed region should therefore be less than the fixed-region model error variance. This is investigated further in this
study (see Chapter 5). It is in this framework that the ROI approach was applied to the
parameters (i.e. mean, standard deviation and skew) and flood quantiles of the LP3
distribution to further reduce the heterogeneity unaccounted for by the fixed region BGLSR
model.
The ROI approach in this study uses the distance between sites as the distance metric (i.e.
geographic proximity). We apply the ROI within the state boundaries (see Figure 3) in the
following way. For the ROI within the state boundaries, for the first iteration, the 15
nearest stations to the site of interest are selected and a regional BGLSR is performed and
the predictive variance (Equations 3.14 and 3.31) is noted. The initial number of stations for the first iteration was chosen because smaller ROIs caused numerical instability in fitting the BGLSR. The second
iteration proceeds with the next five closest stations being added to the ROI and repeating
the regression. This procedure terminates when all the sites in the region have been
included in the ROI. The ROI for the site of interest is then selected as the one which yields
the lowest predictive variance.
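The iteration just described can be sketched as follows. This is an illustrative sketch only (numpy assumed): `fit_bglsr` is a hypothetical stand-in for the full BGLSR fit, which here only needs to return the predictive error variance for a candidate set of sites.

```python
import numpy as np

def roi_select(distances, fit_bglsr, n_start=15, step=5):
    """Region-of-influence selection: start with the n_start nearest sites,
    enlarge the region by `step` sites at a time, and keep the region that
    yields the lowest predictive error variance."""
    order = np.argsort(distances)              # sites ranked by proximity
    sizes = list(range(n_start, len(order) + 1, step))
    if sizes and sizes[-1] != len(order):
        sizes.append(len(order))               # finish with all sites included
    best_var, best_sites = np.inf, None
    for size in sizes:
        sites = order[:size]
        pev = fit_bglsr(sites)                 # predictive error variance
        if pev < best_var:
            best_var, best_sites = pev, sites
    return best_sites, best_var
```

In the study the same loop is run with the site of interest left out, so that each site is treated as ungauged.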
The ROI approach presented here is fundamentally different to that of Tasker et al. (1996) in the following respects:

(i) it seeks to minimise the regression model’s predictive error variance, rather than selecting or assuming a fixed number of sites that minimise a distance metric in catchment characteristic space;

(ii) the ROI criterion of Tasker et al. (1996) cannot guarantee minimum predictive variance; and

(iii) the selection of sites that are minimally different in catchment characteristic space may result in greater uncertainty in the estimated regression coefficients.
It should be noted that the predictive error variance has two terms associated with it:
(i) the model error variance; and
(ii) the predictive variance arising from uncertainty in the estimated regression
coefficients.
The first term is the posterior expected value of the model error variance estimated using
the approach of Reis et al. (2005), see section 3.5.3 and Equation 3.31 – this is always non-
zero and guards against situations where the most likely value of the model error variance
is zero. The second term effectively guards against the ROI favouring fewer sites to minimise the model error variance; indeed, as the number of sites is reduced, any reduction in the model error variance is likely to be offset by an increase in uncertainty in the estimated regression coefficients ($\boldsymbol{\beta}$). Figure 3 illustrates the ROI approach as adopted in this study.
Figure 3 Example of ROI techniques applied in this study
3.8 REGRESSION DIAGNOSTICS
The assessment of the regional regression model is made by using a number of statistical
diagnostics such as a pseudo–coefficient of determination (as discussed already in section
3.6.4) and the standard error of prediction. An analysis of variance for the BGLSR model is
also undertaken to examine which portion of the total error (sampling or model) dominates
the regional analysis for both the fixed region and ROI methods. This study also uses
Cook’s distance, the standardised residuals and Z-score analysis in a GLSR framework to identify outlier sites; the absence of outliers in the regression diagnostics indicates the overall adequacy of the regional model. These statistics are described below.
3.8.1 STANDARD ERROR OF PREDICTION
If the standardised residuals have a nearly normal distribution (to be determined in the residual analysis, see below), the standard error of prediction in percent (SEP) (Tasker et al., 1986) for the true flood quantile or parameter estimator is given by:

$$SEP(\%) = 100\left[\exp\bigl(AVP_{new}\bigr) - 1\right]^{0.5} \qquad (3.41)$$
3.8.2 RESIDUAL ANALYSIS
Important to this study is the assessment of the adequacy of the regional regression model
in its application to ungauged catchments. The measure of the raw residual (ri), which is the
difference between the sample (at-site estimate) and regional estimates of the LP3
parameter or flood quantile can be assessed initially for major deviations. However,
interpreting the raw residual may be misleading as the raw residual has three sources of
uncertainty: model error, sampling error and uncertainty due to regression coefficients
being unknown.
In this study, the standardised residual rsi is used, which is the raw residual divided by its
standard deviation defined as the square root of the sum of the predictive variance of the
LP3 parameter or flood quantile and its sampling variance given by the appropriate
diagonal element of the sampling covariance matrix. This yields the definition:

$$r_{si} = \frac{r_i}{\left[\lambda_i + \mathbf{x}_i\bigl(X^{T}\Lambda^{-1}X\bigr)^{-1}\mathbf{x}_i^{T}\right]^{0.5}} \qquad (3.42)$$

where $\lambda_i$ is the $i$th diagonal element of $\Lambda$.
To assess the adequacy of the estimated LP3 parameters and flood quantiles from the QRT
and PRT, standardised residuals, referred to as Z-scores are used. For site i and a given
ARI, the Z-score is:

$$Z_{ARI,i} = \frac{\log_e\!\left(Q_{ARI,i}\right) - \log_e\!\left(\hat{Q}_{ARI,i}\right)}{\sqrt{\sigma_{ARI,i}^2 + \hat{\sigma}_{ARI,i}^2}} \qquad (3.43)$$
Here the numerator is the difference between the at-site flood quantile and the regional flood quantile (estimated from the developed prediction equation), and the denominator is the square root of the sum of the variances of the at-site ($\sigma_{ARI,i}^2$) and regional ($\hat{\sigma}_{ARI,i}^2$) flood quantiles in natural logarithm space.
It is reasonable to assume that the errors in the two estimators are independent, because $Q_{ARI,i}$ is an unbiased estimator of the true quantile based upon the at-site data, whereas the error in $\hat{Q}_{ARI,i}$ is mostly due to the failure of the best regional model to estimate accurately the true at-site flood quantile. The use of log space makes the difference approximately normally distributed and hence enables the use of standard statistical tests.
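Equation 3.43 reduces to a one-line computation; a minimal sketch (numpy assumed, function name illustrative):

```python
import numpy as np

def z_score(q_at_site, q_regional, var_at_site, var_regional):
    """Z-score of Equation 3.43: difference of the natural-log quantiles
    scaled by the combined standard error of the two (independent)
    estimates."""
    return ((np.log(q_at_site) - np.log(q_regional))
            / np.sqrt(var_at_site + var_regional))
```

For example, an at-site quantile 20% above the regional estimate with log-space variances 0.04 and 0.05 gives a Z-score of about 0.61, well within the range expected under a standard normal distribution.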
3.8.3 COOK’S DISTANCE
Tasker and Stedinger (1989) extended measures such as Cook’s distance ($D_i$) from the OLSR to the GLSR case. Tasker and Stedinger (1989) and Reis et al. (2005) suggested that the influence of a site is large when $D_i$ is greater than $4/n$, where $n$ is the number of sites in the region. Further details on the mathematical derivation of Cook’s distance can be found in the noted references.
3.9 EVALUATION STATISTICS
A LOO cross validation procedure is applied to assess the performance of the different
RFFA methods. The site that is left out in building the model is in effect being treated as an
ungauged site. Since all the sites in the database are being treated as ungauged for ROI this
automatically satisfies the LOO validation approach. The following performance statistics
are calculated from the fixed and ROI analysis: absolute (abs) median relative error (REr)
in % over n sites, the relative root mean square error (RMSEr) in % and the average ratio
(rr) of the predicted flood quantile to observed flood quantile as described below.
$$\mathrm{RE_r} = \mathrm{Median}\left[\,\mathrm{abs}\!\left(\frac{Q_{pred_i} - Q_{obs_i}}{Q_{obs_i}}\right)\times 100\right] \qquad (3.44)$$

$$\mathrm{RMSE_r} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{Q_{pred_i} - Q_{obs_i}}{Q_{obs_i}}\right)^2}\times 100 \qquad (3.45)$$

$$r_r = \frac{1}{n}\sum_{i=1}^{n}\frac{Q_{pred_i}}{Q_{obs_i}} \qquad (3.46)$$
where $Q_{obs_i}$ is the observed flood quantile at site $i$ obtained from at-site flood frequency analysis estimated using FLIKE (Kuczera, 1999a), $Q_{pred_i}$ is the predicted flood quantile at site $i$ from the regional prediction equation (QRT and PRT), and $n$ is the number of sites
in the region. The REr (%) and RMSEr (%) provide an indication of the overall accuracy of
the regional model. The model with minimum REr is always preferred. For RMSEr the
smallest value between the two competing models with the same number of parameters is
generally preferred. It should be noted here that both the Qpred and Qobs values have
uncertainties associated with them, and in particular, the Qobs values are subject to errors
due to the annual maximum flood record length, rating curve extrapolation errors, selection
of probability distribution and associated parameter estimation procedures. The above error
statistics thus give some guidance about the relative accuracy of the method and should not
be taken as the true uncertainty associated with the method.
The average value of the Qpred/Qobs (rr) gives an indication of the degree of bias (i.e.
systematic over- or under estimation), where a value of 1 indicates good average agreement
between the Qpred and Qobs as both of these values are essentially random variables. An rr
value in the range of 0.5 to 2 may be regarded as ‘desirable (D)’, a value smaller than 0.5
may be regarded as ‘gross underestimation (U)’, and a value greater than 2 may be
regarded as ‘gross overestimation (O)’. It should be mentioned that these are arbitrary limits, set as a relatively wide band in recognition of the significant uncertainty in the estimates from RFFA methods in Australia; they therefore provide only a reasonable guide to the relative accuracy of the methods as far as their practical application is concerned.
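The three statistics and the D/U/O classification above can be sketched together (numpy assumed; function name illustrative):

```python
import numpy as np

def evaluation_stats(q_pred, q_obs):
    """RE_r and RMSE_r (both in %) and the mean ratio r_r of Equations
    3.44-3.46, plus the arbitrary D/U/O classification used in this study."""
    rel = (q_pred - q_obs) / q_obs               # relative errors
    re_r = 100.0 * np.median(np.abs(rel))        # Equation 3.44
    rmse_r = 100.0 * np.sqrt(np.mean(rel ** 2))  # Equation 3.45
    r_r = np.mean(q_pred / q_obs)                # Equation 3.46
    label = 'D' if 0.5 <= r_r <= 2.0 else ('U' if r_r < 0.5 else 'O')
    return re_r, rmse_r, r_r, label
```

For example, predictions that over- and under-estimate by 10% at two of three sites give RE_r = 10%, r_r = 1.0 and the classification ‘D’.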
3.10 REGIONAL UNCERTAINTY WITH FLOOD QUANTILE
ESTIMATION
For the ARIs considered in this study (2 – 100 years), this section considers only the uncertainty in the regional flood quantile estimation based on the BGLSR-PRT and ROI, where the ROI without state borders is used. In the annual
maximum series models, the mean flood ($\mu$), standard deviation of floods ($\sigma$) and skewness ($\gamma$) are considered as regional variables (i.e. the regional at-site estimates of the LP3 parameters). The regional T-year event estimate for the PRT is given by:

$$Q_{T,\mathrm{Re}} = \exp\!\left(\mu_{\mathrm{Re}} + K_T\,\sigma_{\mathrm{Re}}\right) \qquad (3.47)$$

where the subscript ‘Re’ refers to the site where the regional estimation is made and $K_T$ is the LP3 frequency factor for ARI $T$, which depends on the skew $\gamma_{\mathrm{Re}}$.
The uncertainty associated with the regional T-year event estimate can be found by
combining the BGLSR method with multivariate normal distribution (MVN). The
advantage of using the BGLSR is that it provides an estimate of the annual maximum series
hydrologic statistics and their associated posterior variances. The posterior variance reflects
the uncertainty related to the residual regional heterogeneity (model error variance) as well
as sampling variability corrected for inter-site correlation while also reflecting the prior
used. Thus the model error variance term is found by Equation 3.31 for the regional
estimate of the $\mu$, $\sigma$ and $\gamma$ parameters. These regional values, along with the MVN, can be used to quantify the uncertainty in the flood quantile estimates by deriving the 90% confidence limits.
3.10.1 THE MULTIVARIATE NORMAL DISTRIBUTION
The MVN distribution model extends the univariate normal distribution model to fit vector
observations. An $np$-dimensional vector of random variables

$$\mathbf{Y} = \left(Y_1, Y_2, \ldots, Y_{np}\right) \qquad (3.48)$$

is said to have a multivariate normal distribution if its density function $f(\mathbf{Y})$ is of the form

$$f(\mathbf{Y}) = f(Y_1, \ldots, Y_{np}) = (2\pi)^{-np/2}\left|\Sigma\right|^{-1/2}\exp\!\left[-\tfrac{1}{2}\left(\mathbf{Y}-\boldsymbol{\mu}_y\right)^{T}\Sigma^{-1}\left(\mathbf{Y}-\boldsymbol{\mu}_y\right)\right] \qquad (3.49)$$

where $\boldsymbol{\mu}_y = (\mu_{y_1}, \ldots, \mu_{y_{np}})$ is the vector of means (in this case the regional at-site estimates of the hydrological statistics of interest) and $\Sigma$ is the variance-covariance matrix of the MVN distribution. This can also be written using the notation of Equation 3.50. The variances for use with the MVN distribution are taken from the BGLSR analysis (i.e. the posterior variances for each parameter estimate, see Figure 4).

$$\mathbf{Y} \sim N_{np}\left(\boldsymbol{\mu}_y, \Sigma\right) \qquad (3.50)$$
For the univariate case, when np = 1, (i.e. parameter of the LP3 distribution) the one-
dimensional vector Y =Y1 has the normal distribution with mean y and variance 2ˆ .
For the bivariate case, when np = 2, (i.e. and parameters of the LP3 distribution), Y =
(Y1, Y2) has the bivariate normal distribution with two-dimensional vector of means, y =
(y1, y2) and covariance matrix with the correlation (ρ) between the two random variables is
given by:
2
,
,
2
ˆˆˆ
ˆˆˆ
stdevstdevmeanstdevmean
stdevmeanstdevmeanmean
(3.51)
For the trivariate case, when np = 3 (i.e. the mean, standard deviation and skew parameters of the LP3 distribution),
\mathbf{Y} = (Y_1, Y_2, Y_3) has the trivariate normal distribution with three-dimensional vector of
means, \mu_y = (y_1, y_2, y_3), and covariance matrix (with correlations \rho between the three
random variables) given by:
\Sigma = \begin{bmatrix}
\hat{\sigma}^2_{mean} & \rho_{mean,stdev}\,\hat{\sigma}_{mean}\hat{\sigma}_{stdev} & \rho_{mean,skew}\,\hat{\sigma}_{mean}\hat{\sigma}_{skew} \\
\rho_{mean,stdev}\,\hat{\sigma}_{mean}\hat{\sigma}_{stdev} & \hat{\sigma}^2_{stdev} & \rho_{stdev,skew}\,\hat{\sigma}_{stdev}\hat{\sigma}_{skew} \\
\rho_{mean,skew}\,\hat{\sigma}_{mean}\hat{\sigma}_{skew} & \rho_{stdev,skew}\,\hat{\sigma}_{stdev}\hat{\sigma}_{skew} & \hat{\sigma}^2_{skew}
\end{bmatrix} \qquad (3.52)
By using Equations (3.50 and 3.52), 10,000 values are generated for each of the mean,
standard deviation and skew of the LP3 distribution (see Equation 3.47).
The T-year flood quantile is then estimated (see Equation 3.47) such that there will be 10,000
values of Q_T^{Re}. The Q_T^{Re} values are then ranked in ascending order of magnitude and the 5th, 50th and
95th percentile values are extracted. Figure 4 provides a good summary of the important
steps involved in deriving the confidence limits.
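The simulation steps above can be sketched in a few lines of code. All numerical values below (parameter estimates, posterior standard errors and correlations) are illustrative placeholders, and the Pearson III frequency factor of Equation 3.47 is approximated here with the Wilson–Hilferty transformation rather than the exact fitting procedure used in the thesis:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical regional LP3 parameter estimates (mean, std dev, skew of the
# log10 flows), posterior standard errors and parameter correlations.
mu = np.array([2.0, 0.35, -0.1])
se = np.array([0.05, 0.03, 0.15])
rho = np.array([[1.0, 0.3, 0.1],
                [0.3, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
cov = np.outer(se, se) * rho            # covariance matrix as in Equation 3.52

rng = np.random.default_rng(1)
sims = rng.multivariate_normal(mu, cov, size=10000)   # 10,000 parameter sets

def frequency_factor(z, g):
    """Wilson-Hilferty approximation of the Pearson III frequency factor."""
    if abs(g) < 1e-8:
        return z
    return (2.0 / g) * ((1.0 + g * z / 6.0 - g * g / 36.0) ** 3 - 1.0)

z = norm.ppf(1.0 - 1.0 / 100.0)         # standard normal deviate, T = 100 years
kt = np.array([frequency_factor(z, g) for g in sims[:, 2]])
log_q = sims[:, 0] + kt * np.abs(sims[:, 1])   # log10 quantile for each set
q = 10.0 ** log_q

# 90% confidence limits (5th and 95th percentiles) and the median
q5, q50, q95 = np.percentile(q, [5, 50, 95])
```

The 5th and 95th percentile values of the 10,000 simulated quantiles give the lower and upper 90% confidence limits.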
Figure 4 Use of multivariate normal distribution to develop confidence limits by Monte Carlo
simulation
[Flowchart: mean ~ N(y_1, \sigma_{mean}), standard deviation ~ N(y_2, \sigma_{stdev}) and skew ~ N(y_3, \sigma_{skew}), with parameter correlations \rho_{mean,stdev}, \rho_{mean,skew} and \rho_{stdev,skew}; simulate 10,000 sets of mean, standard deviation and skew from the multivariate normal distribution; obtain 10,000 values of Q_T^{Re} from Equation 3.47; order the 10,000 Q_T^{Re} values in ascending order and extract the 5th and 95th percentile values.]
3.11 VALIDATION OF REGIONAL HYDROLOGICAL REGRESSION
MODELS – METHODOLOGY
3.11.1 THE HYDROLOGICAL REGRESSION PROBLEM
Suppose we have a dataset of n sites in a region with k potential catchment characteristics
(independent variables) x_{i1}, x_{i2}, …, x_{ik} and a response variable y_i (i = 1, 2, …, n),
which can be a flood statistic (e.g. the mean flood) or a flood quantile. The relationship between the
response and independent variables is often assumed to be linear. A few
assumptions are also made on the data for hydrological regression; for instance, that the data are
representative of the regression relationship to be developed and that the random errors are
homoscedastic (see section 3.3.1). The OLSR and GLSR based
regional regression model can be written in matrix notation as:
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (3.53)
where \mathbf{y} = (y_1, y_2, \ldots, y_n)^T is the response vector of flood quantiles or the flood statistic of
interest (the superscript 'T' denotes the transpose), \mathbf{X} = (x_{i,j}) (i = 1, 2, \ldots, n; j = 1, 2, \ldots, k) is
an [n \times k] matrix, \boldsymbol{\beta} is a k-dimensional vector of unknown regression coefficients to be
estimated, and \boldsymbol{\varepsilon} is an n \times 1 random error vector assumed to have mean zero and covariance
matrix defined by:
E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T) = \sigma^2 \, \Omega \qquad (3.54)
wherein \sigma^2 is the model error variance and \Omega is a positive definite symmetric matrix
(Johnston, 1972; Rencher, 2000; Koop, 2005). Different choices of the \Omega matrix allow one
to make different assumptions regarding the nature of the model errors. If \Omega is equal to the
identity matrix \mathbf{I}_n, the problem is homoscedastic, and the GLSR model reduces to OLSR.
In more general cases, when \Omega is defined to reflect heteroscedasticity and correlation among
residuals, GLSR is a more reasonable estimator.
Stedinger and Tasker (1985, 1986) developed a GLSR model for regional hydrologic
analysis. The important difference from OLSR is the development and partition of the
covariance matrix of the errors. The GLSR model assumes that the total error results from
two sources: model errors \delta_i that are assumed to be independently distributed with mean
zero, E(\delta_i) = 0, and a common variance:

\mathrm{Cov}(\delta_i, \delta_j) = \begin{cases} \sigma^2_\delta & i = j \\ 0 & i \neq j \end{cases} \qquad (3.55)
and sampling errors that arise due to the fact that the actual values of y_i are unknown and only
estimates of the quantities of interest are available.
Therefore, Equation 3.53 becomes (following Reis et al., 2005):
\hat{\mathbf{y}} = \mathbf{y} + \boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\delta} + \boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (3.56)
where \boldsymbol{\eta} is the sampling error in the sample estimators. Thus, the regression-model errors \varepsilon_i
are a combination of: (i) time-sampling error \eta_i in the sample estimators \hat{y}_i of y_i and (ii)
underlying model error \delta_i (lack of fit). The total error has mean zero and covariance
matrix:

E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T) = \Lambda = \sigma^2_\delta \mathbf{I} + \Sigma(\hat{\mathbf{y}}) \qquad (3.57)
where \Sigma(\hat{\mathbf{y}}) is the covariance matrix of the sampling errors in the sample estimators. Time-
sampling errors in estimators of the y_i's are usually correlated among sites because flows at
nearby sites are driven by similar hydrological mechanisms (e.g. meteorology). Reasonably
accurate estimation of the sampling covariance matrix is therefore vital to the solution of the
GLSR equations. More details about the construction of \Sigma(\hat{\mathbf{y}}) for flood quantiles and
statistics can be found in Stedinger and Tasker (1985, 1986), Reis et al. (2005), Griffis and
Stedinger (2007) and section 3.5.1 of this thesis.
In both regression approaches (OLSR and GLSR), the true values of the regression
coefficients are unknown. To be able to determine the best possible model, it is necessary
to decide which of the different \beta's should be included in the model. In typical ordinary
stepwise regression this is equivalent to selecting the best set of independent variables for a
regression model. Consider the case that uses GLSR (see Equation 3.56), where a more
parsimonious model may be true such that:
\mathbf{y} = \mathbf{X}_\alpha \boldsymbol{\beta}_\alpha + \boldsymbol{\varepsilon}_\alpha \qquad (3.58)
where \alpha is a subset of \{1, 2, \ldots, k\}, \mathbf{X}_\alpha denotes the matrix whose columns are those of
\mathbf{X} indexed by the integers in \alpha, \Lambda_{R,C} denotes the sampling covariance matrix
whose rows and columns are those of \Lambda indexed by \alpha, and \boldsymbol{\beta}_\alpha denotes the vector
whose components are those of \boldsymbol{\beta} indexed by the integers in \alpha. Hence there
are in total 2^k - 1 possible different models of the form represented by Equation 3.58. For the
model of the form of Equation 3.56, if \alpha is selected, the model is fitted based on Equation
3.58:
\hat{\boldsymbol{\beta}}_{GLSR} = \left( \mathbf{X}_\alpha^T \Lambda_{R,C}^{-1} \mathbf{X}_\alpha \right)^{-1} \mathbf{X}_\alpha^T \Lambda_{R,C}^{-1} \hat{\mathbf{y}} \qquad (3.59)
where \hat{\boldsymbol{\beta}}_{GLSR} is an estimate of \boldsymbol{\beta}_\alpha; when \Sigma(\hat{\mathbf{y}}) = 0, Equation 3.59 reduces to the OLSR
solution. Further information on this can be found in Stedinger and Tasker (1985, 1986).
Equation 3.59 is solved by employing an iterative procedure using a MOM estimator (see
Stedinger and Tasker, 1985 and section 3.3.1 of this thesis).
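A minimal sketch of this iterative procedure is given below, assuming a supplied (here diagonal) sampling covariance matrix and using a bisection search for the model error variance: the weighted residual sum of squares is driven to its expected value n − k, in the spirit of the MOM estimator of Stedinger and Tasker (1985). Function and variable names are illustrative, not the thesis implementation:

```python
import numpy as np

def glsr_fit(X, y, sampling_cov, tol=1e-8, max_iter=200):
    """Sketch of the iterative GLSR estimator.

    The model error variance sigma2 is found by a method-of-moments style
    search: it is adjusted until the weighted residual sum of squares
    equals its expected value, n - k."""
    n, k = X.shape

    def beta_for(sigma2):
        lam = sigma2 * np.eye(n) + sampling_cov          # Equation 3.57
        lam_inv = np.linalg.inv(lam)
        b = np.linalg.solve(X.T @ lam_inv @ X, X.T @ lam_inv @ y)  # Eq. 3.59
        r = y - X @ b
        return b, float(r @ lam_inv @ r)

    lo, hi = 0.0, float(np.var(y)) + 1.0
    b, wss = beta_for(lo)
    if wss <= n - k:          # sampling error alone explains the scatter
        return b, 0.0
    mid = hi
    for _ in range(max_iter):  # bisection on sigma2 (wss decreases in sigma2)
        mid = 0.5 * (lo + hi)
        b, wss = beta_for(mid)
        if abs(wss - (n - k)) < tol:
            break
        if wss > n - k:
            lo = mid
        else:
            hi = mid
    return b, mid
```

When the sampling covariance is set to zero the weights become equal and the fit collapses to OLSR, mirroring the remark after Equation 3.59.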
After determining the model for use in hydrological regression, the overall performance of
the model is then evaluated according to its prediction ability, e.g. how well the model can
predict flood quantiles for ungauged catchments. In most regression applications, the mean
squared error of prediction (MSEP) of a model represents its prediction ability; in practice,
the lower the MSEP, the better the prediction ability of the model.
3.11.2 MODEL SELECTION BY MONTE CARLO CROSS VALIDATION
In statistical inference the term cross validation is widely used with a broad meaning;
to avoid ambiguity, this study uses the general term 'validation' associated with
either LOO or MCCV. In general, validation attempts to select a model based on the
prediction ability of the model (Breiman et al., 1984; Zhang, 1993; Burman, 1989). For
general validation, when \alpha is selected, the n datasets (denoted by S) are split into two parts.
The first part (the calibration set), denoted by S_c (with corresponding submatrix \mathbf{X}_{S_c} and
subvector \mathbf{y}_{S_c}), contains n_c datasets for fitting the model.
The second part (the validation set), denoted by S_v (with corresponding submatrix \mathbf{X}_{S_v} and
subvector \mathbf{y}_{S_v}), contains n_v = n - n_c datasets for validating the model. There are in total
\binom{n}{n_v} different forms of split samples. For each of the split samples, the model is fitted to
the n_c datasets of the calibration set S_c (Equation 3.59) to obtain \hat{\boldsymbol{\beta}}_{S_c,GLSR}. The datasets in the
validation set (which are essentially gauged catchments) are treated as if they were
ungauged. The fitted model can then predict the response vector \mathbf{y}_{S_v}:

\hat{\mathbf{y}}_{S_v} = \mathbf{X}_{S_v} \hat{\boldsymbol{\beta}}_{S_c,GLSR} \qquad (3.60)
The average squared prediction error (ASPE) over all the datasets in the validation set is:

\mathrm{ASPE}(S_v; \alpha) = \frac{1}{n_v} \sum_{S_v} \left( \mathbf{y}_{S_v} - \hat{\mathbf{y}}_{S_v} \right)^2 \qquad (3.61)
Therefore, let S be the set whose elements are all of the validation sets corresponding
to the \binom{n}{n_v} different forms of sample split. The cross validation criterion with n_v
datasets left out for validation is defined as:

V_{n_v}(\alpha) = \frac{1}{\binom{n}{n_v}} \sum_{S_v \in S} \mathrm{ASPE}(S_v; \alpha) \qquad (3.62)
where V_{n_v}(\alpha) is calculated for every \alpha. Equation 3.62 serves as an approximation of
MSEP(\alpha) in the situation of finite samples. Although LOO validation can select a model
with bias b = 0 as n \to \infty, it can however include unnecessary additional independent
variables in the model. In this case the selected model is not the
most parsimonious, and over-fitting can increase the uncertainty in estimation. For general
validation it has been proven, under the conditions n_c \to \infty and n_v/n \to 1 (Shao, 1993), that
the probability for validation (with n_v datasets left out for validation) to choose the model
with the best prediction ability tends to one. In this framework V_{n_v}(\alpha) (Equation 3.62) is
asymptotically consistent; however, computing V_{n_v} over all \binom{n}{n_v} splits becomes infeasible for
large n_v. In such situations, MCCV is an easy and effective procedure.
For a selected \alpha, the dataset is randomly split into two parts, S_c(i) (of size n_c) and S_v(i) (of size
n_v), and the procedure is repeated N times. The repeated MCCV criterion is defined as:

\mathrm{MCCV}_{n_v}(\alpha) = \frac{1}{N n_v} \sum_{i=1}^{N} \left( \mathbf{y}_{S_v(i)} - \hat{\mathbf{y}}_{S_v(i)} \right)^2 \qquad (3.63)
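The MCCV criterion can be sketched as follows for an OLSR fit (the GLSR case would replace the least squares fit with Equation 3.59); the function and argument names are illustrative:

```python
import numpy as np

def mccv(X, y, nv, N=500, rng=None):
    """Monte Carlo cross validation (Equation 3.63) for an OLSR model.

    Repeatedly splits the n sites into a calibration set of size nc = n - nv
    and a validation set of size nv, fits by least squares on the calibration
    set, and averages the squared prediction error on the validation set."""
    rng = rng or np.random.default_rng()
    n = len(y)
    nc = n - nv
    total = 0.0
    for _ in range(N):
        idx = rng.permutation(n)
        cal, val = idx[:nc], idx[nc:]
        beta, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)
        resid = y[val] - X[val] @ beta
        total += resid @ resid
    return total / (N * nv)
```

In a model selection setting, the criterion is evaluated for each candidate subset of predictors and the subset with the smallest MCCV value is selected.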
3.11.3 ESTIMATING MSEP
In hydrological regression, the estimate of MSEP is generally based on a finite, often very
small, dataset. Here we mainly consider using the LOO or MCCV methods to
estimate MSEP. As noted in Efron (1986), an estimate of MSEP using the observed data
tends to underestimate the true MSEP for new future observations, since the data
have been used twice, both to fit the model and to check its accuracy. The results obtained
are therefore at best an optimistic estimate of the model's true prediction error. For
Equation 3.58 the LOO validation criterion is:
V_1(\alpha_1^*) = \min_\alpha V_1(\alpha) \qquad (3.64)

where V_1(\alpha) is obtained by Equation 3.62 (with n_v = 1) and \alpha_1^* denotes the optimal model index in
Equation 3.64. In all cases it should be mentioned that MSEP depends on the size of the
calibration data set. MCCV can also be utilised to make the prediction. However, since
MCCV uses only n_c datasets for calibration, it is not appropriate to use
\mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) to estimate the MSEP of the model fitted with all n datasets if n_v is large.
Let \alpha^*_{n_v} denote the optimal model index in Equation 3.63. The expected difference between
\mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) and the mean squared error of prediction for the selected model
is:
E\left[ \mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) \right] - \mathrm{MSEP}(\alpha^*_{n_v}) = O\!\left( \frac{n_v}{n \, n_c} \right) \qquad (3.65)
If a large portion of the dataset is left out for validation, the difference in Equation 3.65 need not
be small. In such cases, \mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) might be a poor estimate of \mathrm{MSEP}(\alpha^*_{n_v}) (Burman,
1989). In order to obtain a slight improvement in the accuracy of estimation, a correction term
is added to \mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) (Burman, 1989), given by:
\mathrm{CMCCV}_{n_v}(\alpha^*_{n_v}) = \mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) + \frac{1}{n} \left( \mathbf{y} - \mathbf{X}_{\alpha^*} \hat{\boldsymbol{\beta}}_{n,GLSR} \right)^2 - \frac{1}{N n} \sum_{i=1}^{N} \left( \mathbf{y} - \mathbf{X}_{\alpha^*} \hat{\boldsymbol{\beta}}_{S_c(i),GLSR} \right)^2 \qquad (3.66)
where \hat{\boldsymbol{\beta}}_{n,GLSR} in the second term is estimated based on all n catchments and \hat{\boldsymbol{\beta}}_{S_c(i),GLSR} in the
third term is estimated based on the n_c catchments in S_c(i) (i = 1, 2, \ldots, N). \mathrm{MCCV}_{n_v}(\alpha^*_{n_v})
indicates the average prediction ability of the model with n_c catchments and, as stated
above, it overestimates the MSEP of the model with n catchments. The second term in
Equation 3.66 is the average residual sum of squares of the model with n catchments. The
third term in Equation 3.66 is the average residual sum of squares and prediction error
of the model based on n_c catchments. The latter two terms combine the effects of the model
both with n_c and with n catchments.
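Burman's correction can be sketched for the OLSR case; the three accumulated quantities below correspond to the three terms of Equation 3.66, and the function name is illustrative:

```python
import numpy as np

def cmccv(X, y, nv, N=500, rng=None):
    """Sketch of Burman's (1989) corrected MCCV (Equation 3.66), OLSR case."""
    rng = rng or np.random.default_rng()
    n = len(y)
    nc = n - nv
    # Second term: residuals of the model fitted to all n catchments
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    r_full = y - X @ beta_full
    mccv_term, third_term = 0.0, 0.0
    for _ in range(N):
        idx = rng.permutation(n)
        cal, val = idx[:nc], idx[nc:]
        beta_c, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)
        r_val = y[val] - X[val] @ beta_c
        mccv_term += r_val @ r_val          # first term (MCCV)
        r_all = y - X @ beta_c              # split-fitted model on all n sites
        third_term += r_all @ r_all         # third term
    return (mccv_term / (N * nv)
            + r_full @ r_full / n
            - third_term / (N * n))
```

The correction subtracts the average full-sample residual sum of squares of the split-fitted models and adds back that of the full-sample fit, shrinking the pessimism of MCCV towards the n-catchment model.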
3.11.4 APPLICATION – USING SIMULATED DATA
Two Monte Carlo experiments are reported in this thesis, comparing both OLSR and
GLSR using LOO and MCCV. In the Monte Carlo simulation the following model is
considered:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i \qquad (3.67)
where y_i is the dependent variable, taken as the 20 year ARI flood quantile; the simulated
values are assumed to be independent normally distributed random variables estimated at i
= 1, 2, \ldots, 50, representing 50 stations (8 sites with 50 years of data, 8 sites with 40
years of data, 12 sites with 32 years, 12 sites with 25 years of data and 10 sites with 15
years of data; this corresponds to an average record length of 33 years, which is the average
record length for most Australian catchments). Based on a previous analysis of Australian
data, in the Monte Carlo simulation for the GLSR model the total error
\varepsilon_i was generated by taking \sigma^2_\delta to represent low to high random errors in the range of 0 to 1, i.e. N(0, \sigma_\delta) where \sigma_\delta
is 0.25 and 0.95. For the GLSR estimator we also need an estimate of the diagonals of
\Sigma(\hat{\mathbf{y}}) (the covariance matrix of sampling errors); for normally distributed y_i (see Equations 5
and 6, p. 1422, of Stedinger and Tasker, 1985) are adopted to generate the sampling
variance for each site in \Sigma(\hat{\mathbf{y}}). Furthermore, to estimate the off-diagonal elements of \Sigma(\hat{\mathbf{y}})
we also require estimates of the cross correlations (\rho_{ij}) between concurrent records (i.e.
\hat{\rho}(\hat{y}_i, \hat{y}_j)) in the region. In the Monte Carlo simulation we generated cross correlated data
for \rho_{ij} = 0.30 (modest constant cross correlation between sites) and \rho_{ij} = 0.70 (medium to
high constant correlation between sites).
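Generating errors with a constant pairwise cross correlation can be sketched with a Cholesky factor of the correlation matrix; the function below is an illustrative construction, not the exact generator used in the thesis:

```python
import numpy as np

def correlated_errors(n_sites, n_years, rho, sd, rng):
    """Draw n_years rows of n_sites errors with constant pairwise
    correlation rho and standard deviation sd."""
    corr = np.full((n_sites, n_sites), rho)
    np.fill_diagonal(corr, 1.0)            # unit diagonal
    chol = np.linalg.cholesky(corr)        # lower-triangular factor
    z = rng.standard_normal((n_years, n_sites))
    return sd * (z @ chol.T)               # rows now have correlation rho
```

Setting `rho` to 0.30 or 0.70 reproduces the two inter-site correlation scenarios described above.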
For OLSR, the errors \varepsilon_i are drawn from N(0, \sigma), where \sigma is taken as 0.2 and 1,
representing low level (smaller spread) and high level (larger spread) random errors
respectively. Here x_{ik} is the ith value of the kth variable x_k, and the values of
x_{ik} (k = 1, 2, 3; i = 1, 2, \ldots, 50) represent three catchment descriptors (i.e. independent variables)
sampled randomly from uniform and normal distributions: U[5, 1000] for
x_1, N(x_1, 0.21) for x_2 and U[2, 20] for x_3. The logarithms (base 10) of these descriptor
variables are used in the simulation. To make the simulation more meaningful we also
explore the influence of correlated descriptors (i.e. collinearity), which is very common in
hydrological regression. Collinearity is explored using LOO and MCCV with both
OLSR and GLSR. In this study, we allow x_2 to have a high degree of collinearity with x_1
(the correlation coefficient between x_1 and x_2 is taken to be 0.90, see above). In
the Monte Carlo simulation, all different combinations of x_1, x_2 and x_3 are considered and
the model with the best prediction ability is selected. The resulting regional models for OLSR
and GLSR are respectively:
\hat{y}_{OLSR,i} = 2.96 + 0.1402\, x_{1i} \qquad (3.68)

\hat{y}_{GLSR,i} = 2.95 + 0.1517\, x_{1i} \qquad (3.69)
The size of the validation sets is taken to be n_v = 15, 20, 25, \ldots, 45, and the number of
simulations is 500. In order to assess the obtained model, a further 2,000 datasets are
generated using the above procedure for the purpose of prediction. These datasets are used
to calculate the MSEP for the models selected by LOO, MCCV and the assumed true
model.
3.11.5 OBSERVED REGIONAL FLOOD DATA FROM NSW, AUSTRALIA
A total of 96 unregulated rural catchments are selected from New South Wales (see more
details in Chapter 4). The geographical distributions of these catchments are shown in
Figure 5. The catchment areas are considered to be small to medium sized (I. E. Aust.,
1987) ranging from 3 to 1000 km2 (mean: 353 km2 and median: 267 km2). The annual
maximum flood series record lengths range from 25 to 75 years (mean: 37 years, median:
34 years and standard deviation: 11.4 years). More information regarding the preparation of
the streamflow data can be found in Haddad et al. (2010a) and also Chapter 4 of this thesis.
The annual maximum flood series are assumed to follow the LP3 distribution for two
reasons: (i) the LP3 distribution is currently the recommended at-site flood frequency
probability model in ARR (I. E. Aust., 1987); and (ii) it has shown consistently better
results in past studies for Australian catchments (see e.g. Haddad et al., 2010a; Haddad et
al., 2012a; Haddad and Rahman, 2012). The LP3 distribution is fitted using a Bayesian
parameter fitting procedure (Kuczera, 1999a) for quantiles of ARIs of 10 and 100 years.
These two ARIs are chosen because they cover both the high and low sides of the flood
distribution.
To apply the GLSR to regionalise the flood quantiles, the sampling covariance matrix \Sigma(\hat{\mathbf{y}})
of the LP3 distribution is required. Tasker and Stedinger (1989) and Griffis and Stedinger
(2007) (p. 84, Equation 4; see also Equation 3.16 in this thesis) provide an approximate
estimator of the components of the \Sigma(\hat{\mathbf{y}}) matrix for the LP3 distribution. It should be mentioned
here that other distributions such as the GEV could have been adopted; however, this is unlikely to
affect the outcomes of the analysis. Furthermore, the LP3 distribution has generally been found to
outperform the GEV distribution for eastern Australia (Zaman et al., 2012). The
skew and standard deviation in the \Sigma(\hat{\mathbf{y}}) matrix are subject to estimation uncertainty. In
this study, to avoid correlation between the residuals and the fitted quantiles, the following
procedures are adopted:
(i) the inter-site correlation between the concurrent annual maximum flood series
(\rho_{ij}) is estimated as a function of the distance between sites i and j;
(ii) the standard deviations (of the logarithms of annual maximum flood series) \sigma_i
and \sigma_j are estimated using a separate OLSR with the predictor variables
used in the study (see below); and
(iii) the regional skew (of the logarithms of annual maximum flood series) is used in
place of the population skew, as suggested by Tasker and Stedinger (1989).
This analysis uses the regional estimates of the standard deviation and
skew obtained from the GLSR. Detailed information on the covariance matrices
associated with the standard deviation and skew can be found in Reis et al.
(2005), Griffis and Stedinger (2007) and Equations 3.17 and 3.18 – 3.22.
Twelve climatic and catchment characteristics variables were selected (more information
regarding the extraction and preparation of the catchment characteristics can be found in
Chapter 4 of this thesis). The predictor variables were log-transformed (base 10) and
centered around the mean for the regression analysis.
3.12 SUMMARY
A number of statistical techniques and formulations to be used in this thesis have been
presented in this chapter.
At the onset of this chapter, fitting the LP3 distribution to the observed flood data using a
Bayesian parameter fitting procedure has been presented. The GLSR procedure has then
been discussed both in its classical application and in hydrologic regression context to
derive regional regression equations relating flood quantiles to catchment and climatic
characteristics using both a QRT and PRT framework. The Bayesian GLSR (BGLSR)
regression procedure was discussed in more detail. The setting up of the residual error
covariance matrices with the BGLSR approach has also been discussed. This chapter has
also discussed the formation of regions in RFFA which has included both the fixed and
region of influence approaches.
The second part of this chapter discussed the mathematical formulations used in the model
validation in the context of hydrologic regression analysis using OLSR and GLSR. The
statistical framework for the numerical experimentation and practical application
demonstrating the use of LOO and MCCV in hydrologic quantile regression analysis has
also been discussed.
The next chapter will discuss the study areas and the different aspects of streamflow and
catchment characteristics data collation and preparation.
CHAPTER 4: STUDY AREA AND PREPARATION OF
STREAMFLOW AND CATCHMENT CHARACTERISTICS DATA
4.1 GENERAL
The assembly and preparation of streamflow data is an important step in any regional flood
frequency analysis (RFFA) study. This chapter describes various aspects of the streamflow
data collation adopted for this work e.g. selection of the study area, selection of stream
gauging sites, checking annual maximum streamflow data, filling gaps in the streamflow
data series, checking rating curve extrapolation errors associated with the streamflow data
series, checking for outliers in the data series and testing for any significant trends that
could undermine the purpose of flood frequency analysis.
Because this study is primarily concerned with developing regional prediction equations for
design flood estimation using both a quantile and a parameter regression technique, an
elementary step in a regional study such as this involves obtaining both climatic and
catchment characteristics data. Identifying the most relevant catchment characteristics is
difficult as there is no objective method for doing so; moreover, many catchment characteristics
are highly correlated, so including many of them in the model can cause problems
with the statistical analysis (such as introducing multicollinearity) without providing
any extra useful information.
Rahman (1997) indicated that there is no objective method for selecting catchment
characteristics; thus an initial selection of candidate characteristics should be based on an
evaluation of the success of catchment characteristics used in past studies. Rahman (1997)
considered in detail all possible climatic/catchment characteristics (referred to as catchment
characteristics henceforth) from over 20 previous studies to develop a reasonable starting
point.
Nevertheless, no general inference about the significance of a particular catchment
characteristic can be made from the fact that an investigator has found it to be significant,
since in a regional study such as this the dominant characteristics may vary from region to
region.
In the second part of this chapter, the catchment characteristics to be used in this thesis are
selected with the aim of developing a working database of catchment characteristics.
Initially the selection of candidate catchment characteristics is described in sufficient detail
and aspects of data collation/collection are presented later.
4.1.1 PUBLICATIONS
A Journal paper (ERA, rank B) has been published on the materials presented in this
chapter. This journal paper is given in Appendix A. The following is the reference of the
paper.
Haddad, K., Rahman, A., Weinmann, P.E., Kuczera, G. and Ball, J.E. (2010a).
Streamflow data preparation for regional flood frequency analysis: Lessons from south-east
Australia. Australian Journal of Water Resources, 14 (1), 17-32.
4.2 STUDY AREA
For this study, the Australian continent is selected as the study area. For flood quantile
estimation in the range of 2 – 100 years average recurrence interval (ARI), the quantile
and parameter regression techniques (QRT and PRT) in Bayesian generalised least
squares regression (BGLSR), fixed region and region of influence (ROI) frameworks are applied
in the states of Queensland (QLD), New South Wales (NSW), Victoria (VIC) and
Tasmania (TAS). The model validation case study makes use of the data from NSW, while
for the large flood analysis using the large flood regionalisation model (LFRM), 626
stations are used from all over the Australian continent, excluding the arid and semi arid
regions. The selected study area is shown in Figure 5.
Figure 5 Plot of the selected study area (i.e. NSW, VIC, QLD and TAS)
4.3 SELECTION OF CANDIDATE CATCHMENTS
The following factors and criteria were considered in making the initial selection of the
study catchments.
Catchment area: The proposed regionalisation study aims at developing prediction
equations for flood estimation in small to medium sized ungauged catchments. Since the
flood frequency behaviour of large catchments has been shown to significantly differ from
smaller catchments, the proposed method should be based on small to medium sized
catchments. ARR (I. E Aust, 1987) suggests an upper limit of 1000 km2 for small to
medium sized catchments, which seems to be reasonable and is adopted here. For larger
catchments, the flood frequency curves are generally flatter compared to those of smaller
catchments. Since the focus of a RFFA technique is design flood estimation for small ungauged
catchments, the use of very large catchments in the development of RFFA techniques is not
justified as per ARR (I. E. Aust., 1987).
Record length: The streamflow record at a stream gauging location should be long enough
to characterise the underlying probability distribution with reasonable accuracy. In most
practical situations, streamflow records at many gauging stations in a given study area are
not long enough and hence a balancing act is required between obtaining a sufficient
number of stations (which captures greater spatial information) and a reasonably long
record length (which enhances accuracy of at-site flood quantile estimates. Selection of a
cut-off record length appears to be difficult as this can affect the total number of stations
available in a study area. However for this study, the stations having a minimum of 10
years of annual instantaneous maximum flow records are selected initially as ‘candidate
stations’. This is because that sample size smaller than 10 years may not be useful in RFFA
in Australia as this often suffers from long periods of droughts and flood quantile estimates
with smaller record lengths this may provide biased results. Here 10 years is the cut-off
record length; however, the adopted threshold was 24 years for most of the Australian
states as noted later in this chapter.
Regulation: Ideally, the selected streams should be unregulated, since major regulation
affects the rainfall-runoff relationship significantly (storage effects). Streams with minor
regulation, such as small farm dams, may be included because this type of regulation is
unlikely to have a significant effect on annual floods. Gauging stations subject to major
regulation are not included.
Urbanisation: Urbanisation can affect flood behaviour dramatically (e.g. decreased
infiltration losses and increased flow velocity). Therefore, catchments with more than 10%
of the area affected by urbanisation are not included in the study.
Landuse change: Major landuse changes, such as the clearing of forests or changing
agricultural practices modify the flood generation mechanisms and make streamflow
records heterogeneous over the period of record length. Catchments which have undergone
major land use changes over the period of streamflow records are not included in the data
set.
Quality of data: Most of the statistical analyses of flood flow data assume that the available
data are essentially error free; at some stations this assumption may be grossly violated.
Stations graded as 'poor quality' or with specific comments by the gauging authority
regarding the quality of the data were assessed in greater detail; if they were deemed 'low
quality' they were excluded. For example, if there were many missing data points, or the gauging
station location had been shifted a long way from the previous location, the station was excluded.
4.4 STREAMFLOW DATA PREPARATION
4.4.1 FILLING MISSING RECORDS IN ANNUAL MAXIMUM FLOOD SERIES
Missing observations in streamflow records at gauging locations are very common and one
of the elementary steps in any hydrological data analysis is to make decisions about dealing
with these missing data points. Missing records in the annual maximum flood series are in-
filled where the extra data points can be estimated with sufficient accuracy to contribute
additional information rather than ‘noise’. For this study, one of the following methods (a
or b) is applied, as documented in Rahman (1997) and Haddad et al. (2010a).
(a) Comparison of the monthly instantaneous maximum (IM) data with monthly
maximum mean daily (MMD) data at the same station for years with data gaps. If a
missing month of instantaneous maximum flow corresponds to a month of very low
maximum mean daily flow, then that is taken to indicate that the annual maximum
did not occur during that missing month.
(b) Application of a linear regression between the annual maximum mean daily flow
series and the annual instantaneous maximum series of the same station. Regression
equations developed are used for filling gaps in the IM record, but not to extend the
overall period of record of instantaneous flow data.
For in-filling the gaps, Method (a) is preferred over Method (b), as it is more directly
based on observed data for the missing month and involves fewer assumptions.
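Method (b) can be sketched as a simple least squares fit; the array names and the use of NaN to mark missing instantaneous maxima are illustrative assumptions, not the exact implementation used in this study:

```python
import numpy as np

def fill_im_gaps(ann_max_daily, ann_inst_max):
    """Sketch of Method (b): regress the annual instantaneous maximum (IM)
    series on the annual maximum mean daily (MMD) series using years where
    both exist, then estimate IM for years where only MMD is available.
    Arrays are aligned by year; np.nan marks missing IM values."""
    have = ~np.isnan(ann_inst_max)
    # Fit IM = intercept + slope * MMD on the jointly observed years
    slope, intercept = np.polyfit(ann_max_daily[have], ann_inst_max[have], 1)
    filled = ann_inst_max.copy()
    gaps = np.isnan(ann_inst_max)
    filled[gaps] = intercept + slope * ann_max_daily[gaps]
    return filled
```

Consistent with the text, such a regression is used only to fill gaps within the IM record, not to extend the overall period of record.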
4.4.2 TREND ANALYSIS
Hydrological data for any flood frequency analysis, be it at-site or regional, should be
stationary, consistent and homogeneous. The annual maximum flow series should not show
any time trend to satisfy the basic assumption of stationarity with traditional flood
frequency analyses methods. Thus, in this study, a trend analysis is carried out where
possible to identify stations showing significant trend and the stations which do not show
any trend are included in the primary data set for each Australian state.
Two tests are initially applied to detect time trends: the Mann–Kendall test (Kendall, 1970)
and the distribution-free CUSUM test (McGilchrist and Woodyer, 1975); both tests are
applied at the 5% significance level. The Mann-Kendall test is concerned with testing
whether there is an increase or decrease in a time series, whereas the CUSUM test
concentrates on whether the mean values in two parts of a record are significantly different.
As a useful guide and in addition to the trend tests, a simple time series plot and a
cumulative flow graph of the station are also used to detect shifts in the annual maximum
flood data.
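A minimal sketch of the Mann–Kendall test is given below; the tie correction is omitted for brevity, so this is not the exact implementation used in this study:

```python
import numpy as np
from scipy.stats import norm

def mann_kendall(series, alpha=0.05):
    """Sketch of the Mann-Kendall trend test (no tie correction).

    Returns the standardised statistic Z and whether a trend is detected
    at the given significance level."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    # S statistic: sum of signs over all pairs (j > i)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0   # variance in the no-ties case
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    return z, abs(z) > norm.ppf(1 - alpha / 2)
```

A significantly positive Z indicates an increasing trend and a significantly negative Z a decreasing trend; stations returning a detected trend would be excluded from the primary data set.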
4.4.3 RATING CURVE ERROR AND IDENTIFICATION
Most stream gauging authorities establish a network of streamflow gauging stations to
obtain continuous streamflow data. However, in most cases, these do not measure the
actual discharge directly. Rather it is the stage that is recorded, and subsequently
transformed to discharge by means of an estimated rating curve, which is constructed in
most cases by correlating measurements of discharge with the corresponding observations
of stage. However, the range of observed flood levels generally exceeds the range of
‘measured’ flows, thus requiring different degrees of extrapolation of well established
rating curves. Thus, most of the discharges calculated by rating curve are subject to
uncertainty. Different methods of rating curve extrapolation are associated with a range of
assumptions, from simple extension of fitted regression lines to hydraulic analysis methods
requiring additional data. The magnitude of rating curve extrapolation errors depends on
the stream and flood plain conditions near the gauging station, the strengths of the
assumptions made in extrapolation, and the degree of extrapolation beyond the range of
measured flows (Kuczera, 1999a).
Any rating curve extrapolation error is directly transferred into the largest observations in
the annual maximum flood series, and use of these extrapolated data in flood frequency
analysis can result in grossly inaccurate flood estimates, particularly for higher ARIs.
There are several studies that have examined the uncertainty of a single discharge estimate
due to rating curve variability using a regression-based approach, e.g., Venetis (1970),
Dymond and Christian (1982) and Reitan and Petersen-Øverleir (2008). On the other hand,
the impact of rating curve error and imprecision in the estimation of the flood quantile has
received less attention in hydrological literature (Petersen-Øverleir and Reitan, 2009).
Potter and Walker (1981), Rosso (1985), Shuzheng and Yinbo (1987) and Kuczera (1992,
1996) provided some insights into the problem by analysing a multiplicative error model.
Kuczera (1996) and Reis and Stedinger (2005) adopted a multiplicative error model in a
Bayesian framework to deal with rating curve error. From these studies, the main
conclusion to be drawn is that multiplicative measurement error introduces bias into
estimated flood quantiles.
In this study, the stations having annual maximum flood data associated with a high degree
of rating curve extrapolation are identified by introducing a 'rating ratio' (RR). The annual
maximum flood series data point for each year (estimated flow Q_E) is divided by the
maximum measured flow (Q_M) for that station to define the rating ratio (see Equation 4.1).
The rating ratio is thus based on the highest measured flow over the total period of
record, while the annual maximum flows are based on the gauging authorities' best estimate
of the rating curve applicable at the time of each flow event.
Rating Ratio (RR) = QE / QM        (4.1)
If the RR value is below or near 1, the corresponding annual maximum flow may be
considered to be free of rating curve extrapolation error. However, a RR value well above 1
indicates a rating curve error that can cause notable errors in flood frequency analysis.
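Equation 4.1 can be applied in bulk as sketched below; the flow values and the helper name are hypothetical:

```python
def rating_ratios(annual_max_flows, max_measured_flow):
    """Rating ratio RR = QE / QM (Equation 4.1) for each annual maximum
    flow QE, given the station's highest measured flow QM."""
    return [q / max_measured_flow for q in annual_max_flows]

# Hypothetical annual maxima; the station's highest gauged flow is 100.
rr = rating_ratios([120.0, 80.0, 450.0, 95.0], max_measured_flow=100.0)
suspect = [r > 1.0 for r in rr]  # flows likely affected by extrapolation
```

Points with RR well above 1 (here the third value, RR = 4.5) are the ones flagged as subject to possible rating curve extrapolation error.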
As an example, for Station 222202, there are 11 data points with RR values greater than 1
(27% of total data points) and the maximum value of RR is 5.5 (Figure 6). This large
degree of rating curve extrapolation is likely to affect flood frequency estimates at this
station, especially the higher ARI floods such as Q50 and Q100, unless appropriate measures
are taken. The application of RR is discussed further in the latter part of this chapter.
For any RFFA, a large number of stations with reasonably long record lengths are required
and hence a trade-off needs to be made between an extensive data set that includes stations
with very large RR values (and thus lower accuracy) and a smaller data set with RR values
restricted to what could be considered to be a “reasonable upper limit” of rating curve
errors.
A working method to decide on a cut-off RR value is determined by looking at the average
and the maximum RR values for each station in a region/state. Based on the results from
VIC and NSW, the RR values found to represent a reasonable compromise between
accuracy at individual sites and total size of the regional data set are an average of 4 and a
maximum of 20.
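This screen can be sketched as follows; the helper name and the stations' RR series are hypothetical:

```python
def passes_rr_screen(rr_series, mean_limit=4.0, max_limit=20.0):
    """Keep a station only if its average and maximum rating ratios fall
    within the adopted limits (average of 4 and maximum of 20)."""
    return (sum(rr_series) / len(rr_series) <= mean_limit
            and max(rr_series) <= max_limit)

# Hypothetical RR series for two stations
stations = {
    "A": [0.8, 1.1, 2.0, 3.5],   # within both limits: retained
    "B": [1.0, 2.0, 25.0, 3.0],  # maximum RR exceeds 20: rejected
}
kept = [s for s, rr in stations.items() if passes_rr_screen(rr)]
```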
Figure 6 Plot of rating ratios (RR = QE/QM) for station 222202; data points with RR > 1 are subject to possible rating curve errors
4.4.4 SENSITIVITY ANALYSIS AND IMPACT OF RATING CURVE EXTRAPOLATION ON FLOOD QUANTILE ESTIMATES
Error arising from rating curve extension is typically smooth and can therefore introduce
systematic over- or under-estimation of the true discharge. The coefficient of variation of
the rating curve extension error is not well known; however, Potter and Walker (1981)
suggest it could be as high as 30% in poor situations, such as in the extrapolation zone (see
Figure 7). In the interpolation zone, however, where the rating curve is well defined by
discharge-stage measurements, the error coefficient of variation
would be small, say 1% to 5% (Kuczera, 1996 and Reis and Stedinger, 2005). As noted by
Kuczera (1999a), there are two cases in which smooth rating curve extension can introduce
systematic error. Firstly an indirect estimate can be made for large floods well beyond the
measured flow; it is this estimate that is then subject to extreme uncertainty. In such cases
estimates that are well below the true discharge can cause significant underestimation in
flood frequency analysis and vice versa. Rating curves are also extended by the slope-
conveyance method, which mainly relies on extrapolation of gauged estimates of the
friction slope so that this slope converges to a constant value. This can cause considerable
systematic error which is difficult to quantify as compared to the log-log extrapolation. As
it is the most commonly employed approach for rating-curve extrapolation, log-log
extrapolation is explored in this study.
In log-log extrapolation, the systematic error can be seen as the likely divergence from the
true rating as the discharge increases. Thus, as the rating curve is extended from the true
rating curve an extension zone is introduced. This extension zone depends on the distance
from the anchor point and not from the origin. In this case the systematic error is
incremental, as it originates from the anchor point. In this study, to implement the concept
of systematic rating curve error, the flow that is closest to RR = 1 is used as the “anchor
point” in the FLIKE rating curve error model (Kuczera 1999b). The assumption is then
made that there is little error (1 to 5%) up to the anchor point (Figure 7). All discharge
estimates with RRs > 1 (this means the true flood discharge exceeds the anchor value) have
systematic error and deviate away from the anchor point. The application of the RR using a
cut-off point value is introduced in this study to remove stations which are likely to be
associated with high rating-curve-related errors. Further discussion is presented later in
this chapter, where the impacts of different rating curve errors and RR values on flood
quantile estimates are examined, to demonstrate the importance of accurate flood discharge
estimates.
Figure 7 Rating curve extension error
4.4.5 TESTS FOR OUTLIERS
In a set of annual maximum flood series there is a possibility of outliers being present. An
outlier is an observation that deviates significantly from the bulk of the data, which may be
due to errors in data collection or recording, or due to natural causes.
In this study, the Grubbs and Beck (1972) method is adopted for detecting high and low
outliers. This method was recommended in Bulletin 17B by the United States Water
Resources Council after large-scale testing of a wide variety of procedures. The method is
based on determining high outlier and low outlier thresholds by applying a one-sided 10%
significance level test that considers the sample size. The test was developed by Grubbs
and Beck (1972) for detecting single outliers from a normal distribution but (when applied
to the logs of a flood data series) has been shown to be also applicable to the log Pearson
type 3 (LP3) distribution. The method is simple to use and has been widely applied in
North America (Ng et al., 2007). Its application to dealing with low outliers is
straightforward. However, it should be noted here that special precaution is needed to treat
any detected high outlier, given that there is a 10% chance of the null hypothesis of no
outliers having been wrongly rejected. If not caused by data error, the 'high outlier' data
point contains very useful information regarding the frequency of large floods.
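The low-outlier threshold can be sketched with the Bulletin 17B form of the test; the K_N approximation below (K_N = -0.9043 + 3.345*sqrt(log10 N) - 0.4046*log10 N, for the one-sided 10% significance level) is the commonly cited fit to the tabulated statistics, and the 20-year series is hypothetical, not the study's actual data:

```python
import math

def grubbs_beck_low_threshold(flows):
    """One-sided 10% Grubbs-Beck low-outlier threshold, applied once to
    the base-10 logs of the annual maximum flows (Bulletin 17B style)."""
    n = len(flows)
    logs = [math.log10(q) for q in flows]
    m = sum(logs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in logs) / (n - 1))
    k_n = -0.9043 + 3.345 * math.sqrt(math.log10(n)) - 0.4046 * math.log10(n)
    return 10 ** (m - k_n * sd)

# Hypothetical 20-year series with one drought-year value far below the rest
flows = [210, 340, 190, 400, 275, 310, 150, 500, 260, 330,
         220, 290, 180, 360, 240, 310, 270, 200, 3, 420]
threshold = grubbs_beck_low_threshold(flows)
low_outliers = [q for q in flows if q < threshold]
```

The drought-year value of 3 falls below the computed threshold and would be treated as a low outlier (censored flow) in the subsequent frequency analysis.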
4.5 RESULTS OF STREAMFLOW DATA PREPARATION PROCESS
The methods described in section 4.4 are applied to gauged flood data across the entire
Australian continent. For the sake of simplicity, this section presents detailed results for
VIC and NSW only; further results are summarised in Rahman et al. (2009 and 2011a).
4.5.1 DATA PREPARATION FOR VICTORIA
Based on the selection criteria presented in section 4.3, a total of 415 stations are initially
selected as candidates from VIC, each having a minimum of 10 years of streamflow record.
For in-filling the gaps in the annual maximum flood series, Method (a) is preferred over
Method (b) (see section 4.4.1 for a description of these methods). The following points
summarise the results of the in-filling of the annual maximum flood series data: (i) 273 data
points from 187 stations are in-filled by Method (a); (ii) 60 data points from 44 stations are
in-filled by Method (b); (iii) regression equations used in gap filling have high R2 values
(range 0.82 – 0.99, mean = 0.93 and SD = 0.041); and (iv) 10% of stations do not have any
missing records.
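A minimal sketch of the regression-based in-fill idea, assuming Method (a) regresses the target station's annual maxima on concurrent values at a nearby station (all values hypothetical, pure least squares by hand):

```python
from statistics import mean

# Hypothetical concurrent annual maxima (ML/d) at the target station and
# a nearby donor station with an overlapping record.
target = [120.0, 300.0, 80.0, 210.0, 150.0]
donor = [100.0, 260.0, 70.0, 180.0, 130.0]

mx, my = mean(donor), mean(target)
sxy = sum((x - mx) * (y - my) for x, y in zip(donor, target))
sxx = sum((x - mx) ** 2 for x in donor)
syy = sum((y - my) ** 2 for y in target)

slope = sxy / sxx
intercept = my - slope * mx
r2 = sxy ** 2 / (sxx * syy)  # gap-filling equations in the study: R2 0.82-0.99

# Predict the target's annual maximum in a year where only the donor recorded
donor_in_gap_year = 200.0
infilled = slope * donor_in_gap_year + intercept
```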
After in-filling the gaps, the stations are checked for possible trends. Initially, the Mann-
Kendall test is applied to the annual maximum flood series of the candidate stations. The
results revealed that some 20% of the candidate stations exhibit a decreasing trend, a
somewhat surprising result. However, the record lengths of many of these stations are less
than 20 years, and, moreover, south-east Australia has experienced a severe drought since
the mid-1990s. To explore this issue further, time series plots and mass curves are
prepared for the stations showing a trend, to detect visually whether significant changes in
slope can be identified. Figure 8 (a) presents the results for Station 230210, which shows a
noticeable decrease in annual maximum flood data from the late 1980s, supporting the
results from the Mann-Kendall test. The CUSUM test produced similar results (see Figure
8 (b)), namely a downward shift in the mean from 1995 onwards.
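The Mann-Kendall test applied above can be sketched as follows (no correction for ties; the annual maximum series is hypothetical, mimicking a post-1990s decline):

```python
import math

def mann_kendall(x, z_crit=1.645):
    """Mann-Kendall trend test (no tie correction). Returns the S statistic,
    the standardised Z value and a trend label."""
    n = len(x)
    # S = sum over all pairs i < j of sign(x[j] - x[i])
    s = sum((x[j] > x[i]) - (x[j] < x[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    if z <= -z_crit:
        trend = "decreasing"
    elif z >= z_crit:
        trend = "increasing"
    else:
        trend = "no trend"
    return s, z, trend

# Hypothetical annual maxima declining in the later part of the record
series = [900, 700, 1100, 950, 820, 640, 400, 310, 250, 180, 150, 120]
s, z, trend = mann_kendall(series)
```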
These results suggest that flood data at many stations are not independently and identically
distributed from year to year. Thus there needs to be caution applied when using short
records in estimating long term flood risks. The fact that data starting in the 1990s
exhibited a significant downward trend for many stations in VIC makes the inclusion of
stations with short records in RFFA questionable. Most RFFA methods can compensate for
sampling variability but not for bias introduced by a drought-induced systematic downward
trend in a short record.
To overcome this problem, the introduction of a longer cut-off record length appears to be
appropriate. However, the selection of a cut-off record length involves a trade-off between
spatial coverage and bias. It is judged that a cut-off record length of 25 years is adequate
for the purpose of this study. Although this has removed more than half of the candidate
stations from VIC, the remaining stations would be less affected by bias and thus would
yield more representative RFFA assessments of long-term flood risk. The number of
eligible stations after the introduction of a cut-off length of 25 years dropped to 144, which
is only 35% of the initially selected 415 stations. This shows that the useful data set for
RFFA in a given region is likely to be substantially smaller than the primary data set.
Figure 8 (a) Time series plot showing significant trends after 1995 and (b) CUSUM test plot showing significant trends after 1995 for Station 230210. Here Vk is the CUSUM test statistic defined in McGilchrist and Woodyer (1975)
In the remaining data set of 144 stations, many had rating ratios (RR) considerably greater
than 1. From the histogram of RR values shown in Figure 9 it can be seen that 90% of the
RR values for all the recorded annual maxima lie between 1 and 20. A RR value
significantly greater than 1 could magnify the errors in flood frequency quantile estimates
but, on the other hand, rejecting all stations with a RR greater than one would reduce the
number of stations below the minimum required for a meaningful RFFA. Thus, it is
decided that a cut-off RR value of 20 would be reasonable, which has reduced the eligible
number of stations from 144 to 131 for VIC. The impacts of RR values on flood quantile
estimates are presented in section 4.5.3.
Figure 9 Histogram of rating ratios of annual maximum flood data in Victoria (stations with record lengths > 25 years); 90% of the rating ratios lie between 1 and 20
The results of the outlier detection procedure are summarised here: (a) Some 43% of the
stations are found to have low outliers. The maximum number of low outliers detected in a
data series is 5, and low outliers never exceed 19% of the total number of data points in a
series. (b) Most of the detected low outliers occur for stations which are located in low
rainfall areas, especially in the western part of VIC. (c) 31% of low outliers occurred in the
years 1982 and 1967. Severe droughts occurred during these years, with the maximum
annual flows in many rivers being baseflow rather than a flood. Similar results were
reported by Rahman
(1997). (d) 55% of the stations do not show any outliers. Even the values in the drought
years of 1967 and 1982 are not low enough to be treated as low outliers. The locations of
most of these stations are in the south-eastern part of Victoria. (e) Only 1 station shows a
high outlier. The detected low outliers are treated as censored flows in flood frequency
analysis using FLIKE (that is, the information that there is no flood in that year is taken
into account).
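The censored-flow treatment can be illustrated with a stand-in likelihood: an observed year contributes a density value, while a censored (low-outlier) year contributes only the probability of falling below the threshold. A normal distribution in log space is used here for brevity in place of the LP3 that FLIKE actually fits; all numbers are hypothetical:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_loglik(mu, sigma, observed_logs, n_censored, log_threshold):
    """Log-likelihood with censoring: observed log-flows contribute the
    density, each censored year contributes P(logQ < threshold)."""
    ll = sum(math.log(norm_pdf((x - mu) / sigma) / sigma)
             for x in observed_logs)
    ll += n_censored * math.log(norm_cdf((log_threshold - mu) / sigma))
    return ll

obs = [1.8, 2.1, 2.3]  # hypothetical log10 annual maxima above the threshold
with_censoring = censored_loglik(2.0, 0.3, obs, n_censored=2, log_threshold=1.5)
without = censored_loglik(2.0, 0.3, obs, n_censored=0, log_threshold=1.5)
```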
The final VIC database contains 131 stations whose record lengths range from 25 to 52
years (mean and median: 32 years and standard deviation: 5 years). Some 87% of the
stations have record lengths in the range 25-35 years, 8% of the stations in the range 35-45
years and 5% of the stations in the range 50-55 years. The catchment areas range from 3 to
997 km2 (mean: 321 km2 and median: 289 km2). Some 15 catchments (11%) are in the
range of 3 to 50 km2, 11 catchments (8%) are in the range of 51 to 100 km2, 78 catchments
(60%) are in the range of 101 to 500 km2; and 27 catchments (21%) are in the range of 501
to 997 km2. The histogram of streamflow record lengths of the 131 stations is shown in
Figure 10. The distribution of catchment areas is shown in Figure 11. The geographical
distribution of these stations is shown in Figure 16, which shows that there is no station in
north-western VIC that has passed the selection criteria. This region is indeed characterised
by very low runoff and ephemeral streams.
Figure 10 Distributions of streamflow record lengths of the selected 131 stations from Victoria
Figure 11 Distributions of catchment areas of the 131 catchments from Victoria
4.5.2 DATA PREPARATION FOR NSW AND ACT
Initially, a total of 635 stations are selected from NSW and the Australian Capital Territory
(ACT). After in-filling the gaps and using the selection criteria discussed in section 4.3,
only 294 stations are retained with a minimum of 10 years of annual maximum flood data.
The Mann-Kendall test, time series plot inspection and CUSUM test resulted in some 11%
of the stations (31 stations) being identified as having a decreasing trend, generally after
1990. As for Victoria, a cut-off record length of 25 years is adopted, which has reduced
the number of eligible stations to 106, only 17% of the initially selected 635 stations.
In the remaining data set of 106 stations from NSW, many had RR values considerably
greater than 1 – see Figure 12. As for the VIC data, a cut-off RR value of 20 is adopted,
which has reduced the eligible number of stations from 106 to 96.
Figure 12 Histogram of rating ratios for 106 stations from NSW; over 95% of the rating ratios lie between 1 and 20
Some 40% of the stations from NSW and ACT are found to have low outliers. The
maximum number of low outliers detected in a data series is 9, and low outliers never
exceed 21% of the total number of data points in a series. Most of these detected low outliers occur for
stations located in low rainfall areas, especially in the western parts of NSW. Some 31% of
low outliers occur in the years 1967, 1982 and 1994. About 47% of the stations do not
show any outliers. Only 5 stations have shown a high outlier. The record lengths of the 96
stations range from 25 to 74 years (mean: 34 years, median: 31 years and standard
deviation: 10 years). Some 77% of the stations have record lengths in the range 25-35
years, 18% in the range 40-55 years, and 5% in the range 60-75 years.
The catchment areas range from 8 to 1010 km2, with an average value of 353 km2, median
of 267 km2 and a standard deviation of 276 km2. Some 9 catchments (9%) are in the range
of 8 to 50 km2, 9 catchments (9%) are in the range of 51 to 100 km2, 52 catchments (54%)
are in the range of 101 to 500 km2 and 27 catchments (28%) are in the range of 501 to 1010
km2. The histogram of streamflow record lengths of the 96 stations is shown in Figure 13.
The distribution of catchment areas is shown in Figure 14. The geographical distribution of
the 96 stations is shown in Figure 16. There is no station in far western NSW that has
passed the selection criteria.
Figure 13 Distributions of streamflow record lengths of the selected 96 stations from NSW
Figure 14 Distributions of catchment areas of the 96 catchments from NSW
4.5.3 SENSITIVITY ANALYSIS - IMPACT OF RATING CURVE ERROR ON
FLOOD QUANTILE ESTIMATES
To assess the impact of rating curve error (expressed in terms of RR) on flood quantile
estimates, the FLIKE software, which implements the principles outlined in Kuczera
(1999a, b), is employed to fit the LP3 distribution using the Bayesian parameter fitting
procedure. In this application of FLIKE, no prior information is used with both the ‘no
rating curve error’ and the ‘rating curve error’ cases. The flow closest to RR = 1 is used as
the “anchor point” in the rating curve error model inbuilt in FLIKE. The flows greater than
RR = 1 are expected to be associated with measurement errors, i.e., the higher the RR value
for a data point, the greater the degree of rating curve extrapolation error associated with it
(see Figure 7). In the flood frequency analysis using FLIKE for the ‘rating error’ case, less
weight is assigned to the flow data points beyond the anchor point (which represents higher
flows).
Three cases are considered here for illustration purposes where flows in excess of the
anchor point are corrupted by a multiplicative error assumed to be log-normally distributed
with mean one and coefficient of variation (CV) equal to 10%, 20% and 30%. Also, four
different values of maximum RR are considered (5, 10, 20 and 40). Four stations from the
database for VIC and NSW are selected with maximum RR values in the range of 5-40:
Station 210040 (RR = 5), Station 222213 (RR = 10), Station 234209 (RR = 20) and Station
221201 (RR = 40).
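The multiplicative error model in this experiment can be sketched as follows; the conversion of the specified CV into log-normal parameters with mean one uses sigma^2 = ln(1 + CV^2) and mu = -sigma^2/2, and the flow values and anchor are hypothetical:

```python
import math
import random

def corrupt_flows(flows, anchor, cv, seed=1):
    """Corrupt flows above the anchor discharge with a multiplicative
    log-normal error of mean 1 and coefficient of variation cv; flows at
    or below the anchor are returned unchanged."""
    rng = random.Random(seed)
    sigma2 = math.log(1.0 + cv ** 2)  # log-normal parameters for mean 1
    mu = -0.5 * sigma2
    out = []
    for q in flows:
        if q > anchor:
            out.append(q * rng.lognormvariate(mu, math.sqrt(sigma2)))
        else:
            out.append(q)
    return out

# Hypothetical annual maxima; flows above the 100-unit anchor are corrupted
flows = [50.0, 120.0, 300.0, 80.0, 900.0]
noisy = corrupt_flows(flows, anchor=100.0, cv=0.30)
```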
Table 1 presents the flood quantile estimates using FLIKE for four scenarios where the
coefficient of variation of the multiplicative error (CV) equals 0%, 10%, 20% and 30%.
For each of these four scenarios, stations with maximum RR values of 5, 10, 20 and 40 are
analysed. The table gives the expected quantile and the lower and upper 95% confidence
limits for the 50- and 100-year floods. To assist interpretation, the results for the cases
where rating curve error is assumed present (i.e., CV > 0) are expressed as ratios of the
case CV > 0 to the case CV = 0.
Table 1 Flood quantile estimates and associated errors using ARR FLIKE with and without consideration of rating curve error (MMF = maximum measured flow)

Station  Max.  ARI of    Rating error CV = 0%      Ratio CV=10%/CV=0%      Ratio CV=20%/CV=0%      Ratio CV=30%/CV=0%
         RR    MMF (yr)  LL 95%  Expected  UL 95%  Expected %  CL width %  Expected %  CL width %  Expected %  CL width %

50-year flood quantile
210040   5     2.77      778     1567      4753    102.5       105.2       111.4       123.5       121.4       147.3
222213   10    1.80      101     175       416     101.6       109.0       103.4       118.3       104.2       133.1
234209   20    1.03      22      28        46      134.4       189.0       133.7       193.3       149.0       244.2
221201   40    3.77      281     397       693     108.7       126.8       120.0       167.0       138.1       224.7

100-year flood quantile
210040   5     2.77      1018    2270      8854    103.5       105.2       114.5       126.6       127.0       146.9
222213   10    1.80      123     235       682     102.2       107.8       104.1       116.7       105.2       132.1
234209   20    1.03      23      30        56      137.9       185.4       142.5       196.7       161.3       250.4
221201   40    3.77      321     465       912     111.4       129.9       126.1       176.4       149.3       249.0
The results show that the width of the 95% quantile confidence limits increases with
increasing rating curve error CV reflecting the fact that errors in estimating the bigger flood
flows reduce the information content of the higher flows. Indeed, in the worst case, the
confidence limit width increases to about 250% of its no-error value. Moreover, the bias in
quantile estimates increases with increasing CV, in some cases reaching 50% to 60%. This
confirms the soundness of eliminating stations judged to have poor quality ratings.
Of interest is the relationship of quantile bias and accuracy with maximum RR. It appears
that as the maximum RR increases, the bias and uncertainty in the quantiles tends to grow
for a given rating curve error CV. The trend is somewhat obscured by the fact that the ARI
of the maximum measured flow (i.e. the anchor point) varies. As the ARI of the anchor
point grows fewer flows are affected by rating curve errors; for example, if the ARI of the
anchor point is 2 years, then half of the data will lie below the anchor point, largely
unaffected by rating curve error. Thus one can see that Station 221201, which has a
maximum RR of 40 but an anchor point ARI of 3.77 years, has similar bias and accuracy to
Station 234209, which has a lower maximum RR of 20 but an anchor point ARI of 1.03 years.
Although this analysis is not conclusive, it does suggest that stations with high maximum
RR values are likely to be problematic unless some form of compensation for rating curve
error is made.
4.6 SUMMARY RESULTS OF STREAMFLOW DATA PREPARATION FOR
THE OTHER STATES
The methods applied in section 4.5 are also applied to gauged flood data across the rest of
the Australian continent. In this section we present the summary results for QLD, TAS, Northern
Territory (NT), Western Australia (WA) and South Australia (SA). Further results can be
found in Rahman et al. (2009 and 2011b). This section also presents a summary of the final
catchments adopted for this study.
4.6.1 TASMANIA
A total of 53 catchments have been selected from TAS. The record lengths of annual
maximum flood series of these 53 stations range from 19 to 74 years (mean: 30 years,
median: 28 years and standard deviation: 10.43 years). The catchment areas of the selected
53 catchments range from 1.3 km2 to 1900 km2 (mean: 323 km2 and median: 158 km2). The
geographical distribution of the selected 53 catchments is shown in Figure 16.
4.6.2 QUEENSLAND
A total of 172 catchments have been selected from QLD. The record lengths of annual
maximum flood series of these 172 stations range from 25 to 97 years (mean: 41 years,
median: 36 years and standard deviation: 15.2 years). The catchment areas of the selected
172 catchments range from 7 km2 to 963 km2 (mean: 325 km2, median: 254 km2). The
geographical distribution of the selected 172 catchments is shown in Figure 16.
4.6.3 SOUTH AUSTRALIA
A total of 29 catchments have been selected from SA. The record lengths of annual
maximum flood series of these 29 stations range from 18 to 67 years (mean: 36 years,
median: 34 years and standard deviation: 11.2 years). The catchment areas of the selected
29 catchments range from 0.6 km2 to 708 km2 (mean: 170 km2 and median: 76.5 km2). The
geographical distribution of the selected 29 catchments is shown in Figure 16.
4.6.4 NORTHERN TERRITORY
A total of 55 catchments have been selected from NT. The record lengths of annual
maximum flood series of these 55 stations range from 19 to 54 years (mean: 35 years,
median: 33 years and standard deviation: 11.30 years). The catchment areas of the selected
55 catchments range from 1.4 km2 to 4,325 km2 (mean: 682 km2 and median: 360 km2).
The geographical distribution of the selected 55 catchments is shown in Figure 16.
4.6.5 WESTERN AUSTRALIA
A total of 146 catchments have been selected from WA. The record lengths of annual
maximum flood series of these 146 stations range from 20 to 57 years (mean: 31 years,
median: 30 years and standard deviation: 8.02 years). The catchment areas of the selected
146 catchments range from 0.1 km2 to 7,405.7 km2 (mean: 323 km2 and median: 60 km2).
The geographical distribution of the selected 146 catchments is shown in Figure 16.
4.6.6 SUMMARY OF STREAMFLOW DATA AUSTRALIA WIDE
A total of 682 catchments have been selected from all over Australia. The record lengths of
the annual maximum flood series of these 682 stations range from 18 to 97 years (mean: 35
years, median: 33 years and standard deviation: 11.5 years). The distribution of record
lengths is shown in Figure 15 (a).
The catchment areas of the selected 682 catchments range from 0.1 km2 to 7,405.7 km2
(mean: 350 km2, median: 214 km2). The geographical distribution of the selected 682
catchments is shown in Figure 16. The distribution of catchment areas of these stations is
shown in Figure 15 (b).
Figure 15 (a) Distribution of annual maximum flood record lengths of 682 stations from all over Australia (b) Distribution of catchment areas of 682 stations from all over Australia
Figure 16 Geographical distributions of the selected 682 stations from all over Australia
The summary of all the Australian data prepared as a part of this study is provided in Table
2.
Table 2 Summary of selected stations Australia wide

State         No. of stations  Median streamflow record length (years)  Median catchment size (km2)
NSW and ACT   96               34                                       267
VIC           131              33                                       289
SA            29               34                                       76.5
TAS           53               28                                       158
QLD           172              36                                       254
WA            146              30                                       60
NT            55               33                                       360
Total         682              -                                        -
4.7 SELECTION AND ABSTRACTION OF CATCHMENT CHARACTERISTICS
Catchment characteristics used in many previous RFFA studies were summarised by
Rahman (1997). He grouped the catchment characteristics under the headings of climatic
characteristics, morphometric characteristics, catchment cover and land use characteristics,
geological and soil characteristics, catchment storage characteristics, and location
characteristics. Many catchment characteristics are highly correlated, and the inclusion of
strongly correlated variables in prediction equations does not add any new information; it
also causes problems in statistical analysis (e.g. multicollinearity). The following
guidelines can be useful in making a reasonable selection:
- The characteristics should have a plausible role in flood generation.
- They should be unambiguously defined.
- Characteristics should be easily obtainable. When a simpler characteristic and a complex one are correlated and have similar effects, the simpler characteristic should be chosen.
- If a derived/combined characteristic is used, it should have a simple physical interpretation.
- The characteristics in the selected set should not be highly correlated, because this results in unstable parameters in hydrologic regression analysis.
- The prediction performance of a characteristic in other regionalisation studies should be taken into account, as this can give some general idea regarding the importance of the characteristic.
Based on the hydrological significance, correlations and ease of the data abstraction, eight
catchment characteristics are included in this study as listed in Table 3, and described
below.
Catchment area: Catchment area is the main scaling factor in the flood process and
directly affects the potential flood magnitude from a given storm event. The total volume of
runoff (Q) is proportional to the catchment area (A), following the general form:

Q = cA^m        (4.2)
where the exponent m varies from 0.5 to 1.00.
Table 3 Catchment characteristics variables used in the study
Catchment characteristics
1. area: Catchment area (km2)
2. I: Design rainfall intensity (mm/h)
3. rain: Mean annual rainfall (mm)
4. evap: Mean annual areal potential evapotranspiration (mm)
5. S1085: Slope of the central 75% of mainstream (m/km)
6. sden: Stream density (km/km2)
7. forest: Fraction of catchment area under forest.
8. qsa: Fraction quaternary sediment area (VIC only).
Almost all of the reported RFFA studies have found catchment area to be very significant.
One of the reasons why the area variable has been so useful in statistical hydrology is its
association with other significant morphometric characteristics like slope, stream length
and stream order. Area was characterised by Anderson (1957) as the ‘devil’s own variable’,
because almost every watershed characteristic is correlated with it. As with area itself,
the mean annual flood is directly related to other morphometric characteristics, which
are in turn closely correlated with area.
In this study, catchment area is obtained from 1:100,000 topographic maps which are
readily available for large parts of Australia.
Rainfall intensity: Storm rainfall intensity (IARI,d), for an appropriate burst duration (d) and
average recurrence interval (ARI), has been found to be the most significant climatic
predictor in previous RFFA studies. This is to be expected given the strong causal link
between rainfall intensity and peak flow. Importantly, these data are simple to obtain from
published sources (e.g. ARR1987 Volume 2).
The use of rainfall intensity requires the selection of an appropriate storm burst duration
and ARI. It seems to be logical to use a design rainfall intensity with a duration equal to the
time of concentration (tc), as suggested in the probabilistic rational method (I.E. Aust.,
1987, 2001). This is because as catchment area gets bigger, tc gets longer, which results in
smaller average design rainfall intensity. However, there are different methods to estimate
tc e.g. Bransby Williams formula and Friend formula (I.E. Aust., 2001). For consistency,
and ease of application, the formula recommended in ARR 1987 for VIC and eastern NSW,
given by Equation 4.3, is adopted in this study.
tc = 0.76 A^0.38        (4.3)

where tc is the time of concentration in hours and A is the catchment area in km2.
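A minimal illustration of Equation 4.3 (the function name is illustrative):

```python
def time_of_concentration(area_km2):
    """ARR 1987 formula for VIC and eastern NSW (Equation 4.3):
    tc = 0.76 * A**0.38, with tc in hours and A in km2."""
    return 0.76 * area_km2 ** 0.38

tc = time_of_concentration(100.0)  # about 4.4 hours for a 100 km2 catchment
```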
In addition to the design rainfall intensity for a given ARI and tc (IARI,tc), rainfall intensities
with fixed durations and ARIs are also trialled, e.g. intensities for ARIs of 2 and 50 years
and durations of 1 and 12 hours.
The various design rainfall intensities data for the selected study catchments are obtained
using the intensity frequency duration (IFD) Calculator on the BOM website or the design
data in ARR Volume 2.
Mean annual rainfall: Mean annual rainfall has been used frequently in many previous
RFFA studies. It may not have a direct link with flood peak, but it acts as a surrogate for
some other characteristics (e.g. vegetation and wetness index) and is readily available.
Thus, mean annual rainfall is included as a predictor variable in this study. The data for the
mean annual rainfall for each catchment is extracted from the BOM Data CD of Annual
Rainfall.
Mean annual evaporation: This relates to the main loss component in the rainfall-runoff
process. It is readily available and thus is included in this study. The mean annual areal
potential evapotranspiration data for each catchment is extracted from the BOM Data CD
of Evaporation.
Slope: Slope is significant for any gravitational flow. With other catchment characteristics
held constant the steeper the slope the greater the velocity of flow. Both overland and
channel slope are important. Overland slope influences the velocity of shallow surface
flow; hence, it can be expected to be of more importance for smaller catchments where the
time spent in overland flow is a significant percentage of the total time needed for water to
reach the catchment outlet. For larger catchments, channel slope is relatively more
important than overland slope.
There are several measures of slope; the most common of these are:
Equal area slope: This is the slope of a straight line drawn on a profile of a stream such
that the line passes through the outlet and has the same area under and above the stream
profile.
Average slope: This is equal to the total relief of the main stream divided by its length.
S1085: This excludes the extremes of slope that can be found at either end of the
mainstream. It is the ratio of the difference in elevation of the stream bed at 85% and 10%
of its length from the catchment outlet to 75% of the main stream length.
Areal slope: This involves measuring the slope at a large number of points within a
catchment and then determining an average areal slope.
Taylor and Schwarz (1952) slope: This assumes that velocity in each reach of a subdivided
mainstream is related via the Manning’s equation to the square root of slope. This index is
equivalent to the slope of a uniform channel having the same length as the longest water
course and an equal time of travel.
In previous studies Strahler (1950) has shown that the overland slope and channel slope are
strongly correlated. Benson (1959) found that S1085 gave the best prediction of the mean
annual flood. The S1085 is closely correlated with the Taylor and Schwarz slope (NERC,
1975).
From the different measures of slope, S1085 is deemed adequate and the simplest to
estimate from 1:100,000 topographic maps and thus has been adopted in this study.
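The S1085 definition above can be expressed directly (a sketch; the units and function name are assumptions — elevations in metres and stream length in kilometres, giving a slope in m/km):

```python
def s1085(elev_85_m, elev_10_m, main_stream_length_km):
    """S1085: difference in stream-bed elevation between the points at 85%
    and 10% of the main stream length from the catchment outlet, divided
    by 75% of the main stream length."""
    if main_stream_length_km <= 0:
        raise ValueError("stream length must be positive")
    return (elev_85_m - elev_10_m) / (0.75 * main_stream_length_km)

# e.g. bed elevations of 450 m (at 85%) and 150 m (at 10%) on a 20 km stream:
# (450 - 150) / (0.75 * 20) = 20 m/km
```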
Stream density: This is directly related to drainage efficiency of a catchment, and has been
included in this study where possible. Stream density is defined as the total stream length,
taken as the sum of the lengths of all the blue lines in a catchment as shown on 1:100,000
topographic maps, divided by the catchment area. The length of the blue lines
can be measured by opisometer/electronic distance meter or can be obtained using GIS.
Stream density is not easy to measure and also the measured value depends on the map
scale used. It should be retained in the final prediction equation only if it delivers
significantly improved design flood estimates. Also, if it is used in final flood prediction
equations, the procedure should specify the map scale to be used in its measurement.
Forest area: The effect of vegetation on catchment response has been studied by many
researchers (e.g. Flavell and Belstead, 1986; Williamson and Van der Wel, 1991; Flavell,
1982). Forest reduces runoff by precipitation interception and transpiration. For a surface
without a canopy or leaf litter layer, the interception loss is lower and overland flow travels
more rapidly with less opportunity time for infiltration. Hence, Flavell (1982) found that
losses from rainfall decrease with increased clearing and that the runoff coefficient of the
rational method increases with increased clearing. Fraction forest cover has been included
in this study. The fraction of catchment covered by forest is estimated on 1:100,000
topographic maps by using a planimeter to measure the areas designated as dense and
medium forest, and dense and medium scrub.
Quaternary sediment area (VIC only): Storage directly affects the shape of the flood
hydrograph, however defining storage as a single parameter is difficult. Quaternary
sediment area appears to be an influential surrogate for storage, because it is a good
indicator of floodplain extent variability in a catchment. Values for quaternary sediment
area are determined from 1:250,000 geological maps.
4.8 SUMMARY
The first part of this chapter has examined various aspects of the streamflow data collation
adopted for this thesis. A total of 682 catchments have been selected from the continent of
Australia (excluding the arid region; see Figures 5 and 16). The annual instantaneous
maximum flood series of the stations have been collected, gaps filled, rating curve
extrapolation errors identified, trends and shifts in the data identified and outlier points
censored. A sensitivity analysis has also been undertaken to understand the impacts of
rating curve error on flood quantile estimation. The second part of this chapter has
examined the candidate catchment characteristics for this study; a brief explanation has
been given of each variable and how these data have been obtained. All the variables
listed in Table 3 are used in the analyses presented in the subsequent chapters of this thesis.
CHAPTER 5: RESULTS – RFFA BASED ON FIXED REGIONS AND
REGION OF INFLUENCE APPROACHES UNDER THE QUANTILE
AND PARAMETER REGRESSION FRAMEWORKS
5.1 GENERAL
This chapter develops flood prediction equations (for 6 average recurrence intervals
(ARIs), which are 2, 5, 10, 20, 50 and 100 years) using both a fixed region and region of
influence (ROI) approach in a quantile regression technique (QRT) and parameter
regression technique (PRT) framework. The ROI approach is adopted to reduce the degree
of heterogeneity present in Australian annual maximum flood regions to enhance the
accuracy in design flood estimates. The Bayesian generalised least squares regression
(BGLSR) technique is adopted for the parameter estimation which explicitly accounts for
the inter-station correlation present in the annual maximum flood series (AMFS) data and it
distinguishes between the sampling and model errors in regression analysis. The developed
prediction equations allow for design flood or flood statistic estimates to be made at an
ungauged catchment given the relevant catchment characteristics data. To assess the
performances of the developed prediction equations, a Leave-one-out (LOO) validation
procedure is adopted. The basic theory and assumptions associated with the QRT and PRT
in a ROI BGLSR framework have been discussed in Chapter 3.
5.1.1 PUBLICATIONS
Four journal papers (ERA, ranks A*, A, B and B) have been published based on the results
presented in this chapter. These journal papers are given in Appendix A and noted below:
Haddad, K. and Rahman, A. (2012). Regional flood frequency analysis in eastern
Australia: Bayesian GLS regression-based methods within fixed region and ROI
framework: Quantile Regression vs. Parameter Regression Technique. Journal of
Hydrology, 430-431, 142-161.
Haddad, K., Rahman, A. and Stedinger, J. R. (2012). Regional Flood Frequency Analysis
using Bayesian Generalized Least Squares: A Comparison between Quantile and Parameter
Regression Techniques. Hydrological Processes, 25, 1-14.
Haddad, K., Rahman, A. and Kuczera, G. (2011). Comparison of Ordinary and
Generalised Least Squares Regression Models in Regional Flood Frequency Analysis: A
Case Study for New South Wales. Australian Journal of Water Resources, 15(2), 1-12.
Haddad, K., Zaman, M. and Rahman, A. (2010b). Regionalisation of skew for flood
frequency analysis: a case study for eastern NSW. Australian Journal of Water Resources,
14(1), 33-41.
5.2 RESULTS FOR TASMANIA
5.2.1 SELECTING PREDICTOR VARIABLES WITH QRT AND PRT
A total of 53 catchments were used from Tasmania for the analyses presented here. The
locations of these catchments are shown in Figure 16. The AMFS record lengths of these
53 stations range from 19 to 74 years (mean 30 years, median 28 years and standard
deviation 10 years). The catchment areas of these 53 stations range from 1.3 to 1,900 km2
(mean 323 km2, median 158 km2 and standard deviation 417 km2).
In the fixed region approach, all the 53 catchments were considered to form one
region; however, one catchment was left out for cross-validation and the procedure was
repeated 53 times to implement the LOO validation. Hence, the model data set contained
52 catchments in each iteration step. In the ROI approach, an optimum region was formed
for each of the 53 catchments by starting with 15 stations in the first proposed region and
then consecutively adding 1 station at each iteration step.
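The ROI search described above (start with the 15 nearest stations, then enlarge the candidate region one station at a time and keep the size that minimises the regional model error variance) can be sketched generically; `distance` and `model_error_variance` are caller-supplied stand-ins for the BGLSR machinery of Chapter 3, not functions from the thesis:

```python
def select_roi(target, sites, distance, model_error_variance, start=15, step=1):
    """Return the candidate region (list of sites) with the smallest model
    error variance (MEV), grown outwards from `target` by nearness."""
    # Rank all other sites by their distance to the target site.
    ranked = sorted((s for s in sites if s != target),
                    key=lambda s: distance(target, s))
    best_region = ranked[:start]
    best_mev = model_error_variance(best_region)
    size = start
    # Enlarge the region stepwise, keeping the size with the lowest MEV.
    while size + step <= len(ranked):
        size += step
        candidate = ranked[:size]
        mev = model_error_variance(candidate)
        if mev < best_mev:
            best_region, best_mev = candidate, mev
    return best_region, best_mev
```

With a toy distance and a MEV surrogate that is minimised at six sites, the search recovers a six-site region.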
Table 4 shows the different combinations of predictor variables for the Q10 QRT model and
the models for the first three parameters of the log Pearson Type 3 (LP3) distribution.
Figures 17 and 18 show example plots of the statistics used in selecting the best set of
predictor variables for the Q10 and skew models. According to the model error variance
(MEV), combinations 6, 16, 18, 20, 17, 19 and 4 were potential sets of predictor variables
for the Q10 model. Combinations 16, 18, 20, 17, 19 and 4 contained 3 to 4 predictor
variables, while combinations 6 and 4 contained 2 predictor variables. Indeed, combination
6 with the 2 predictor variables (area and design rainfall intensity 50I12) showed the lowest
MEV and the highest pseudo coefficient of determination (R2GLS). The average variance of
prediction old (AVPO), average variance of prediction new (AVPN), Akaike information
criterion (AIC) and Bayesian information criterion (BIC) values favour combination 6 as well.
Combination 6 was compared to combination 10 (the latter also contains 2 predictor
variables, area and design rainfall intensity Itc,10). Combination 6 had a smaller MEV while
also showing the regression coefficient for variable 50I12 to be 5.5 times the posterior
standard deviation away from zero, as compared to 4 times for Itc,10. Hence, combination 6
was finally selected as the best set of predictor variables for the Q10 model.
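The exhaustive comparison of predictor combinations can be sketched with an ordinary least squares stand-in for the BGLSR-based statistics (MEV, AVPO/AVPN, AIC, BIC); the function names are illustrative and BIC is used here as the single selection criterion:

```python
import itertools
import math
import numpy as np

def ols_aic_bic(X, y):
    """AIC and BIC for an OLS fit, from the Gaussian log-likelihood."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / n
    loglik = -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)
    return 2.0 * k - 2.0 * loglik, k * math.log(n) - 2.0 * loglik

def best_combination(candidates, y):
    """Try every subset of candidate predictors (always with a constant)
    and return (BIC, names) for the lowest-BIC combination."""
    n = len(y)
    best = (math.inf, ())
    for r in range(len(candidates) + 1):
        for combo in itertools.combinations(candidates, r):
            X = np.column_stack([np.ones(n)] + [candidates[c] for c in combo])
            _, bic = ols_aic_bic(X, y)
            if bic < best[0]:
                best = (bic, combo)
    return best
```

On synthetic data where only one candidate carries signal, the BIC search keeps that predictor and penalises the noise variable.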
For the skew model, combination 4 showed the lowest MEV (0.034) and the highest R2GLS
(52%) (Figure 18), as well as the lowest AIC and BIC. Combination 1 without any
explanatory variables ranked 13 out of the 16 possible combinations (MEV of 0.045); it
also showed higher AVPO and AVPN as compared to combination 4, hence combination 4
was finally selected.
A similar procedure was adopted in selecting the best set of predictor values for other
models with the QRT and PRT. The sets of predictor variables selected as above were used
in the LOO validation with fixed regions and ROI approaches.
The Bayesian plausibility values (BPV) for the regression coefficients associated with the
QRT over all the ARIs were between 2% and 8% for the variable area and 0.000% for
design rainfall intensity 50I12. This justifies the inclusion of predictor variables area and
50I12 in the prediction equations for QRT. The BPVs for the skew model were 23% and
11% for area and 50I1, respectively, indicating these variables are not very good predictors
for skew. The BPVs for the mean model were close to 1% for both the predictor variables.
For the standard deviation model, the BPV for the predictor variable rain was 1%.
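Assuming an approximately normal posterior, a BPV like those quoted above can be approximated as the two-sided tail probability that the coefficient's sign differs from its posterior mean (a sketch, not the exact Bayesian computation of Chapter 3):

```python
import math

def bayesian_plausibility_value(post_mean, post_sd):
    """Two-sided tail probability 2*(1 - Phi(|mean|/sd)) for a coefficient
    with an approximately normal posterior; small values support keeping
    the predictor in the regression."""
    if post_sd <= 0:
        raise ValueError("posterior standard deviation must be positive")
    z = abs(post_mean) / post_sd
    # Identity: 2*(1 - Phi(z)) = erfc(z / sqrt(2))
    return math.erfc(z / math.sqrt(2.0))

# A coefficient 5.5 posterior standard deviations from zero (as reported for
# 50I12 in the Q10 model) gives a BPV of effectively 0%.
```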
Regression equations developed for the QRT and PRT for the fixed region are given by
Equations 5.1 to 5.9:
ln(Q2) = 4.18 + 0.91(area) + 3.35(50I12) (5.1)
ln(Q5) = 4.59+ 0.89(area) + 2.80(50I12) (5.2)
ln(Q10) = 4.87 + 0.85(area) + 2.57(50I12) (5.3)
ln(Q20) = 5.09 + 0.84(area) + 2.39(50I12) (5.4)
ln(Q50) = 5.45 + 0.84(area) + 2.23(50I12) (5.5)
ln(Q100) = 5.48 + 0.82(area) + 2.02(50I12) (5.6)
ln(Q̄) = 4.00 + 0.90(area) + 3.85(2I12) (5.7)
stdev = 0.64 + 0.55(rain) (5.8)
skew = – 0.05 + 0.07(area) + 1.20(50I1) (5.9)
It is reassuring to observe that the regression coefficients in the QRT set of equations vary
in a regular fashion with increasing ARI.
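Under the PRT, the regional estimates of the mean, standard deviation and skew of the log flows (Equations 5.7 to 5.9) must still be converted into a flood quantile. A sketch using the Wilson–Hilferty approximation to the LP3 frequency factor (an approximation chosen here for brevity; the thesis obtains quantiles from the fitted LP3 distribution as described in Chapter 3):

```python
import math
from statistics import NormalDist

def lp3_quantile(mean_ln, std_ln, skew, ari):
    """Flood quantile Q_ARI from the first three LP3 parameters of ln(Q),
    using the Wilson-Hilferty approximation for the frequency factor K."""
    p = 1.0 - 1.0 / ari              # annual non-exceedance probability
    z = NormalDist().inv_cdf(p)      # standard normal quantile
    if abs(skew) < 1e-8:
        k = z                        # zero skew: LP3 reduces to lognormal
    else:
        k = (2.0 / skew) * ((1.0 + skew * z / 6.0 - skew ** 2 / 36.0) ** 3 - 1.0)
    return math.exp(mean_ln + k * std_ln)
```

For zero skew and ARI = 2 years the frequency factor vanishes and the quantile is simply exp(mean of log flows), a useful sanity check.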
Table 4 Different combinations of predictor variables considered for the QRT models and
the parameters of the LP3 distribution (QRT and PRT fixed region Tasmania)
Combination   Combinations for mean, standard deviation & skew models   Combinations for flood quantile models
1 Const Const
2 Const, area Const, area
3 Const, area, (2I1) Const, area, 2I1
4 Const, area, (50I1) Const, area, 2I12
5 Const, area, (2I12) Const, area, 50I1
6 Const, area, (50I12) Const, area, 50I12
7 Const, area, rain Const, area, rain
8 Const, area, forest Const, area, forest
9 Const, area, evap Const, area, forest, evap
10 Const, area, S1085 Const, area, Itc,ARI
11 Const, area, sden Const, area, evap
12 Const, sden, rain Const, area, S1085
13 Const, forest, rain Const, area, sden
14 Const, S1085, forest Const, sden, rain
15 Const, evap Const, forest, rain
16 Const, rain, evap Const, area, 50I12, rain
17 Const, rain Const, area, 50I12, sden
18 - Const, area, 50I12, rain, evap
19 - Const, area, 50I12, Itc,ARI, evap
20 - Const, area, 50I12, Itc,ARI, rain, evap
21 - Const, area, 50I12, Itc,ARI, sden
22 - Const, area, 50I12, Itc,ARI, S1085
23 - Const, area, Itc,ARI, evap
24 - Const, area, Itc,ARI, rain
25 - Const, area, 2I1, Itc,ARI
Figure 17 Selection of predictor variables for the BGLSR model for Q10 (QRT, fixed region Tasmania).
MEV = model error variance, AVPO = average variance of prediction (old), AVPN = average variance
of prediction (new), AIC = Akaike information criterion, BIC = Bayesian information criterion; note
R2GLS uses the right-hand axis
Figure 18 Selection of predictor variables for the BGLSR model for skew
5.2.2 PSEUDO ANOVA WITH QRT AND PRT MODELS FOR THE FIXED AND
ROI REGIONS
The pseudo analysis of variance (ANOVA) tables for the Q20 and Q100 models and the
parameters of the LP3 distribution are presented in Tables 5 – 9 for the fixed regions and
ROI. This is an extension of the ANOVA in ordinary least squares regression (OLSR)
which does not recognise and correct for the expected sampling variance (Reis et al., 2005).
For the LP3 parameters, the sampling error increases as the order of moment increases i.e.
the error variance ratio (EVR) increases with the order of the moments. An EVR of greater
than 0.20 may indicate that the sampling variance is not negligible when compared to the
model error variance, which suggests the need for a GLSR analysis (Gruber et al., 2007).
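The EVR itself is just the ratio of the sampling-error to the model-error sum of squares from the pseudo ANOVA table; the Q20 figures in Table 5 reproduce the quoted values:

```python
def error_variance_ratio(sampling_error_ss, model_error_ss):
    """EVR from a pseudo ANOVA table. Values above about 0.20 indicate
    that sampling variance is not negligible relative to model error
    variance, favouring a GLSR analysis over OLSR."""
    if model_error_ss <= 0:
        raise ValueError("model error sum of squares must be positive")
    return sampling_error_ss / model_error_ss

# Table 5 (Q20): fixed region 2.08/15.5 -> 0.13; ROI 1.99/12.2 -> 0.16
```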
The ROI shows a reduced MEV (i.e. a reduced heterogeneity) as compared to the fixed
regions, as fewer sites have been used. The model error dominates the regional analysis for
the mean flood and the standard deviation models for both the fixed regions and the ROI.
However, the ROI shows a higher EVR than the fixed region case, e.g. for the mean flood
model the EVR is 0.20 for the ROI and 0.06 for the fixed region (Table 7). For the standard
deviation model the EVR is 0.66 for the ROI and 0.54 for the fixed region, which is a 12%
increase in EVR (Table 8). This shows that the ROI indeed deals better with heterogeneity,
even if only slightly.
The EVR values for the skew model are 9 and 9.3 for the fixed regions and ROI
respectively (Table 9), which are much higher than the recommended limit of 0.20. Again
the GLSR should be the preferred modelling choice over the OLSR. Given that the skew
model has a high sampling error component, an OLSR model would give misleading
results. The advantage of GLSR is that it can distinguish between the variance due to
model error and sampling error as explained in Chapter 2. Importantly, the Bayesian
procedure adds another dimension to the analysis, by computing expectations over the
entire posterior distribution. It has provided a more reasonable estimate of the MEV where
the method of moments estimator would have grossly underestimated the model error
variance, as the sampling error overwhelms the analysis.
concerned, there is little change in the EVR as compared to the fixed region, as the skew
model tends to include more stations in the regional analysis.
Pseudo ANOVA tables were also prepared for the flood quantile models. For example,
Tables 5 and 6 show the results for the Q20 and Q100 models, respectively. Here the ROI
shows a higher EVR than the fixed region. This suggests that the BGLSR should be used
with ROI in developing the flood quantile models, especially as the ARI increases.
Table 5 Pseudo ANOVA table for Q20 model for Tasmania (QRT, fixed region and ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 3             k = 3               n(σ0² − σ²) = 34.3        37.5
Model error       n − k − 1 = 48    n − k − 1 = 30      nσ² = 15.5                12.2
Sampling error    N = 52            N = 34              tr[Λ(ŷ)] = 2.08           1.99
Total             2n − 1 = 103      2n − 1 = 67         Sum of the above = 51.9   51.7
EVR                                                     0.13                      0.16
Table 6 Pseudo ANOVA table for Q100 model for Tasmania (QRT, fixed region and ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 3             k = 3               30.7                      34.1
Model error       n − k − 1 = 48    n − k − 1 = 20      19.0                      15.7
Sampling error    N = 52            N = 52              3.3                       3.13
Total             2n − 1 = 103      2n − 1 = 103        Sum of the above = 53.0   52.9
EVR                                                     0.17                      0.2

Table 7 Pseudo ANOVA table for the mean flood model for Tasmania (PRT, fixed region
and ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 3             k = 3               n(σ0² − σ²) = 30.5        54.6
Model error       n − k − 1 = 48    n − k − 1 = 24      nσ² = 17.8                7.1
Sampling error    N = 52            N = 28              tr[Λ(ŷ)] = 1.13           1.02
Total             2n − 1 = 103      2n − 1 = 55         Sum of the above = 49.4   63
EVR                                                     0.06                      0.2

Table 8 Pseudo ANOVA table for the standard deviation model for Tasmania (PRT, fixed
region and ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 2             k = 2               3.6                       3.5
Model error       n − k − 1 = 49    n − k − 1 = 33      3.6                       3.3
Sampling error    N = 52            N = 52              1.9                       2.2
Total             2n − 1 = 103      2n − 1 = 103        Sum of the above = 9.1    9.0
EVR                                                     0.54                      0.66
Table 9 Pseudo ANOVA table for the skew model for Tasmania (PRT, fixed region and
ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 3             k = 3               0.62                      1.80
Model error       n − k − 1 = 48    n − k − 1 = 46      1.74                      1.54
Sampling error    N = 52            N = 50              15.5                      14.4
Total             2n − 1 = 103      2n − 1 = 99         Sum of the above = 17.8   17.7
EVR                                                     9.0                       9.3

5.2.3 ASSESSMENT OF MODEL ASSUMPTIONS AND REGRESSION
DIAGNOSTICS

To assess the underlying model assumptions (i.e. the normality of residuals), the plots of
the standardised residuals vs. predicted values were examined. The predicted values were
obtained from LOO validation. Figures 19 and 20 show the plots for the flood quantile Q20
for the fixed region and ROI using the QRT and PRT framework. The underlying model
assumptions are satisfied to a large extent, as 95% of the standardised residual values fall
between the limits of ±2. The ROI shows standardised residuals closer to the ±2 limits.
The results in Figures 19 and 20 reveal that the developed equations satisfy the normality
of residuals assumption quite satisfactorily. Also, no specific pattern (heteroscedasticity)
can be identified, with the standardised values being almost equally distributed below and
above zero. Similar results were obtained for the skew, standard deviation and other flood
quantile models, which are not shown in this thesis due to space constraints.
Figure 19 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, fixed
region, Tasmania)
Figure 20 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, ROI,
Tasmania)
The QQ-plots of the standardised residuals (Equation 3.42) vs. normal score (Equation
3.43) for the fixed region (based on LOO validation) and ROI were also examined. Figures
21 and 22 present results for the Q20 flood quantile model, which show that all the points
closely follow a straight line. This indicates that the assumptions of normality and
homogeneity of variance of the standardised residuals have largely been satisfied. For
standardised residuals that are normally and independently distributed N(0,1), the slope of
the best-fit line in the QQ-plot, which can be interpreted as the standard deviation of the
normal scores (Z scores), should approach 1, and the intercept, which is the mean of the
normal scores, should approach 0, as the number of sites increases. It can be observed from
Figures 21 and 22 that the fitted lines
for the developed models pass through the origin (0, 0) and have a slope approximately
equal to one. The ROI approach approximates the normality of the residuals slightly better
(i.e. a better match with the fitted line) than the fixed region approach. Similar results were
also found for the mean, standard deviation, skew and other flood quantile models, which
are not shown in this thesis due to space constraints.
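The slope/intercept check described above can be automated (a sketch; the Blom-type plotting position used here is an assumption — the thesis computes the standardised residuals and normal scores from Equations 3.42 and 3.43):

```python
import numpy as np
from statistics import NormalDist

def qq_fit(standardised_residuals):
    """Fit a straight line to normal scores vs sorted standardised
    residuals; slope near 1 and intercept near 0 support the N(0,1)
    assumption."""
    r = np.sort(np.asarray(standardised_residuals, dtype=float))
    n = r.size
    pp = (np.arange(1, n + 1) - 0.375) / (n + 0.25)   # Blom plotting position
    z = np.array([NormalDist().inv_cdf(p) for p in pp])
    slope, intercept = np.polyfit(r, z, 1)
    return float(slope), float(intercept)
```

Applied to a genuinely standard normal sample, the fitted slope is close to 1 and the intercept close to 0, as the text argues.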
Figure 21 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, fixed
region, Tasmania)
Figure 22 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, ROI,
Tasmania)
To assess the adequacy of the BGLSR models, Cook’s distance values were also
calculated. No outlier/influential sites were found for the mean, standard deviation and
flood quantile models. For the skew model (Figure 23), sites 8 and 50 were above the
threshold value of 0.076 (i.e. 4/53, where 53 is the total number of sites). Site 8 showed the
largest standardised residual value. The flow data, site history and flood frequency plots of
these two sites were examined. It was found that site 8 had a record length of 33 years (in
the top 20%) and a very small annual maximum flow value in 1968, which was not
surprising as this was a drought year. This small flow caused a high negative skew of -1.60
for the site. Site 50 had a record length of 46 years (the 5th largest) and a skew
value of 1.15, and it showed the largest influence value (Figure 23). The regression analysis
was repeated by removing these two sites. Indeed, site 8 did influence the analysis, with a
notable decrease in the expected MEV (σ²) from 0.052 to 0.034. The AVPO and AVPN
dropped notably from 0.073 and 0.067 to 0.053 and 0.049, respectively. The R2GLS also
increased from 36% to 53%, which is a remarkable increase. The effective record length
based on the AVPN of 0.049 in this case is 122 years, nearly 4 times the average record
length for Tasmania. Site 8 therefore influenced the results notably and was removed from
the database in subsequent analyses. The removal of site 50 resulted in little improvement
in the skew model, with a negligible increase in R2GLS (to 55%) and a slightly smaller σ²
(0.032), so site 50 was retained.
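Cook's distance and the 4/n flagging rule can be sketched for an OLS fit (an OLS stand-in for the GLSR-based influence diagnostic used in the thesis):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance D_i for each site of an OLS fit, plus a boolean
    flag for D_i > 4/n (the threshold used for the skew model, 4/53)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    hat = X @ np.linalg.inv(X.T @ X) @ X.T       # hat (projection) matrix
    leverage = np.diag(hat)
    resid = y - hat @ y
    s2 = float(resid @ resid) / (n - k)          # residual variance
    d = resid ** 2 / (k * s2) * leverage / (1.0 - leverage) ** 2
    return d, d > 4.0 / n
```

A site that is both far from the bulk of the predictor values and off the regional trend is flagged, which mirrors how sites 8 and 50 were identified.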
Figure 23 Cook’s distance (Di) for locating outlier sites for skew model based on variable combination 4
The summary of various regression diagnostics (the relevant equations are described in
section 3.8) is provided in Table 10. This shows that for the mean flood model, the MEV
and average standard error of prediction (SEP) are much higher than those of the standard
deviation and skew models. This indicates that the mean flood model exhibits a higher
degree of heterogeneity than the standard deviation and skew models; this result is also
supported by the ANOVA analysis. Indeed, the issue here is that the sampling error becomes
larger as the order of the moment increases; therefore, in the case of the skew, the spatial
variation is a second-order effect that is not readily detectable. For the mean flood model, the
ROI shows a MEV which is 11% smaller than for the fixed region. Also, the R2GLS value for
the mean flood model with the ROI is 2% higher than for the fixed region. The reasonable
reduction in MEV alone indicates that the ROI should be preferred over the fixed region
analysis for developing the mean flood model. For the standard deviation model, ROI also
shows 8% smaller SEP and 5% higher R2GLS values. This indicates that the ROI is preferable
to the fixed region for the standard deviation model. What is also noteworthy (as seen from
Table 10) is that the SEP% for the skew model is slightly larger for the ROI than the fixed
region analysis. This may be due to the fact that, if the number of sites is reduced (smaller
ROI), the predictive variance may be slightly inflated in the skew region. The R2GLS values
for the skew models are similar for the fixed region and ROI, with the latter providing only
a 2% increase.
One can see from Table 10 that the SEP values for all the flood quantile models are 2% to
11% smaller for the ROI cases than the fixed region; the best result is obtained for ARI = 2
years. Also, the R2GLS values for ROI cases are 3% to 6% higher than the fixed region. These
results show that the ROI generally outperforms the fixed region approach.
Table 10 Regression diagnostics for fixed region and ROI for Tasmania
Model   Fixed region                            ROI
        MEV     AVP     SEP (%)   R2GLS (%)     MEV     AVP     SEP (%)   R2GLS (%)
Mean    0.35    0.37    67        86            0.24    0.27    56        88
Stdev   0.071   0.076   28        51            0.042   0.046   20        56
Skew    0.034   0.050   22        52            0.031   0.050   23        54
Q2      0.55    0.59    83        76            0.38    0.419   72        79
Q5      0.33    0.36    61        82            0.25    0.28    57        86
Q10     0.30    0.32    58        84            0.23    0.26    54        87
Q20     0.30    0.33    58        83            0.23    0.26    55        87
Q50     0.34    0.37    62        82            0.27    0.30    60        86
Q100    0.37    0.40    66        79            0.30    0.34    64        85
5.2.4 POSSIBLE SUBREGIONS IN TASMANIA
Table 11 shows the number of sites and associated MEVs for the ROI and fixed region
models. This shows that the ROI mean flood model has fewer sites on average (28 out of
52 sites i.e. 54%) than the standard deviation and skew models. The ROI skew model has
the highest number of sites which includes nearly all the sites in Tasmania (50 out of 52 i.e.
96%). The MEVs for all the ROI models (except the skew model) are smaller than the
fixed region models. This shows that the fixed region models experience a greater
heterogeneity than the ROI. If the fixed regions are made too large, the model error will be
inflated by heterogeneity that will go unaccounted for by the catchment characteristics.
Figure 24 shows the resulting sub-regions in Tasmania (with minimum MEVs) for the ROI
mean flood and skew models. For the mean flood and skew models, there are two distinct
sub-regions. The regions can be classified as east and west Tasmania for which there are
two distinct types of rainfall regimes and districts. The significance of this is that if spatial
variations do exist in the hydrological statistic of interest, they are most likely to be
captured by the ROI, as has been the case in this study for Tasmania. The results of this
analysis concur with previous studies (McConachy et al., 2003; Gamble et al., 1998;
Xuereb et al., 2001) which showed that large rainfalls over Tasmania are not
meteorologically homogeneous. In the east of the state, the largest rainfall events occur in
the warmer spring and summer months when low pressure systems in the Tasman Sea can
direct an easterly onshore air flow over Tasmania. The heaviest rainfalls in the west of the
state are due to the passage of fronts, sometimes associated with an intense extratropical
cyclone with a westerly or southwesterly airstream (Xuereb et al., 2001).
Table 11 Model error variances associated with fixed region and ROI for Tasmania (n =
number of sites in the region)
Parameter/quantile   Mean   Stdev   Skew    Q2     Q5     Q10    Q20    Q50    Q100
ROI n                28     36      50      30     35     35     34     33     33
ROI σ²               0.24   0.042   0.031   0.38   0.25   0.23   0.23   0.27   0.30
Fixed region n       52     52      52      52     52     52     52     52     52
Fixed region σ²      0.35   0.067   0.034   0.55   0.33   0.30   0.30   0.34   0.37
Figure 24 Spatial variations of the grouped minimum model error variances for Tasmania (a) mean
flood model and (b) skew model
5.2.5 EVALUATION STATISTICS
Table 12 presents the relative root mean square error (RMSEr) (Equation 3.45) and relative
error (REr) (Equation 3.44) values for the PRT and QRT models with both the fixed region
and ROI. In terms of RMSEr, ROI clearly gives smaller values than the fixed regions for all
the ARIs. The PRT-ROI shows smaller RMSEr values than the QRT-ROI for all the ARIs,
however for ARIs of 5, 10 and 20 years, the increase is noticeable (i.e. 20 to 30 %). In
terms of REr, ROI gives up to 9% smaller values than the fixed regions. The PRT-ROI
gives larger values of REr (by 13%) for both the 50 and 100 years ARIs. For ARIs of 2 to
20 years, the QRT-ROI gives smaller REr values (by 1% to 13%) than the PRT-ROI.
Finally, the results of counting the Qpred/Qobs (rr) ratios for the QRT and PRT for the ROI
and fixed regions are provided in Tables 13 and 14. The QRT-ROI has 85% of the rr values
in the desirable range, compared to 81% for the QRT-fixed region. The PRT-ROI has 78%
of the rr values in the desirable range, compared to 74% for the PRT-fixed region. These
results show that ROI performs better than the fixed regions with both the QRT and PRT.
The PRT-ROI shows 16% underestimation as compared to 8% for the QRT-ROI. The cases
with overestimation were very similar for both the methods.
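The evaluation statistics reported in Tables 12 to 14 can be computed as follows (a sketch; the limits of the "desirable" Qpred/Qobs band here are illustrative — the thesis defines the ranges in Chapter 3):

```python
import numpy as np

def evaluation_stats(q_pred, q_obs, lower=0.5, upper=2.0):
    """Relative RMSE (%), median absolute relative error (%) and the
    percentage of Qpred/Qobs ratios falling in the desirable band."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_obs = np.asarray(q_obs, dtype=float)
    rel = (q_pred - q_obs) / q_obs
    rmse_r = 100.0 * float(np.sqrt(np.mean(rel ** 2)))
    re_r = 100.0 * float(np.median(np.abs(rel)))
    rr = q_pred / q_obs
    in_band = 100.0 * float(np.mean((rr >= lower) & (rr <= upper)))
    return rmse_r, re_r, in_band
```

The median-based REr is robust to a single gross over- or underestimate, whereas the relative RMSE is dominated by it, which is why the two statistics (and the rr counts) are reported side by side.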
Table 12 Evaluation statistics (RMSEr and REr) from leave-one-out (LOO) validation for
Tasmania
Model   RMSEr (%)                                REr (%)
        PRT                 QRT                  PRT                 QRT
        Fixed      ROI      Fixed      ROI       Fixed      ROI      Fixed      ROI
        region              region               region              region
Q2      110        100      160        120       33         31       38         30
Q5      90         70       110        80        35         30       34         25
Q10     100        70       110        80        34         37       30         24
Q20     100        70       130        90        36         37       27         27
Q50     110        70       130        100       39         41       29         28
Q100    120        70       130        100       49         42       33         29
Table 13 Summary of counts/percentages based on the rr values for QRT and PRT for
Tasmania (fixed region). “U” = gross underestimation, “D” = desirable range and “O” =
gross overestimation
Model         Count (QRT)      Percent (QRT)    Count (PRT)      Percent (PRT)
              U    D    O      U    D    O      U    D    O      U    D    O
Q2            2    41   9      4    79   17     5    41   6      10   79   12
Q5            2    44   6      4    85   12     6    41   5      12   79   10
Q10           3    46   3      6    88   6      6    41   5      12   79   10
Q20           4    45   3      8    87   6      9    37   6      17   71   12
Q50           6    40   6      12   77   12     10   36   6      19   69   12
Q100          9    38   5      17   73   10     10   36   6      19   69   12
Sum/average   26   254  32     8    81   10     46   232  34     15   74   11
Table 14 Summary of counts/percentages based on the rr values for QRT and PRT for
Tasmania (ROI). “U” = gross underestimation, “D” = desirable range and “O” = gross
overestimation
ARI (years)   Count (QRT)      Percent (QRT)    Count (PRT)      Percent (PRT)
              U    D    O      U    D    O      U    D    O      U    D    O
2             3    45   4      6    87   8      6    43   3      12   83   6
5             2    45   5      4    87   10     7    42   3      13   81   6
10            3    45   4      6    87   8      9    41   2      17   79   4
20            4    45   3      8    87   6      9    40   3      17   77   6
50            6    42   4      12   81   8      9    39   4      17   75   8
100           6    42   4      12   81   8      9    39   4      17   75   8
Sum/average   24   264  24     8    85   8      49   244  19     16   78   6
5.3 SECTION SUMMARY
This section of the thesis has compared the fixed region and ROI approaches for the state
of Tasmania. A BGLSR approach was used to develop prediction equations for flood
quantiles of ARIs of 2 to 100 years (for QRT) and the first three parameters of the LP3
distribution (for PRT). It has been found that area and design rainfall intensity are
significant predictors for both the QRT and PRT based prediction equations. When
compared to the fixed region approach, the ROI with both QRT and PRT shows
improvements by reducing the negative influence of regional heterogeneity, with a
decrease in the model error variance, average standard error of prediction and an increase
in the average pseudo R²GLS. Both the standardised residual and QQ-plots of the ROI
approach satisfy the underlying model assumptions slightly better than those of the fixed
region. It has also been observed that both the QRT-ROI and PRT-ROI produce similar
average root mean square error, median relative error and median Qpred/Qobs ratio values.
Overall, the PRT-ROI and QRT-ROI have performed very similarly for Tasmania. The
ROI approach outperforms the fixed region approach for Tasmania.
5.4 RESULTS FOR NEW SOUTH WALES, VICTORIA AND QUEENSLAND
The analysis undertaken in this section makes use of observed AMFS data of catchments
ranging in areas from 3 to 1010 km2. The finally selected data set consists of n = 399
catchments (Figure 16) with AMFS record lengths ranging from 25 to 94 years (maximum
record length for New South Wales (NSW): 75 years, mean and standard deviation: 37 and
11 years, respectively; maximum record length for Victoria (VIC): 52 years, mean and
standard deviation: 33 and 5 years, respectively and maximum record length for
Queensland (QLD): 94 years, mean and standard deviation: 40 and 15 years, respectively).
In the fixed region approach, all the catchments within a state boundary were considered to
have formed one region; however, one catchment was left out for cross-validation and the
procedure was repeated n times to implement the LOO validation scheme. In the ROI
approach, an optimum region was formed for each of the n catchments by starting with 15
stations and then consecutively adding 5 stations at each iteration (see section 3.7 for more
details).
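The LOO scheme and the ROI region-growing search described above can be sketched as follows. This is an illustrative outline only, not the thesis code: `score` stands in for the BGLSR fit that returns the model error variance for a candidate region, and `dist_to_target` for whatever proximity measure ranks candidate sites; the starting size of 15 and the step of 5 follow the text.

```python
def loo_splits(sites):
    """Leave-one-out validation: each site in turn is withheld and the
    model is developed on the remaining sites (repeated n times)."""
    for i, test_site in enumerate(sites):
        train = sites[:i] + sites[i + 1:]
        yield test_site, train

def grow_roi(candidates, score, start=15, step=5):
    """Region of influence: rank candidate sites by proximity to the
    target site, start with `start` sites and add `step` sites at each
    iteration, keeping the region that minimises `score` (e.g. the
    model error variance of the fitted BGLSR model)."""
    ranked = sorted(candidates, key=lambda s: s["dist_to_target"])
    best_region, best_score = None, float("inf")
    n = start
    while n <= len(ranked):
        region = ranked[:n]
        s = score(region)
        if s < best_score:
            best_region, best_score = region, s
        n += step
    return best_region, best_score
```

In use, `score` would refit the regional regression for each trial region, so the optimum region size differs from site to site.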
5.4.1 SELECTING PREDICTOR VARIABLES WITH QRT AND PRT
The stepwise procedure for selecting the best set of catchment characteristics predictors
resulted in the following equations for the LP3 mean (μ), standard deviation (σ), skewness
(γ) and the flood quantiles (QARI) for each of the states of NSW, VIC and QLD. The
regression equations are presented in general form below, while the final results of the
equations for NSW are provided in Table 15. The final results for VIC and QLD can be seen
in Appendix B.

μ = β0 + β1(area) + β2(2I12)   for NSW, VIC and QLD   (5.10)
σ = β0 − β1(rain) − β2(S1085)   for NSW   (5.11)
γ = −β0 − β1(area) − β2(forest)   for NSW   (5.12)
σ = β0 − β1(rain) + β2(evap)   for VIC   (5.13)
γ = −β0 + β1(rain) − β2(evap)   for VIC   (5.14)
σ = β0 − β1(area) − β2(2I1)   for QLD   (5.15)
γ = −β0 − β1(50I72) + β2(rain)   for QLD   (5.16)
ln(QARI) = β0 + β1(area) + β2(Itc,ARI)   for NSW, VIC and QLD   (5.17)
Tables 16 and 17 summarise the model error variance (MEV) as expressed by its posterior
mean value, for the regional models of the three LP3 parameters and the flood quantiles Q2,
Q10 and Q100 for each of the selected combinations of catchment characteristics for NSW.
Table 15 Summary of the final BGLSR results for NSW

BGLSR model (NSW)        Regression coefficient   Posterior mean   Posterior standard deviation
Mean (μ)                 σ²                       0.29             0.051
                         β0 (constant)            4.09             0.092
                         β1 (area)                0.67             0.053
                         β2 (2I12)                2.31             0.21
Standard deviation (σ)   σ²                       0.067            0.013
                         β0 (constant)            1.25             0.12
                         β1 (rain)                -0.61            0.11
                         β2 (S1085)               -0.13            0.040
Skewness (γ)             σ²                       0.0125           0.012
                         β0 (constant)            -0.42            0.072
                         β1 (area)                -0.092           0.048
                         β2 (forest)              -0.094           0.053
Flood quantiles
QARI=2                   σ²                       0.31             0.055
                         β0 (constant)            4.06             0.13
                         β1 (area)                1.26             0.086
                         β2 (Itc,ARI=2)           2.42             0.24
QARI=5                   σ²                       0.23             0.042
                         β0 (constant)            5.11             0.092
                         β1 (area)                1.19             0.072
                         β2 (Itc,ARI=5)           2.08             0.20
QARI=10                  σ²                       0.23             0.045
                         β0 (constant)            5.56             0.10
                         β1 (area)                1.14             0.074
                         β2 (Itc,ARI=10)          1.93             0.21
QARI=20                  σ²                       0.25             0.050
                         β0 (constant)            5.91             0.11
                         β1 (area)                1.09             0.078
                         β2 (Itc,ARI=20)          1.79             0.22
QARI=50                  σ²                       0.35             0.060
                         β0 (constant)            6.55             0.13
                         β1 (area)                1.01             0.081
                         β2 (Itc,ARI=50)          1.73             0.24
QARI=100                 σ²                       0.35             0.075
                         β0 (constant)            6.47             0.34
                         β1 (area)                0.97             0.12
                         β2 (Itc,ARI=100)         1.50             0.29
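As an illustration of how the Table 15 posterior means would be applied in Equation (5.17), the sketch below evaluates the Q20 prediction equation for a hypothetical catchment. It assumes, as is common in this style of RFFA model though not stated explicitly in this extract, that area and Itc,ARI enter the equation as natural logarithms and that the log-space prediction is back-transformed by exponentiation; the catchment values used are invented for illustration only.

```python
import math

# Posterior mean coefficients for Q20 (Table 15, NSW, fixed region)
B0, B1, B2 = 5.91, 1.09, 1.79

def predict_q(area_km2, itc, b0=B0, b1=B1, b2=B2):
    """Evaluate ln(Q_ARI) = b0 + b1*ln(area) + b2*ln(Itc,ARI) and
    back-transform to Q. Log-transformed predictors are an assumption
    made for this illustration."""
    return math.exp(b0 + b1 * math.log(area_km2) + b2 * math.log(itc))

q_small = predict_q(area_km2=50.0, itc=2.0)   # hypothetical catchment
q_large = predict_q(area_km2=200.0, itc=2.0)  # larger catchment, same intensity
```

Because β1 and β2 are both positive in Table 15, the predicted quantile increases with both catchment area and design rainfall intensity.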
Also provided in Tables 16 and 17 is a summary of the statistical measures used (i.e.
AVPO, AVPN, AIC, BIC, BPV and pseudo R²GLS) to assess the best combination of
catchment characteristics to predict the three parameters and flood quantiles of the LP3
distribution. Figure 25 shows the MEV, the standard error of the MEV and the R²GLS values
for the skew model. Combination 9, with a constant and the two predictor variables area
and forest, showed the lowest MEV and the highest R²GLS, as well as the lowest AIC and
BIC values. However, the lowest AVPO and AVPN values were found for combination 1 (a
constant only, representing the intercept term in the regression model - see Figure 25).
The BPV values were used to carry out a hypothesis test (at the 5% significance level) on
the predictors of combination 9. The BPVs were found to be 6% and 7% for area and
forest, respectively, indicating that these variables are not significant, although these values
are not notably high. Both the posterior coefficients β1 and β2 were smaller than two
posterior standard deviations (for the respective case) away from zero, supporting the
result of the BPV test that these variables are not really significant.
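The "two posterior standard deviations" check applied above can be written compactly. A minimal sketch of the check as described in the text, using the Table 15 posterior moments: for the skew model, area (mean -0.092, sd 0.048) and forest (mean -0.094, sd 0.053) both fail the check, consistent with the BPV result, while area in the mean model (0.67, sd 0.053) passes it easily.

```python
def well_established(post_mean, post_sd, k=2.0):
    """True if the posterior mean lies more than k posterior standard
    deviations away from zero (the significance check used in the text)."""
    return abs(post_mean) > k * post_sd

# Table 15 posterior moments (NSW)
skew_area_ok = well_established(-0.092, 0.048)    # |-0.092| < 2*0.048 -> False
skew_forest_ok = well_established(-0.094, 0.053)  # |-0.094| < 2*0.053 -> False
mean_area_ok = well_established(0.67, 0.053)      # 0.67 > 2*0.053 -> True
```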
In this case, it may be possible to adopt a regional average skew value for the entire NSW
state without using any prediction equation/predictor variable in the regression equation.
This finding is consistent with Gruber and Stedinger (2008) who found that a constant
model for a regional skewness was the best model for a large region in the southeastern
part of the United States. This is also supported by the fact that there was only a modest
difference in the MEV values. Combinations 9 and 1, however, were both adopted and
tested in this study with the PRT approach.
A similar outcome was observed for the standard deviation model, where the MEVs were
very similar for combinations 12 and 1 (figure not shown due to space constraint).
Combination 12, which had slope (S1085) and rain as predictor variables, was adopted.
Indeed, the AVPO, AVPN, BIC and AIC values were the lowest for this combination. Both
the posterior coefficients β1 and β2 were well established in the regression equations, being
more than two times the respective posterior standard deviation away from zero. The BPVs
were 2%, indicating the relatively higher significance of these two variables.
For the mean flood, combination 6 (constant, area and 2I12) had the smallest MEV. The
posterior coefficients β1 and β2 in this combination were at least 5 and 11 times the
respective posterior standard deviation away from zero, which shows that β1 and β2 are well
established in the prediction equation. Indeed, all the statistical criteria were found to be in
favour of combination 6.
Figure 25 Selection of predictor variables for the BGLSR model for the skew (note that R²GLS uses the
right-hand axis; left panel: MEV, standard error of MEV and R²GLS; right panel: AVPO, AVPN, AIC and BIC)
Figure 26 shows an example plot of the statistics used in selecting the best set of predictor
variables for the fixed region flood quantile (QRT) models. According to the MEV values,
combinations 19, 18, 20, 23, 16, 6, 4, 25 and 10 were potential sets of predictor variables
for the Q10 model. Combinations 18, 19, 20 and 23 contained 3 to 4 predictor variables,
while combinations 16, 6, 4, 25 and 10 contained 2 predictor variables with similar MEVs
and R²GLS values.
The AVPO, AVPN, AIC and BIC values all favoured combination 10, and hence this was
finally selected as the best set of predictor variables for the Q10 model, which includes area
and design rainfall intensity Itc,10. Both posterior coefficients β1 and β2 were found to be 9
times the respective posterior standard deviation away from zero, suggesting that these two
variables are well established in the prediction equation. Indeed, based on similar findings,
combination 10 was selected for all the flood quantile prediction equations (ARIs = 2 – 100
years). The BPVs for the regression coefficients associated with the variables area and
design rainfall intensity Itc,ARI for the QRT over all the ARIs were found to be significant,
with values smaller than 0.01.
Figure 26 Selection of predictor variables for the BGLSR model for the Q10 model (note that R²GLS uses
the right-hand axis) (QRT, fixed region, NSW). MEV = model error variance, AVPO = average variance of
prediction (old), AVPN = average variance of prediction (new), AIC = Akaike information criterion, BIC
= Bayesian information criterion
Table 16 Summary of the catchment characteristics and statistical measures used in the stepwise regression for the parameters of the LP3 distribution for NSW
(Columns for each LP3 parameter: σ² | AVPO | AVPN | AIC | BIC | BPV (%) | R²GLS)

Mean (μ):
Comb.  Catchment characteristics^a   σ²     AVPO   AVPN   AIC    BIC    BPV (%)      R²GLS
1      Const                         0.92   0.94   0.92   1.22   1.22   0            0%
2      Const, area                   0.69   0.71   0.68   0.76   0.78   0, 0         39%
3      Const, area, 2I1              0.36   0.38   0.35   0.34   0.35   0, 0, 0      74%
4      Const, area, 50I1             0.34   0.36   0.34   0.38   0.40   0, 0, 0      70%
5      Const, area, 50I12            0.30   0.31   0.29   0.32   0.34   0, 0, 0      75%
6      Const, area, 2I12             0.28   0.30   0.28   0.31   0.32   0, 0, 0      76%
7      Const, area, S1085            0.63   0.66   0.62   0.70   0.74   0, 0, 0.4    45%
8      Const, area, sden             0.60   0.63   0.59   0.54   0.57   0, 0, 0.6    58%
9      Const, area, forest           0.69   0.72   0.68   0.78   0.82   0, 0, 60     39%
10     Const, area, evap             0.34   0.35   0.33   0.39   0.41   0, 0, 0.1    69%
11     Const, area, rain             0.29   0.31   0.29   0.31   0.33   0, 0, 0.1    76%
12     Const, rain, S1085            0.92   0.96   0.90   1.24   1.31   0, 37, 16    2%
13     Const, sden, S1085            0.91   0.94   0.89   1.15   1.21   0, 0.8, 82   9%
14     Const, evap, sden             0.88   0.92   0.86   1.05   1.11   0, 0.1, 36   18%
15     Const, forest                 0.91   0.94   0.90   1.17   1.21   0, 3         6%
16     Const, S1085, forest          0.91   0.95   0.89   1.18   1.24   0, 17, 2     7%

Standard deviation (σ):
Comb.  Catchment characteristics^a   σ²      AVPO   AVPN   AIC    BIC    BPV (%)      R²GLS
1      Const                         0.099   0.10   0.10   0.13   0.13   0            0%
2      Const, area                   0.098   0.10   0.10   0.13   0.13   0, 10        4%
3      Const, area, 2I1              0.097   0.10   0.10   0.13   0.13   0, 13, 19    6%
4      Const, area, 50I1             0.096   0.10   0.10   0.13   0.13   0, 10, 20    6%
5      Const, area, 50I12            0.094   0.10   0.09   0.12   0.13   0, 13, 10    8%
6      Const, area, 2I12             0.091   0.10   0.09   0.12   0.13   0, 14, 6     10%
7      Const, area, S1085            0.091   0.10   0.09   0.12   0.13   0, 29, 8     8%
8      Const, area, sden             0.099   0.10   0.10   0.13   0.14   0, 14, 58    4%
9      Const, area, forest           0.091   0.10   0.09   0.12   0.13   0, 5, 7      9%
10     Const, area, evap             0.098   0.10   0.10   0.13   0.13   0, 14, 26    6%
11     Const, area, rain             0.078   0.08   0.08   0.10   0.10   0, 40, 1     26%
12     Const, rain, S1085            0.066   0.07   0.07   0.09   0.09   0, 2, 1      35%
13     Const, sden, S1085            0.090   0.09   0.09   0.12   0.13   0, 60, 5     8%
14     Const, evap, sden             0.098   0.10   0.10   0.13   0.14   0, 27, 61    3%
15     Const, forest                 0.093   0.10   0.09   0.13   0.13   0, 11        4%
16     Const, S1085, forest          0.088   0.09   0.09   0.12   0.13   0, 7, 32     9%

Skewness (γ):
Comb.  Catchment characteristics^a   σ²       AVPO    AVPN    AIC     BIC     BPV (%)        R²GLS
1      Const                         0.0135   0.019   0.018   0.156   0.156   <0.1           0%
2      Const, area                   0.0132   0.021   0.021   0.080   0.082   <0.1, 3        50%
3      Const, area, 2I1              0.0131   0.025   0.024   0.079   0.083   <0.1, 3, 68    52%
4      Const, area, 50I1             0.0131   0.025   0.024   0.079   0.083   <0.1, 3, 72    52%
5      Const, area, 50I12            0.0132   0.025   0.024   0.080   0.084   <0.1, 3, 72    51%
6      Const, area, 2I12             0.0133   0.025   0.024   0.082   0.086   <0.1, 3, 86    50%
7      Const, area, S1085            0.0135   0.024   0.023   0.083   0.087   <0.1, 4, 92    49%
8      Const, area, sden             0.0134   0.024   0.023   0.083   0.088   <0.1, 4, 81    49%
9      Const, area, forest           0.0126   0.024   0.023   0.057   0.060   <0.1, 6, 7     65%
10     Const, area, evap             0.0133   0.026   0.025   0.076   0.080   <0.1, 2, 49    53%
11     Const, area, rain             0.0134   0.025   0.024   0.082   0.087   <0.1, 2, 87    49%
12     Const, rain, S1085            0.0140   0.025   0.025   0.148   0.156   0, 74, 87      10%
13     Const, sden, S1085            0.0139   0.025   0.024   0.140   0.148   0, 74, 51      14%
14     Const, evap, sden             0.0137   0.026   0.025   0.135   0.143   0, 50, 38      17%
15     Const, forest                 0.0127   0.021   0.020   0.078   0.080   0, 4           51%
16     Const, S1085, forest          0.0127   0.024   0.023   0.065   0.069   0, 17, 2       60%

a Const is a constant term. Refer to the text in Chapter 4 for a full description of the catchment characteristics predictor variables.
Table 17 Summary of the catchment characteristics and statistical measures used in the forward stepwise regression for the flood quantiles of the LP3
distribution (ARIs = 2, 10 and 100 years) for NSW
(Columns for each ARI: σ² | AVPO | AVPN | AIC | BIC | BPV (%) | R²GLS)

ARI = 2 years:
Comb.  Catchment characteristics^a                 σ²     AVPO   AVPN   AIC    BIC    BPV (%)               R²GLS
1      Const                                       0.94   0.96   0.94   1.26   1.26   0                     0%
2      Const, area                                 0.73   0.75   0.72   0.78   0.81   0, 0, 0               39%
3      Const, area, 2I1                            0.35   0.37   0.34   0.38   0.40   0, 0, 0               71%
4      Const, area, 2I12                           0.31   0.33   0.31   0.33   0.35   0, 0, 0               75%
5      Const, area, 50I1                           0.34   0.36   0.34   0.36   0.38   0, 0, 0               73%
6      Const, area, 50I12                          0.31   0.33   0.31   0.33   0.35   0, 0, 0               75%
7      Const, area, S1085                          0.74   0.77   0.73   0.80   0.85   0, 0, 69              39%
8      Const, area, sden                           0.66   0.69   0.65   0.72   0.76   0, 0, 0.3             45%
9      Const, area, sden, forest                   0.65   0.68   0.63   0.72   0.78   0, 0, 1, 9            46%
10     Const, area, Itc,ARI                        0.29   0.33   0.31   0.33   0.35   0, 0, 0               75%
11     Const, area, forest                         0.69   0.72   0.67   0.76   0.80   0, 0, 2               42%
12     Const, area, evap                           0.61   0.64   0.60   0.65   0.69   0, 0, 0.2             50%
13     Const, area, rain                           0.34   0.36   0.34   0.36   0.38   0, 0, 0.2             73%
14     Const, rain, S1085                          0.90   0.94   0.88   1.06   1.12   0, 0, 4               19%
15     Const, sden, S1085                          0.93   0.97   0.91   1.21   1.28   0, 15, 2              8%
16     Const, area, 50I12, S1085                   0.37   0.39   0.36   0.23   0.25   0, 0, 0, 40           83%
17     Const, area, 50I12, rain                    0.29   0.31   0.29   0.32   0.35   0, 0, 0, 0.4          76%
18     Const, area, 50I12, S1085, forest           0.37   0.39   0.36   0.25   0.28   0, 0, 0, 48, 79       72%
19     Const, area, 50I12, Itc,ARI, forest         0.37   0.39   0.35   0.22   0.25   0, 0, 15, 16, 70      74%
20     Const, area, 50I12, Itc,ARI, S1085, forest  0.37   0.40   0.35   0.24   0.28   0, 0, 15, 18, 70, 78  73%
21     Const, area, Itc,ARI, rain                  0.30   0.32   0.29   0.32   0.35   0, 0, 0, 2            76%
22     Const, area, Itc,ARI, evap                  0.32   0.34   0.31   0.34   0.37   0, 0, 0, 86           74%
23     Const, area, Itc,ARI, forest                0.37   0.39   0.36   0.23   0.25   0, 0, 0, 98           73%
24     Const, area, Itc,ARI, S1085                 0.37   0.39   0.36   0.23   0.25   0, 0, 0, 92           73%
25     Const, area, 2I1, Itc,ARI                   0.32   0.34   0.31   0.35   0.38   0, 0, 46, 0           74%

ARI = 10 years:
Comb.  Catchment characteristics^a                 σ²     AVPO   AVPN   AIC    BIC    BPV (%)               R²GLS
1      Const                                       0.89   0.91   0.89   1.16   1.16   0                     0%
2      Const, area                                 0.54   0.56   0.53   0.52   0.53   0, 0, 0               56%
3      Const, area, 2I1                            0.23   0.25   0.24   0.26   0.28   0, 0, 0               78%
4      Const, area, 2I12                           0.23   0.24   0.23   0.26   0.27   0, 0, 0               78%
5      Const, area, 50I1                           0.25   0.27   0.25   0.28   0.29   0, 0, 0               77%
6      Const, area, 50I12                          0.22   0.24   0.23   0.25   0.27   0, 0, 0               79%
7      Const, area, S1085                          0.54   0.57   0.53   0.52   0.55   0, 0, 34              56%
8      Const, area, sden                           0.46   0.49   0.46   0.55   0.58   0, 0, 0.2             55%
9      Const, area, sden, forest                   0.48   0.51   0.47   0.56   0.61   0, 0, 1, 90           54%
10     Const, area, Itc,ARI                        0.23   0.24   0.23   0.26   0.27   0, 0, 0               79%
11     Const, area, forest                         0.54   0.57   0.54   0.51   0.54   0, 0, 40              57%
12     Const, area, evap                           0.38   0.40   0.38   0.38   0.40   0, 0, 0               69%
13     Const, area, rain                           0.35   0.37   0.35   0.43   0.45   0, 0, 0               64%
14     Const, rain, S1085                          0.86   0.90   0.85   1.07   1.12   0, 6, 1               11%
15     Const, sden, S1085                          0.88   0.91   0.86   1.10   1.16   0, 25, 0.1            9%
16     Const, area, 50I12, S1085                   0.22   0.24   0.22   0.26   0.28   0, 0, 0, 35           79%
17     Const, area, 50I12, rain                    0.23   0.25   0.23   0.26   0.28   0, 0, 0, 22           79%
18     Const, area, 50I12, S1085, forest           0.21   0.24   0.22   0.25   0.28   0, 0, 0, 55, 75       80%
19     Const, area, 50I12, Itc,ARI, forest         0.21   0.24   0.21   0.25   0.28   0, 0, 22, 43, 70      80%
20     Const, area, 50I12, Itc,ARI, S1085, forest  0.22   0.24   0.22   0.26   0.30   0, 0, 23, 44, 95, 90  80%
21     Const, area, Itc,ARI, rain                  0.23   0.25   0.23   0.26   0.29   0, 0, 0, 76           78%
22     Const, area, Itc,ARI, evap                  0.23   0.25   0.23   0.26   0.29   0, 0, 0, 80           79%
23     Const, area, Itc,ARI, forest                0.22   0.24   0.22   0.25   0.27   0, 0, 0, 8            79%
24     Const, area, Itc,ARI, S1085                 0.23   0.25   0.23   0.26   0.29   0, 0, 0, 50           79%
25     Const, area, 2I1, Itc,ARI                   0.23   0.25   0.23   0.26   0.28   0, 0, 59, 1           79%

ARI = 100 years:
Comb.  Catchment characteristics^a                 σ²     AVPO   AVPN   AIC    BIC    BPV (%)               R²GLS
1      Const                                       0.87   0.89   0.87   1.21   1.21   0                     0%
2      Const, area                                 0.52   0.54   0.52   0.64   0.66   0, 0, 0               48%
3      Const, area, 2I1                            0.35   0.38   0.36   0.42   0.45   0, 0, 0               67%
4      Const, area, 2I12                           0.35   0.37   0.35   0.36   0.38   0, 0, 0               72%
5      Const, area, 50I1                           0.35   0.38   0.36   0.42   0.44   0, 0, 0               67%
6      Const, area, 50I12                          0.35   0.38   0.36   0.41   0.43   0, 0, 0               68%
7      Const, area, S1085                          0.52   0.55   0.52   0.65   0.69   0, 0, 63              48%
8      Const, area, sden                           0.49   0.52   0.49   0.63   0.66   0, 0, 0.5             50%
9      Const, area, sden, forest                   0.49   0.52   0.48   0.63   0.69   0, 0, 1, 20           51%
10     Const, area, Itc,ARI                        0.35   0.38   0.36   0.44   0.46   0, 0, 0               65%
11     Const, area, forest                         0.53   0.56   0.52   0.65   0.69   0, 0, 59              48%
12     Const, area, evap                           0.45   0.48   0.45   0.59   0.63   0, 0, 0.4             53%
13     Const, area, rain                           0.40   0.43   0.41   0.50   0.53   0, 0, 0.1             61%
14     Const, rain, S1085                          0.85   0.89   0.83   1.17   1.23   0, 36, 0.7            8%
15     Const, sden, S1085                          0.85   0.89   0.84   1.16   1.22   0, 27, 0.1            8%
16     Const, area, 50I12, S1085                   0.35   0.38   0.35   0.42   0.46   0, 0, 0, 62           67%
17     Const, area, 50I12, rain                    0.35   0.38   0.35   0.42   0.46   0, 0, 0, 28           67%
18     Const, area, 50I12, S1085, forest           0.35   0.38   0.35   0.33   0.37   0, 0, 0, 55, 79       75%
19     Const, area, 50I12, Itc,ARI, forest         0.34   0.38   0.35   0.33   0.37   0, 0, 10, 80, 90      75%
20     Const, area, 50I12, Itc,ARI, S1085, forest  0.35   0.39   0.35   0.36   0.42   0, 0, 27, 90, 95, 90  73%
21     Const, area, Itc,ARI, rain                  0.35   0.38   0.35   0.44   0.48   0, 0, 0, 81           66%
22     Const, area, Itc,ARI, evap                  0.35   0.39   0.36   0.45   0.49   0, 0, 0, 95           65%
23     Const, area, Itc,ARI, forest                0.35   0.38   0.35   0.40   0.43   0, 0, 0, 98           69%
24     Const, area, Itc,ARI, S1085                 0.35   0.38   0.35   0.45   0.49   0, 0, 0, 95           65%
25     Const, area, 2I1, Itc,ARI                   0.35   0.38   0.35   0.43   0.47   0, 0, 49, 0           67%
aConst is a constant term. Refer to text in Chapter 4 for a full description of the catchment characteristics predictor variables.
5.5 REGION OF INFLUENCE VS. FIXED REGIONS FOR PARAMETER
AND QUANTILE REGRESSION TECHNIQUES
5.5.1 REGRESSION DIAGNOSTICS – PSEUDO ANALYSIS OF VARIANCE
The pseudo analysis of variance (ANOVA) tables for the Q20 model and the parameters of
the LP3 distribution (only the mean and skew are shown due to space constraint) are presented
in Tables 18 to 20 for the fixed and ROI regions for NSW, VIC and QLD. The pseudo ANOVA
table describes how the total variation among the ŷi values (predicted values) can be
apportioned between that explained by the model error and the sampling error. This is an
extension of the ANOVA for OLSR, which does not recognise and correct for the expected
sampling variance (Reis et al., 2005). An error variance ratio (EVR) is used in the pseudo
ANOVA, defined as the ratio of the sampling error variance to the model error variance. An
EVR greater than 0.20 may indicate that the sampling variance is not negligible compared with
the model error variance, which suggests the need for a GLSR analysis (Gruber et al., 2007).
For the LP3 parameters, the sampling error (i.e. the EVR) increases as the order of the moment
increases, as can be clearly seen for all three states in Tables 18 and 19. For example,
for NSW the EVR for the mean flood model for the ROI is 0.3 (i.e. the sampling error is only 0.3
times the model error) (see Table 18), while the corresponding EVR value for the skew model
(Table 19) is 18 (i.e. the sampling error is 18 times the model error). The ROI shows a
reduced model error variance for all three states (i.e. a reduced heterogeneity), in
particular for the mean flood model, as compared to the fixed regions. For example, for the
NSW state (Table 18) the model error variances for the fixed region and ROI are 27.7 and 16.5,
respectively. It was found that the model error dominated the regional analysis for the mean
flood and the standard deviation models (results not shown) for both the fixed regions and
the ROI for all the states.
For the ROI, the mean flood model also shows a much higher model error variance than
those of the standard deviation and skew models. These results, based on the model error
variance alone, indicate that the mean flood has a greater level of heterogeneity associated
with its regionalisation as compared to the standard deviation and skew. The ROI, however,
shows a higher EVR than the fixed regions; e.g. for the mean flood model for NSW, the EVR
is 0.30 for the ROI and 0.17 for the fixed region (see Table 18). Table 18 also provides the
EVR results for the VIC and QLD states, which show a similar outcome to NSW. For the
standard deviation model for NSW the EVR is 0.77 for the ROI and 0.35 for the fixed region;
again, similar results were found for the VIC and QLD states.
The EVR values for the skew models of NSW, VIC and QLD are shown in Table 19. It can
be observed that these EVR values range from 8.4 to 19 for the fixed regions and from 9.5 to
19 for the ROI, which are much higher than the recommended limit of 0.20. In this regard,
two important points may be noted:
(i) This result clearly indicates that the GLSR is the preferred modelling option over
the OLSR for the skew model. An OLSR model for the skew would have clearly
given misleading results, as it does not distinguish between the model and
sampling errors, as found in similar previous studies (e.g. Reis et al., 2005 and
Haddad et al., 2010b).
(ii) Importantly, if a method of moments estimator had been used to estimate the
model error variance (σ²) for the skew model, the model error variance would
have been grossly underestimated, as the sampling error heavily dominated the
regional analysis. A more reasonable estimate of the model error variance has
been achieved with the Bayesian procedure, as it represents the values of σ² by
computing expectations over the entire posterior distribution. Similar results
were found by Reis et al. (2005), Gruber and Stedinger (2008) and Haddad et al.
(2010b). As far as the ROI approach is concerned, there is little change in the
EVR values as compared to the fixed region approach for all three states, as the
skew model tends to include more stations in the regional analysis.
Table 18 Pseudo ANOVA table for the mean flood model (PRT, fixed region and ROI,
NSW, VIC and QLD states). Here n = number of sites in the region, k = number of predictors
in the regression equation, EVR = error variance ratio, σ²(0) = model error variance when no
predictor variable is used in the regression model, σ² = model error variance when predictor
variables are used in the regression model and tr[Λ(ŷ)] = sum of the diagonals of the
sampling covariance matrix.

NSW
Source           DF (fixed)        DF (ROI)          Sum of squares     Fixed region   ROI
Model            k = 3             k = 3             n(σ²(0) − σ²)      61.5           61.2
Model error      n − k − 1 = 92    n − k − 1 = 32    n(σ²)              27.7           16.5
Sampling error   n = 96            n = 36            tr[Λ(ŷ)]           5              4.5
Total            2n − 1 = 191      2n − 1 = 71       sum of the above   94             83
EVR                                                                     0.17           0.3

VIC
Model            k = 3             k = 3                                46             45
Model error      n − k − 1 = 127   n − k − 1 = 39                       37.5           28
Sampling error   n = 131           n = 43                               6.1            6
Total            2n − 1 = 261      2n − 1 = 85       sum of the above   90             79
EVR                                                                     0.16           0.2

QLD
Model            k = 3             k = 3                                105            102
Model error      n − k − 1 = 168   n − k − 1 = 34                       39             22
Sampling error   n = 172           n = 38                               10.2           9
Total            2n − 1 = 343      2n − 1 = 75       sum of the above   155            133
EVR                                                                     0.26           0.40

Table 19 Pseudo ANOVA table for the skew model (PRT, fixed region and ROI, NSW, VIC
and QLD states) (variables are explained in the Table 18 caption)

NSW
Source           DF (fixed)        DF (ROI)          Fixed region   ROI
Model            k = 3             k = 3             0.1            0.1
Model error      n − k − 1 = 92    n − k − 1 = 91    1.22           1.21
Sampling error   n = 96            n = 95            24             23
Total            2n − 1 = 191      2n − 1 = 189      25             23
EVR                                                  19             18

VIC
Model            k = 3             k = 3             6.5            7.3
Model error      n − k − 1 = 127   n − k − 1 = 113   4.5            3.7
Sampling error   n = 131           n = 117           38             35
Total            2n − 1 = 261      2n − 1 = 233      49             48
EVR                                                  8.4            9.5

QLD
Model            k = 3             k = 3             0.11           0.65
Model error      n − k − 1 = 168   n − k − 1 = 146   2.6            2.1
Sampling error   n = 172           n = 150           45             40
Total            2n − 1 = 343      2n − 1 = 299      48             43
EVR                                                  17             19
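The EVR reported in Tables 18 and 19 can be reproduced directly from the pseudo ANOVA sums of squares. The sketch below is illustrative only, not the thesis code; with the NSW mean-flood entries of Table 18, the fixed-region sums of squares of 5 (sampling) and 27.7 (model error) give an EVR of about 0.18, matching the tabulated 0.17 to within rounding.

```python
def error_variance_ratio(sampling_ss, model_error_ss):
    """EVR = sampling error variance / model error variance, computed
    from the pseudo ANOVA sums of squares tr[Lambda(y_hat)] and n*sigma^2."""
    return sampling_ss / model_error_ss

# NSW mean flood model (Table 18)
evr_fixed = error_variance_ratio(5.0, 27.7)  # ~0.18 (tabulated: 0.17)
evr_roi = error_variance_ratio(4.5, 16.5)    # ~0.27 (tabulated: 0.3)

# An EVR above ~0.20 signals that the sampling variance is not
# negligible, favouring a GLSR analysis over OLSR
needs_glsr = evr_roi > 0.20
```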
The pseudo ANOVA tables were also prepared for all the flood quantile models (i.e. the
QRT models). The results for Q20 for all three states are shown in Table 20. Here the ROI
shows higher EVR values than the fixed region. Also, the sampling error generally
increases with increasing ARI. The reduction in the model error variance seen in Table 20
for all three states is due to the fact that the ROI found an optimum number of sites based
on the minimum model error variance, which generally uses fewer sites than the fixed
region approach. This indeed suggests that sub-regions may exist within the larger states.
The flood quantile Q2 was found to have the lowest EVR values for NSW and QLD for
both the fixed region and ROI as compared to the Q20 and Q100 model results. This reflects
the much greater spatial variability of the mean, which is dominated by local catchment
factors (as compared to the higher moments); this is reflected in the Q2 flood as it is very
close to the mean flood magnitude. The Q20 model shows an EVR of 0.43, 0.30 and 0.97
for NSW, VIC and QLD, respectively (see Table 20) for the ROI approach, which suggests
that the BGLSR combined with the ROI should be the preferred option when modelling the
larger ARI quantiles, even though in this particular case the ROI has been impacted by the
relatively large model error variances that have dominated the regional flood quantile
modelling results.
Table 20 Pseudo ANOVA table for the Q20 model (QRT, fixed region and ROI for NSW, VIC
and QLD states) (variables are explained in the Table 18 caption)

NSW
Source           DF (fixed)        DF (ROI)          Fixed region   ROI
Model            k = 3             k = 3             61.1           61.1
Model error      n − k − 1 = 92    n − k − 1 = 48    23.5           17.3
Sampling error   n = 96            n = 52            7.6            7.0
Total            2n − 1 = 191      2n − 1 = 103      92             86
EVR                                                  0.32           0.43

VIC
Model            k = 3             k = 3             45.2           45.2
Model error      n − k − 1 = 127   n − k − 1 = 48    55.2           24.4
Sampling error   n = 131           n = 52            7.4            7.2
Total            2n − 1 = 261      2n − 1 = 103      108            77
EVR                                                  0.13           0.30

QLD
Model            k = 3             k = 3             59             46
Model error      n − k − 1 = 168   n − k − 1 = 77    25             12
Sampling error   n = 172           n = 81            13             12
Total            2n − 1 = 343      2n − 1 = 161      97             70
EVR                                                  0.53           0.97

5.5.2 REGRESSION DIAGNOSTICS – MODEL ADEQUACY AND OUTLIER
ANALYSIS
To assess the underlying model assumptions (i.e. the normality of the residuals), the plots of
the standardised residuals [Equation (3.42)] vs. fitted quantiles were examined for all the flood
quantiles (estimated from the QRT and PRT) and the parameters of the LP3 distribution for all
three states. The predicted values were obtained from the LOO validation procedure.
Figure 27 shows the plot for the Q20 model for the state of NSW.

Figure 27 Plots of the standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT,
fixed region and ROI, NSW)
If the underlying model assumptions are satisfied to a large extent, the standardised residual
values should not exceed the ± 2 limits; in practice, 95% of the standardised residuals should
fall between ± 2. The result in Figure 27 reveals that the flood quantiles developed from the
prediction equations satisfy the normality-of-residuals assumption quite satisfactorily for both
the fixed region and ROI approaches. Also, no specific pattern (heteroscedasticity) can be
identified in the standardised residuals, which are almost equally distributed below and above
zero. What is noteworthy is that the ROI clearly produces fewer genuine outliers, for the
quantiles estimated by both the QRT and PRT methods, than the fixed region approach. This
indeed demonstrates the superiority of the ROI over the fixed region approach. Similar
results were observed for the states of VIC and QLD. The figures associated with VIC and
QLD can be seen in Appendix B.
The QQ-plots of the standardised residuals [Equation (3.42)] vs. normal score [Equation
(3.43)] for the fixed region (based on LOO validation) and ROI were then examined. The
results for the Q20 model for NSW are shown in Figure 28, which reveals that all the points
closely follow a straight line; this is especially noticeable for the ROI approach for both the
QRT and PRT methods. This indicates that the assumption of normality and the homogeneity
of variance of the standardised residuals are better approximated with the ROI approach.
Overall, no genuine outliers can be detected for the flood quantiles estimated by the QRT and
PRT on a regional scale.
Figure 28 QQ-plot of the standardised residuals vs. normal (Z) score for ARI of 20 years (QRT and PRT,
fixed region and ROI, NSW)
If the standardised residuals are indeed normally and independently distributed N(0, 1), with
mean 0 and variance 1, then the slope of the best-fit line in the QQ-plot, which can be
interpreted as the standard deviation of the normal score (Z score) of the quantile, should
approach 1, and the intercept, which is the mean of the normal score of the quantile, should
approach 0 as the number of sites increases. Figure 28 indeed shows that the fitted lines for
the developed models pass approximately through the origin (0, 0) and have a slope
approximately equal to one. It can be seen that the results of the ROI approach satisfy the
model assumptions relatively better than those of the fixed region approach; the superiority
of the ROI approach is again demonstrated here. Similar results were observed for the VIC
and QLD states; the figures associated with VIC and QLD can be seen in Appendix B. The
assumption of the normality of the residuals for all three states (NSW, VIC and QLD)
could not be rejected at the 10% level of significance using the Anderson-Darling and
Kolmogorov-Smirnov tests for normality.
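The slope/intercept check described above is straightforward to carry out numerically. The sketch below is illustrative only, using synthetic N(0, 1) residuals in place of the thesis data; the normal scores are computed with the Blom plotting position, one common choice.

```python
import random
from statistics import NormalDist, mean

def qq_line(residuals):
    """Fit normal score = a + b * standardised residual by least squares.
    Normal scores use the Blom plotting position (i - 3/8) / (n + 1/4)."""
    n = len(residuals)
    xs = sorted(residuals)
    nd = NormalDist()
    ys = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

random.seed(1)
resid = [random.gauss(0.0, 1.0) for _ in range(400)]
intercept, slope = qq_line(resid)  # expect intercept near 0, slope near 1
```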
Presented below are the residual analysis results of the ROI method for the PRT using
weighted regional average standard deviation and skew values, which are weighted by the
error covariance matrix (i.e. no predictor variables are considered in the regression equation
in this case), for the state of NSW (as an example). The main aim of this analysis is to
determine whether there is any appreciable loss in accuracy and efficiency, especially in the
flood quantile estimation of the mid to higher ARIs (i.e. 20 to 100 years), when using a
weighted regional average standard deviation and skew (obtained as above) as compared to
ones with predictor variables. It should be stressed here that this weighted regional average
standard deviation and skew do vary from site to site, as each site has a unique ROI.
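A minimal sketch of the weighted regional averaging described above, under the simplifying assumption of a diagonal error covariance matrix (so each at-site estimate is weighted by the inverse of its error variance; the thesis uses the full covariance matrix, and the numbers here are invented):

```python
def inverse_variance_average(estimates, error_variances):
    """Weighted regional average with weights 1/var_i (a diagonal
    approximation of weighting by the error covariance matrix)."""
    weights = [1.0 / v for v in error_variances]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

# Sites with smaller error variance pull the average towards themselves
avg_skew = inverse_variance_average([-0.2, -0.5], [0.01, 0.04])  # -> -0.26
```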
The standardised residuals vs. fitted quantile plot for Q20 is shown in Figure 29, which
superimposes the estimates made by the QRT-ROI, the PRT-ROI and the PRT-ROI that uses a
weighted regional average standard deviation and skew estimate. Indeed, one can observe
that the PRT-ROI estimate of Q20 with the weighted regional average standard deviation and
skew performs equally as well as the competing models. Nearly all the standardised residuals
fall within the ± 2 limits, suggesting that the use of predictor variables in the estimation of the
standard deviation and skew does not really add much meaningful information to the
analysis. The QQ-plot (Figure 30) of the competing models shows that the use of a weighted
regional average standard deviation and skew does not result in any major gross errors in the
final quantile estimates. The residual analysis also reveals that the major assumptions of the
regression have been largely satisfied (i.e. normality of the residuals). The results based on
the evaluation statistics are given in section 5.5.4.
Figure 29 Plots of the standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT,
ROI and PRT-ROI with weighted average standard deviation and skew, NSW)
Figure 30 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, ROI, and
PRT ROI with weighted average standard deviation and skew, NSW)
5.5.3 DIAGNOSTIC STATISTICS
The summary of the various regression diagnostics (as described in section 3.8 and Equation
(3.41)) is provided in Table 21 for the NSW, VIC and QLD states. This shows that, for the
mean flood model (for all three states), the MEV and SEP are much higher than those of the
standard deviation and skew models. This indicates that the mean flood model exhibits a
higher degree of heterogeneity than the standard deviation and skew models, which supports
the pseudo ANOVA results. Indeed, the issue here is that the sampling error becomes larger
as the order of the moment increases; therefore, in the case of the skew, the spatial variation
is a second-order effect (compared to the sampling variability) that is not really detectable.
This is apparent in both the fixed region and ROI cases.
Table 21 Regression diagnostics for the fixed region and ROI for NSW, VIC and QLD
Model   Fixed region                           ROI
        MEV     AVP     SEP (%)   R²GLS (%)    MEV     AVP     SEP (%)   R²GLS (%)

NSW
Mean    0.29    0.31    60        76           0.19    0.23    51        84
Stdev   0.058   0.062   25        37           0.046   0.054   23        46
Skew    0.013   0.024   16        65           0.013   0.023   16        65
Q2      0.31    0.33    63        77           0.20    0.24    52        84
Q5      0.23    0.24    52        79           0.16    0.20    47        85
Q10     0.23    0.24    52        79           0.16    0.20    46        85
Q20     0.25    0.27    55        76           0.18    0.22    49        83
Q50     0.35    0.37    66        70           0.25    0.28    56        74
Q100    0.35    0.38    68        65           0.29    0.34    63        70

VIC
Mean    0.29    0.31    60        62           0.21    0.23    46        63
Stdev   0.044   0.049   22        65           0.041   0.050   21        65
Skew    0.034   0.040   20        70           0.028   0.037   19        73
Q2      0.27    0.28    57        63           0.20    0.23    51        65
Q5      0.29    0.31    60        61           0.20    0.23    50        64
Q10     0.35    0.37    67        57           0.23    0.26    54        61
Q20     0.35    0.37    67        57           0.19    0.22    48        66
Q50     0.47    0.49    80        49           0.27    0.32    61        61
Q100    0.59    0.60    91        45           0.29    0.35    64        54

QLD
Mean    0.23    0.24    52        77           0.14    0.15    40        78
Stdev   0.13    0.14    38        34           0.056   0.061   24        46
Skew    0.015   0.024   16        44           0.014   0.026   16        44
Q2      0.26    0.27    56        75           0.15    0.18    43        79
Q5      0.17    0.18    44        79           0.08    0.11    34        83
Q10     0.18    0.19    45        74           0.07    0.11    33        79
Q20     0.15    0.16    41        77           0.07    0.13    36        80
Q50     0.17    0.19    45        72           0.10    0.14    39        77
Q100    0.20    0.22    49        72           0.12    0.16    40        73
For the mean flood model (all three states), the ROI shows a smaller MEV than the fixed region analysis. The lower MEV in turn also yields the lower AVP values, as can be seen in Table 21. Also, the R²GLSR values for the mean flood model with the ROI case are 8%, 1% and 1% higher than the fixed region for NSW, VIC and QLD, respectively. These results indicate that the ROI should be preferred over the fixed region for developing the mean flood model.
For the standard deviation model, the ROI shows a 2% smaller SEP and a 9% higher R²GLSR value for NSW. The best result is found for QLD, where the ROI shows a 14% smaller SEP and a 12% higher R²GLSR value. This indicates that the ROI is preferable to the fixed region for the standard deviation model. The SEP and R²GLSR values for the skew model are the same for the fixed region and ROI for NSW and QLD (see Table 21). This can be explained by the fact that the number of sites for the skew model in the ROI approach was very close to that of the fixed region approach.
Interestingly, Table 21 shows that the SEP values for all the flood quantile models for NSW, VIC and QLD are 5% to 11%, 6% to 27% and 5% to 13% smaller, respectively, for the ROI case than for the fixed region one. Also, the R²GLSR values for the ROI case for NSW, VIC and QLD are 4% to 7%, 2% to 12% and 1% to 5% higher, respectively, than for the fixed region case. These results show the relative advantage of the ROI approach coupled with BGLSR over a fixed region BGLSR, with further improvements achieved overall.
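The AVP and SEP values reported in Table 21 are directly linked. A minimal sketch is given below, assuming the relation SEP = 100·√(exp(AVP) − 1), which is commonly used in GLSR-based RFFA studies for moments estimated in natural-log space; it reproduces the Table 21 values, suggesting it is the form behind the diagnostics of section 3.8 (the function name is illustrative):

```python
import math

def sep_percent(avp: float) -> float:
    """Standard error of prediction (%) from the average variance of
    prediction (AVP) of a natural-log-space regression:
    SEP = 100 * sqrt(exp(AVP) - 1)."""
    return 100.0 * math.sqrt(math.exp(avp) - 1.0)

# Spot-check against Table 21 (NSW, fixed region): AVP -> SEP (%)
for avp, sep_table in [(0.31, 60), (0.062, 25), (0.024, 16)]:
    print(f"AVP = {avp:5.3f}  ->  SEP = {sep_percent(avp):4.1f}%  (table: {sep_table}%)")
```

Running this reproduces the tabulated SEP values to the nearest percent, which is a useful consistency check when transcribing regression diagnostics.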
Table 22 shows the number of sites in a region, the associated MEVs and their percentage (%) differences for the ROI against the fixed region models for NSW, VIC and QLD. The ROI mean flood model for all three states uses fewer sites on average (36 out of 96, i.e. 37% of the available sites for NSW, 32% for VIC and 24% for QLD) than the standard deviation and skew models. The ROI skew model for each state has the highest number of sites, including nearly all the sites in the respective states. The MEVs for all the flood quantile ROI models are smaller than those of the fixed region models, with differences of up to about 60%. This shows that the fixed region models experience greater heterogeneity than the ROI. If the fixed region models are made too big, the model error is likely to be inflated by heterogeneity unaccounted for by the catchment characteristics predictor variables. Two important points should be noted here: spatial proximity (physical distance) may become a surrogate for unknown processes in regional flood frequency analysis (RFFA), and the catchment characteristics variables available at the regional scale may not always be sufficient indicators of regional flood behaviour. In fact, these regional models are simplistic in their form, predictor variables and data representation; there is a good deal of lumping and approximation involved, along with many simplifying assumptions. Hence, regional flood models can never be highly accurate within the current modelling and data regime.
Table 22 Model error variances (σ̂²) associated with the fixed region and ROI for NSW, VIC and QLD (n = number of sites needed for the LP3 parameters and flood quantiles)

NSW                   Mean    Stdev   Skew    Q2      Q5      Q10     Q20     Q50     Q100
ROI n                 36      47      95      31      42      48      52      53      55
ROI σ̂²                0.19    0.046   0.013   0.20    0.16    0.16    0.18    0.25    0.29
Fixed region n        96      96      96      96      96      96      96      96      96
Fixed region σ̂²       0.29    0.058   0.013   0.21    0.23    0.23    0.25    0.35    0.35
(%) diff in σ̂²        34%     21%     0%      5%      30%     30%     28%     29%     17%

VIC
ROI n                 43      83      117     41      45      52      52      57      57
ROI σ̂²                0.21    0.041   0.028   0.20    0.20    0.23    0.19    0.27    0.29
Fixed region n        131     131     131     131     131     131     131     131     131
Fixed region σ̂²       0.29    0.044   0.034   0.27    0.29    0.35    0.35    0.47    0.59
(%) diff in σ̂²        28%     7%      18%     26%     31%     34%     46%     43%     51%

QLD
ROI n                 42      65      150     60      65      74      80      88      90
ROI σ̂²                0.15    0.056   0.014   0.14    0.08    0.07    0.07    0.10    0.12
Fixed region n        172     172     172     172     172     172     172     172     172
Fixed region σ̂²       0.23    0.14    0.015   0.26    0.17    0.18    0.15    0.17    0.20
(%) diff in σ̂²        35%     60%     7%      46%     53%     61%     53%     41%     40%
Figure 31 plots the spatial variation of the MEVs (grouped in classes according to the numerical values specified in the legend) for the mean flood model (Figure 31a), and how the MEV varies with the number of sites within the ROI for a typical site (Figure 31b), for the state of NSW. The plot reveals the relative advantage of the ROI approach: there are distinct spatial variations illustrating the heterogeneity of the mean flood model that would often be ignored in a fixed region approach. Similar results were observed in both the VIC and QLD states.
The spatial variation in the model error for the skew model mostly covers the entire study area (figure not shown) for NSW, VIC and QLD. Similar results were found by Hackelbusch et al. (2009). The significance of this finding is that if any spatial variations exist in the hydrologic statistic of interest, they are most likely to be captured by the ROI.
Figure 31 Spatial variations of the grouped minimum model error variances for (a) mean flood model
and (b) number of sites which produced the lowest predictive variance for the mean flood model
5.5.4 EVALUATION STATISTICS
An objective assessment of the developed models can be made using the numerical evaluation statistics given in Equations (3.45) and (3.44), in which RMSEr is the relative root mean squared error and REr is the absolute median relative error. The RMSEr is associated with the predictive error variance, whereas REr relates mostly to prediction bias. Using the model predicted flood quantiles (estimated by QRT and PRT, with fixed and ROI regions) from the LOO validation, the evaluation statistics were calculated. These are given in Table 23.
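In terms of computation, the two statistics can be sketched as follows (a minimal illustration assuming the usual forms behind Equations (3.44) and (3.45) — relative errors of predicted quantiles with respect to observed at-site quantiles; function and variable names are illustrative):

```python
import numpy as np

def rmse_r(q_pred, q_obs):
    """Relative root mean squared error (%): sensitive to predictive variance."""
    rel = (np.asarray(q_pred) - np.asarray(q_obs)) / np.asarray(q_obs)
    return 100.0 * np.sqrt(np.mean(rel ** 2))

def re_r(q_pred, q_obs):
    """Absolute median relative error (%): mostly reflects prediction bias."""
    rel = np.abs(np.asarray(q_pred) - np.asarray(q_obs)) / np.asarray(q_obs)
    return 100.0 * np.median(rel)

# Toy example: predictions that are uniformly 20% high
q_obs = np.array([100.0, 250.0, 400.0])
q_pred = np.array([120.0, 300.0, 480.0])
print(rmse_r(q_pred, q_obs), re_r(q_pred, q_obs))  # both ≈ 20%
```

Because REr uses the median of absolute relative errors, it is less affected by a few badly predicted sites than RMSEr, which is why the two statistics separate bias from accuracy in Table 23.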
Numerical values of these statistics show the relative advantage of the ROI approach (for both the QRT and PRT) for all three states (i.e. NSW, VIC and QLD). The flood quantile estimates obtained from the fixed regions (QRT and PRT) are more biased (i.e. higher REr) and of lower accuracy (i.e. higher RMSEr) in all three states.
[Figure 31 map panels (a) and (b) cover New South Wales, the Australian Capital Territory and Victoria; the legend groups the minimum model error variances into the classes MEV = 0–0.11, 0.12–0.16, 0.17–0.19, 0.20–0.21 and ≥ 0.24.]
Table 23 Evaluation statistics (RMSEr and REr) from LOO validation for NSW (results for PRT using the weighted regional average standard deviation and skew models, i.e. no predictor variables, given in brackets), VIC and QLD

                 RMSEr (%)                          REr (%)
         PRT                QRT                PRT                QRT
         Fixed    ROI       Fixed    ROI       Fixed    ROI       Fixed    ROI
NSW
Q2       73       62 (63)   68       59        46       38 (37)   44       40
Q5       65       54 (59)   70       59        37       30 (32)   38       36
Q10      67       56 (60)   74       55        37       29 (33)   37       36
Q20      72       57 (63)   83       53        36       34 (34)   35       31
Q50      81       70 (77)   100      67        38       34 (35)   36       32
Q100     90       75 (85)   100      72        40       36 (39)   38       35
VIC
Q2       56       55        77       68        38       37        37       37
Q5       69       68        87       68        38       36        35       35
Q10      82       80        107      69        37       37        36       35
Q20      96       92        112      74        41       40        38       33
Q50      115      110       113      95        41       40        41       40
Q100     130      127       140      120       46       45        44       44
QLD
Q2       82       69        61       56        39       35        39       39
Q5       68       60        48       44        33       34        34       32
Q10      69       60        52       47        34       30        32       31
Q20      72       65        50       44        35       33        31       29
Q50      78       68        53       49        37       36        32       31
Q100     85       79        58       53        41       40        36       31
For the QRT and PRT (fixed region), Table 23 shows that there is not much difference in accuracy (RMSEr) for the NSW, VIC and QLD states. In relation to bias (REr), the QRT and PRT fixed region models were also found to be very similar for the three states.
For the QRT and PRT (ROI region), a similar result was found: there was no notable difference in accuracy (RMSEr) between the competing models, and both the QRT-ROI and PRT-ROI models achieved very similar bias (REr) values, as seen in Table 23. While Table 23 does show slightly better accuracy and bias for the QRT over the PRT, a point needs to be brought out to clarify this result. There is some underlying bias involved in the validation of the QRT (fixed and ROI) in that the predicted quantiles are being compared to the quantiles used in the regression analysis as dependent variables. Thus the result mostly appears to be slightly in favour of the QRT (see Table 23). How to compensate for this bias in the validation process needs further effort, which has not been attempted in this thesis. On the other hand, the validation procedure for the PRT is more stringent in that the parameters of the distribution are used in the regression and the quantiles are then independently estimated and compared to the at-site flood quantiles. The results from the evaluation statistics therefore indicate that the PRT is indeed a viable approach for RFFA, as an alternative to the commonly applied QRT method, in the ungauged catchment application.
Below, the results based on the evaluation statistics (i.e. Equations (3.45) and (3.44)) are presented to compare the flood quantiles from the PRT-ROI using a weighted regional average standard deviation and skew with those from the PRT-ROI using a standard deviation and skew expressed as a function of predictor variables, for the state of NSW. The evaluation statistics (see Table 23, values in brackets) from the validation reveal that there is no real loss of accuracy (as compared to at-site flood quantiles) if a weighted regional average standard deviation and skew model is adopted to estimate the flood quantiles up to the 20 years ARI.
The results at the higher ARIs (50 and 100 years) show that using a weighted regional average standard deviation and skew may slightly degrade the outcome of the analysis (i.e. lower accuracy and greater bias). Estimation at the larger ARIs may require further information, which may be provided by including predictor variables (such as catchment area, design rainfall intensity, forest and mean annual rainfall) in the standard deviation model, as found in this study. This issue deserves further investigation before larger ARI flood quantiles are estimated from weighted average standard deviation and skew estimates that do not use any predictor variables.
The evaluation statistics presented above relate to a particular aspect of the model validation over all six ARIs for all three states. It is now worth looking at the overall performance of the different models (QRT and PRT, with fixed and ROI regions) based on a ratio statistic and a 'case score analysis'. The ratio is defined as Qpred/Qobs (i.e. rr) and gives an indication of the degree of bias (i.e. systematic over- or underestimation), where a value of 1 indicates good 'average' agreement between Qpred and Qobs. Here the Qpred values were
obtained from the LOO validation (fixed and ROI) using the developed QRT or PRT model. The distributions of the Qpred/Qobs ratio values for the state of NSW are shown in Figure 32 for the 5, 20 and 100 years ARIs. For the 5 years ARI, the PRT-ROI shows the best results, as the median ratio is the closest to the line corresponding to Qpred/Qobs = 1 (the 1-line) and the overall spread of the ratio values is the smallest. For the 20 years ARI, the QRT-ROI median ratio is closer to the 1-line than the PRT-ROI; however, the overall spread of the ratio values for the QRT-ROI and PRT-ROI is very similar. For the 100 years ARI, the QRT-ROI shows noticeable overestimation and the PRT-ROI shows some underestimation, as its median ratio value is located just below the 1-line.
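The ratio diagnostic summarised in Figure 32 can be sketched as follows (illustrative only — in the thesis the Qpred values come from the LOO validation of the QRT or PRT model; function name and toy values are invented):

```python
import numpy as np

def ratio_summary(q_pred, q_obs):
    """Summarise the rr = Qpred/Qobs ratios: a median near 1 indicates low
    bias on average; the interquartile range (IQR) measures the spread of
    the ratios (the 'box' of the boxplot)."""
    rr = np.asarray(q_pred) / np.asarray(q_obs)
    q25, q50, q75 = np.percentile(rr, [25, 50, 75])
    return {"median": q50, "iqr": q75 - q25}

# Toy example: four sites with predictions close to the observed quantiles
rr_stats = ratio_summary([95.0, 110.0, 210.0, 390.0],
                         [100.0, 100.0, 200.0, 400.0])
print(rr_stats)
```

A full boxplot per ARI and per model, as in Figure 32, is simply this summary (plus whiskers and outliers) computed over all validation catchments.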
Figure 32 Boxplots of Qpred/Qobs ratios for NSW for QRT and PRT, with fixed and ROI regions
Considering all three states, a case score analysis of the Qpred/Qobs ratio values is presented below. The criteria for the case score analysis are given in Chapter 3, section 3.9. The models are assessed on the basis of which one receives the most desirable estimation on average over all the cases (i.e. 6 ARIs and 399 catchments, giving 2394 cases in total for each of the PRT and QRT, combining NSW, VIC and QLD). Based on the criteria set out in section 3.9, out of the 2394 cases, the QRT and PRT with fixed region produce 1881 and 1829 cases respectively with a 'desirable estimation', equivalent to 78% and 76% of the cases. The QRT and PRT fixed region show a 'gross underestimation' in 11% and 13% of cases, respectively, while a 'gross overestimation' occurs in 11% of the cases for each.
The QRT-ROI and PRT-ROI methods provide 83% and 80% of cases with a 'desirable estimation'. A 'gross underestimation' is associated with 9% of cases for both the QRT and PRT, while a 'gross overestimation' occurs for 8% and 11% of the cases for the QRT-ROI and PRT-ROI, respectively. It can be seen that in both the fixed and ROI regions there are cases where the results do not have a very high degree of accuracy. Such results are typical of RFFA methods (see Rahman, 2005) and are somewhat expected given the simplistic nature of RFFA models, which involve many simplified assumptions. For example, the addition of a greater number of predictor variables and/or the use of a more complex model form may increase accuracy marginally, but such gains are generally not significant as far as the practical application of RFFA methods is concerned (e.g. see Rahman et al., 1999a). Also, the error in the at-site flood frequency analysis estimates (which form the base case for comparison) needs to be kept in perspective. While improvements are seen in the ROI approach for the QRT and PRT, there remain a few cases where the estimates are not of high accuracy. Further investigation is needed to identify the reasons for such a high degree of error, which has not been done in this thesis. On average, however, only modest differences can be found between the QRT-ROI and PRT-ROI estimates for the majority of the cases (see Table 23).
In examining the cases where most of the 'gross overestimation' and 'gross underestimation' occurred, it was found that the PRT in some cases underestimated the at-site flood quantiles for the larger ARIs (50 and 100 years). Interestingly, it was also found that the QRT in many cases overestimated the lower ARI (2 and 5 years) at-site flood quantiles. These results were found for a range of catchment sizes across all the states.
What can be concluded overall from this evaluation is that the PRT does not provide less accurate estimates than the commonly applied QRT method. In fact, the PRT is a useful way to check the results from the QRT to make sure the estimates make sense, especially where the QRT results may not increase smoothly with ARI.
5.6 SECTION SUMMARY
The main objective of sections 5.4 and 5.5 was to compare BGLSR approaches in fixed region and ROI frameworks that seek to minimise the Bayesian model error variance (predictive uncertainty). For this purpose, data from 452 small to medium sized catchments in eastern Australia (covering the Tasmania, VIC, NSW and QLD states) were used. Prediction equations were developed for the flood quantiles of ARIs of 2 to 100 years using the QRT
and for the first three moments of the LP3 distribution (i.e. PRT). Using a method similar to
forward stepwise regression and adopting a number of statistical selection criteria it was
possible to identify the optimal regression models to use in the ROI approach.
It was found that area and design rainfall intensity were significant predictors for the
estimation of the flood quantiles in these states using QRT, while area, design rainfall
intensity, mean annual evaporation, mean annual rainfall, main stream slope and forest were
relatively significant in the estimation of the second and third parameters of the LP3
distribution. LOO validation indicated that the ROI approach, based on the minimisation of the predictive uncertainty, leads to more efficient and accurate flood quantile estimates for both the QRT and PRT. The regression diagnostics revealed that the catchment variables alone may not capture all the heterogeneity in the regional model. Both the BGLSR QRT-ROI and BGLSR PRT-ROI showed improvements in regional heterogeneity, with an increase in the average pseudo coefficient of determination and a decrease in the model error variance, average variance of prediction and average standard error of prediction.
Both the standardised residual and QQ-plots for the ROI approach satisfied the underlying regression model assumptions better than those for the fixed region. It was shown that both the BGLSR
QRT-ROI and BGLSR PRT-ROI produce smaller average RMSEr and REr values when
compared to the fixed region regression approach. Based on the evaluation statistics overall it
was found that there are only modest differences between the BGLSR QRT-ROI and BGLSR
PRT-ROI which suggests that the PRT is a viable alternative to QRT in RFFA.
The RFFA methods developed in this study were based on the database available in eastern Australia. It is expected that the availability of a more comprehensive database (in terms of both quality and quantity) will further improve the predictive performance of both the fixed region and ROI based RFFA methods presented in this study; this should be investigated in future when such a database becomes available.
5.7 UNCERTAINTY ESTIMATION FOR NEW SOUTH WALES, VICTORIA,
QUEENSLAND AND TASMANIA IN A ROI-PRT FRAMEWORK
Here, uncertainty in design flood estimation is examined in a BGLSR multivariate normal
distribution framework, in that the posterior variance of each flood statistic (i.e. mean,
standard deviation, and skew) was combined and the correlation structure between statistics
was preserved to assess the uncertainty associated with the flood quantiles (see section 3.10,
Equations 3.50 to 3.52 and Figure 4). It should be noted that this method only considers the
uncertainty arising from the estimation of the flood statistics i.e. sampling errors and inter-
site correlation (as mentioned in section 3.5.3 and Equation 3.31). Other uncertainties were
not considered, such as measurement errors and uncertainty about the choice of distribution.
This method was applied to all the six ARIs and selected sites in the study regions for NSW,
VIC, QLD and TAS. As an example, the results are shown for four catchments, one from each of the four states, with varying record lengths (i.e. NSW = 29 years, VIC = 41 years, QLD
= 62 years and TAS = 24 years). Figure 33 plots the 95% confidence bands from the Monte
Carlo simulation with 10,000 simulation runs (and the FLIKE at-site confidence bands) along
with the at-site and regional estimation. Figure 33 shows that the predicted (expected) quantiles (blue triangles) generally match the observed at-site FFA estimates (black circles) well; however, the result for TAS is relatively poor. It is also reassuring that the quantiles increase with increasing ARI. Taking the case of site 203012 for NSW and ARI = 100 years, the confidence interval ranges from 303 m3/s to 1597 m3/s, which indicates a medium to large uncertainty. However, the result may not be considered poor, as it matches up reasonably well with the FLIKE at-site confidence limit values (409 m3/s to 2513 m3/s). Overall, the uncertainty bands estimated for the regional approach were larger than the at-site ones, which is as expected. This may be because the BGLSR model corrects for sampling variability and because there is generally more uncertainty associated with regional estimation. Finally, it can also be seen that the uncertainty increases considerably with increasing ARI. In any case, the framework presented here provides a relatively reliable basis for uncertainty analysis, which would be of great benefit in real-world applications.
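As a sketch of the simulation step described above (an illustration only, with invented posterior values): the three LP3 statistics in log space are drawn jointly from a multivariate normal posterior, so the correlation between them is preserved; each draw is converted to a quantile and the 2.5/97.5 percentiles of the draws form the confidence band. The Wilson–Hilferty approximation is used here for the LP3 frequency factor, which is an assumption — the thesis may evaluate the LP3 quantile exactly.

```python
import numpy as np
from statistics import NormalDist

def lp3_quantile(m, s, g, ari):
    """LP3 quantile from log-space mean m, std s and skew g, using the
    Wilson-Hilferty frequency-factor approximation."""
    z = NormalDist().inv_cdf(1.0 - 1.0 / ari)
    if abs(g) < 1e-8:
        k = z
    else:
        k = (2.0 / g) * ((1.0 + g * z / 6.0 - g * g / 36.0) ** 3 - 1.0)
    return np.exp(m + k * s)

def mc_confidence_band(mu, cov, ari, n_sim=10_000, seed=42):
    """Draw (mean, std, skew) jointly from a multivariate normal posterior
    and return the 2.5th, 50th and 97.5th percentiles of the quantile."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mu, cov, size=n_sim)
    # abs(s) guards against the rare negative std draw from the normal
    q = np.array([lp3_quantile(m, abs(s), g, ari) for m, s, g in draws])
    return np.percentile(q, [2.5, 50.0, 97.5])

# Invented posterior for illustration (log-space mean, std, skew)
mu = np.array([5.5, 0.9, -0.2])
cov = np.array([[0.020, 0.004, 0.001],
                [0.004, 0.010, 0.002],
                [0.001, 0.002, 0.050]])
lo, med, hi = mc_confidence_band(mu, cov, ari=100)
print(f"Q100 ~ {med:.0f} m3/s (95% CI {lo:.0f} - {hi:.0f})")
```

Keeping the off-diagonal covariance terms is the key design point: ignoring the correlation between the mean, standard deviation and skew would misstate the width of the quantile confidence band.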
Figure 33 Design flood quantile estimation and confidence limits curves for ARIs of 2 to 100 years
5.8 SUMMARY
This chapter has developed and compared flood prediction equations for the states of New
South Wales, Victoria, Queensland and Tasmania (for 6 ARIs, Q2 to Q100). Both fixed
regions and ROI approaches in a QRT and PRT framework were used, where the quantiles
and parameters (i.e. mean, standard deviation and skew) of the LP3 distribution were
regressed against catchment characteristics predictor variables. The BGLSR procedure was
adopted for the estimation of the regression model coefficients. To assess the performances
of the developed prediction equations a LOO validation procedure was adopted. Overall, it
was found that the QRT-ROI and PRT-ROI perform very similarly and that the PRT is a viable alternative for design flood estimation in ungauged catchments. The developed prediction equations allow design flood or flood statistic estimates, along with their associated uncertainty (in the form of confidence limits), to be made at any ungauged catchment given the relevant catchment characteristics data.
CHAPTER 6: RESULTS - MODEL VALIDATION USING LOO AND
MCCV
6.1 GENERAL
This chapter presents the results of the comparison of the leave-one-out (LOO) and Monte Carlo cross validation (MCCV) techniques in a hydrological regression framework. Both ordinary least squares regression (OLSR) and generalised least squares regression (GLSR) are applied to the experimental and real datasets. This chapter aims to outline the overall advantages and disadvantages of the proposed methods for model selection and validation.
The basic theory and assumptions associated with the LOO and MCCV, in both an OLSR and a GLSR framework, are discussed in Chapter 3.
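The two validation schemes differ only in how the data are split: LOO leaves each site out exactly once, whereas MCCV repeatedly holds out a random validation set of nv sites. A minimal OLSR sketch follows (illustrative only — the thesis applies the same idea with GLSR weighting; the synthetic data and function names are invented):

```python
import numpy as np

def ols_fit_predict(x_cal, y_cal, x_val):
    """Fit OLS (with intercept) by least squares and predict at x_val."""
    a_cal = np.column_stack([np.ones(len(x_cal)), x_cal])
    beta, *_ = np.linalg.lstsq(a_cal, y_cal, rcond=None)
    a_val = np.column_stack([np.ones(len(x_val)), x_val])
    return a_val @ beta

def loo_msep(x, y):
    """Leave-one-out: n calibration runs, each leaving one site out."""
    errs = [y[i] - ols_fit_predict(np.delete(x, i, axis=0),
                                   np.delete(y, i), x[i:i + 1])[0]
            for i in range(len(y))]
    return float(np.mean(np.square(errs)))

def mccv_msep(x, y, nv, n_splits=200, seed=0):
    """Monte Carlo CV: repeatedly hold out a random set of nv sites."""
    rng = np.random.default_rng(seed)
    sq = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        val, cal = idx[:nv], idx[nv:]
        pred = ols_fit_predict(x[cal], y[cal], x[val])
        sq.extend((y[val] - pred) ** 2)
    return float(np.mean(sq))

# Synthetic example: one informative predictor plus one irrelevant one
rng = np.random.default_rng(1)
x = rng.normal(size=(60, 2))        # x[:, 1] is irrelevant
y = 2.0 + 1.5 * x[:, 0] + rng.normal(scale=0.5, size=60)
print(loo_msep(x, y), mccv_msep(x, y, nv=20))
```

Run over candidate predictor subsets, the subset minimising the cross-validated MSEP is selected; the results below show how this choice differs between LOO and MCCV.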
6.1.1 PUBLICATIONS
A journal paper (ERA rank A*) based on this chapter has been accepted for publication. The paper is reproduced in Appendix A and is referenced as follows.
Haddad, K., Rahman, A., Zaman, M. and Shrestha, S. (2013). Applicability of Monte Carlo
Cross Validation Technique for Model Development and Validation Using Generalised Least
Squares Regression. Journal of Hydrology, doi.org/10.1016/j.jhydrol.2012.12.041.
6.2 RESULTS
6.2.1 PREDICTORS USED
The summary statistics of the predictor variables used in this analysis are provided in Table 24, while Table 25 presents the correlations between the log-transformed predictor variables. It can be seen that there is significant collinearity and multicollinearity between the design rainfall intensities (correlations ranging from 0.73 to 0.94), medium correlation between rain and evap (0.52) and between evap and the design rainfall intensities (ranging from 0.40 to 0.58), and modest correlation between sden and rain (0.27) and between sden and evap (0.36).
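Such a correlation screen is produced directly from the matrix of log-transformed predictors. A minimal sketch with invented data (the column names echo Table 25 but the values are synthetic, constructed so that the two intensity columns move together):

```python
import numpy as np

# Illustrative log-transformed predictor columns (rows = catchments)
rng = np.random.default_rng(7)
base = rng.normal(size=100)
predictors = {
    "2I1":  base + 0.1 * rng.normal(size=100),   # intensities share a driver
    "2I12": base + 0.1 * rng.normal(size=100),
    "rain": 0.6 * base + rng.normal(size=100),
    "sden": rng.normal(size=100),                # largely unrelated
}
names = list(predictors)
mat = np.vstack([predictors[n] for n in names])  # variables as rows
corr = np.corrcoef(mat)
for i, n in enumerate(names):
    print(n, np.round(corr[i], 2))
```

The two intensity columns come out highly correlated, mimicking the collinearity among design rainfall intensities seen in Table 25; such pairs should not both enter a regression model.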
Table 24 Summary of predictor variables (here log10 is used)

Predictor variable                        Minimum   Maximum   Mean     Standard deviation
log(area) (km2)                           2.08      6.92      5.43     1.12
log(2I12) (mm/h)                          1.29      2.49      1.77     0.30
log(2I1) (mm/h)                           2.97      3.91      3.33     0.23
log(50I12) (mm/h)                         1.94      3.27      2.46     0.36
log(50I1) (mm/h)                          1.62      1.97      1.76     0.10
log(Itc,ARI), ARI = 10-year (mm/h)        1.94      3.58      2.58     0.42
log(Itc,ARI), ARI = 100-year (mm/h)       2.35      3.97      3.02     0.43
log(evap) (mm)                            6.89      7.34      7.11     0.10
log(rain) (mm)                            6.23      7.58      6.87     0.28
log(sden) (km/km2)                        -0.66     1.70      0.92     0.47
log(S1085) (m/km)                         0         3.91      2.2      0.81
log(forest) (fraction)                    -4.61     0         -1.01    1.08
Table 25 Correlation between the log10 predictor variables used in the analysis
            area   2I1    2I12   50I1   50I12  Itc,ARI=10  Itc,ARI=100  rain   evap   sden   S1085  forest
area 1.00
2I1 -0.08 1.00
2I12 -0.09 0.94 1.00
50I1 0.02 0.94 0.88 1.00
50I12 -0.07 0.92 0.97 0.90 1.00
Itc, ARI=10 -0.70 0.73 0.76 0.65 0.75 1.00
Itc, ARI=100 -0.67 0.73 0.75 0.67 0.77 0.99 1.00
rain -0.24 0.68 0.77 0.54 0.71 0.66 0.63 1.00
evap -0.13 0.58 0.53 0.40 0.49 0.43 0.40 0.52 1.00
sden -0.19 0.31 0.30 0.22 0.26 0.29 0.28 0.27 0.36 1.00
S1085 -0.28 -0.09 -0.02 -0.02 0.02 0.17 0.19 -0.07 -0.27 0.07 1.00
forest 0.15 0.20 0.32 0.23 0.34 0.12 0.14 0.27 -0.07 0.20 0.31 1.00
6.2.2 SIMULATED DATA
A number of simulation runs were undertaken on different models with varying random errors. Here we discuss the simulation based on the model given by Equations (3.68) and (3.69). The results for the OLSR are summarised in Tables 26 and 27, while the results for the GLSR are provided in Tables 28 and 29. The summary tables also provide the results based on the true-model MSEP for both the OLSR and GLSR models.
For the LOO (i.e. nv = 1), the selected model tends to include a greater number of predictor variables than required, as evidenced by the inclusion of many more predictor variables than at the higher nv values. This feature is evident for both the OLSR and GLSR techniques. As an example, in Tables 26 and 27, for the OLSR LOO (where nv = 1) with σ = 1, x1 alone is selected in only 42% (210/500) of the cases, while for σ = 0.2, x1 is selected in 51% (253/500) of the cases. The GLSR results also suffer from overfitting (see Tables 28 and 29); however, the chances of selecting the right model do increase with the GLSR. As an example, for σδ = 0.95, x1 is selected in 53% (263/500) of the cases, while for σδ = 0.25, x1 is selected in 64% (318/500) of the cases. Another important aspect of the LOO, for both the OLSR and GLSR, is that it tends to underestimate the MSEP of the true model and calibration data set as compared to the higher nv values. Figure 34 illustrates this: as nv increases, the MSEP also increases. It is thus evident that the LOO lends itself to overfitting the selected regional regression model.
For the MCCV case, when nv = 45, x1 is included in 475 and 492 instances for the OLSR when σ = 1 and σ = 0.2, respectively. This gives MSEPs of 3.50 and 1.49 for the CMCCV case (see Tables 26 and 27), as compared to 491 and 499 instances for the GLSR for σδ = 0.95 and σδ = 0.25, respectively, with CMCCV MSEPs of 1.61 and 0.52 (see Tables 28 and 29). For nv = 1, for both the OLSR and GLSR, the MSEPs are 1.68, 0.61, 0.27 and 0.050 (see Tables 26 to 29), which are relatively smaller than the LOO values for the calibration data set (i.e. 2.02, 0.77, 0.41 and 0.11). This implies that the LOO, particularly with the OLSR, has a much higher chance of selecting a larger model (i.e. a model with a higher number of predictor variables). From Tables 26 to 29 it can be seen that the MSEP values based on the model selected by the LOO are always greater than the true MSEPs (e.g. by 8%, i.e. (2.02 − 1.87)/1.87, in Table 26 for the OLSR when σ = 1).
Tables 26 and 27 also reveal that collinearity (i.e. between variables x1 and x2) is more prominent for the OLSR LOO case, especially when the random errors are highly spread (σ = 1). This can also be seen in Figure 34 for nv = 1, where the combined variables x1 and x2 have MSEP values relatively close to that of variable x1 alone. For the GLSR, collinearity is not a major issue for either σδ = 0.95 or σδ = 0.25 (see Tables 28 and 29) and the varying cross correlation between sites.
For example, x1 and x2 (which were made highly correlated; see Chapter 3, section 3.11.4) appear in the model many more times in the OLSR (e.g. 155, 187, ... times in Table 26) than in the GLSR (e.g. 93, 105, ... times in Table 28). Since the GLSR analysis recognises the sampling error as a component separate from the total error, it seems that the GLSR can distinguish between the predictor variables much better than the OLSR. Because sampling error and model error are lumped together in the OLSR, the OLSR model pushes for more predictor variables to compensate for the higher model uncertainty. From these results, it can be seen that the GLSR, with a relatively high spread of error (e.g. σδ = 0.95) and modest correlation between sites, provides reasonable results with the LOO validation as compared to the OLSR LOO case. Hence, it may be concluded that the LOO is better suited to the GLSR than to the OLSR in regional hydrologic regression.
From Tables 26 to 29 the following points may be noted. The chance for the MCCV to
select the true model (that includes only x1 as predictor) increases with increasing nv. This
can be observed with both the OLSR and GLSR models; however, the results for the GLSR
are slightly better. Uncertainty is therefore reduced for the model selected by the MCCV
(i.e. decrease in over fitting). What is also noticeable from Tables 26, 27, 28 and 29 is that
as nv increases some of the predictor variable combinations are not selected at all (i.e.
shown as zero in the table). This illustrates that in most cases the MCCV method would
choose the best model. Looking at Figure 34 for nv = 25 and 35, it is evident that both the
OLSR and GLSR MCCV select the predictor variable x1 consistently better than any other
variable. This is especially true for the GLSR MCCV, as it has the smaller MSEPs.
When the MSEP (i.e. predictive variance) is smaller and when there is medium to high
correlation between sites, the GLSR MCCV should be the preferred option for validation
CHAPTER 6
166
(as evident in Figure 34). The GLSR with modest cross correlation and larger random
errors also provides relatively better results in most cases. In addition, the collinearity
seems to have no major influence in choosing the correct predictor variable for the MCCV
case (see Figure 34, i.e. nv = 15, 25 and 35). Furthermore, the GLSR appears to be the
superior regression approach when the model errors are modest and when there is
reasonable sampling uncertainty from site to site.
In all cases, the MSEP values for the MCCV depend significantly on nv. From Tables 26 to 29, it is clear that using the MCCV to estimate the MSEP of the selected model when nv > 25 may not be appropriate, as the MSEP increases with nv (the calibration set shrinks as nv increases). In nearly all cases for the OLSR and GLSR, across the varying random errors and cross correlations, the MCCV estimates the MSEP of the selected model with a level of accuracy similar to that of the CMCCV (for Equations 3.68 and 3.69), and the CMCCV stays within acceptable limits of the true MSEP up to nv = 25. The CMCCV may therefore be a good candidate for estimating the overall prediction ability of the selected model, as it tends to stay within acceptable limits around the MSEP of the selected model. Thus, with nv = 15 to 25 (representing 30% to 50% of the catchments), the MCCV and CMCCV estimate the MSEP with reasonable accuracy.
Table 26 Results from simulated data, OLSR when σ² = 1 (model as in Equation (3.68))

     Frequencies of variables being selected        Values of optimal MSEP
nv   x1    x1,x3   x2,x3   x1,x2   x2 or x3         LOO/MCCV   CMCCV   TMSEP
1    210   83      17      155     35               1.68       2.02    1.87
15   290   25      0       187     3                2.48       2.38    2.54
20   410   37      0       53      0                2.03       1.94    2.15
25   410   5       0       85      0                1.99       1.91    1.97
30   418   30      0       53      0                2.49       2.40    2.34
35   423   5       0       73      0                2.99       2.89    2.80
40   445   3       0       55      0                3.60       3.47    3.51
45   475   0       0       25      0                3.74       3.50    3.58
Figure 34 The mean squared error of prediction (MSEP) associated with LOO and MCCV for OLSR and GLSR simulations
Table 27 Results from simulated data, OLSR when σ² = 0.04 (model as in Equation (3.68))

     Frequencies of variables being selected        Values of optimal MSEP
nv   x1    x1,x3   x2,x3   x1,x2   x2 or x3         LOO/MCCV   CMCCV   TMSEP
1    253   48      33      98      68               0.61       0.77    0.74
15   280   25      17      163     16               0.72       0.63    0.76
20   365   60      0       75      0                1.28       1.19    1.39
25   393   55      0       52      0                1.16       1.08    1.27
30   445   25      0       30      0                1.31       1.22    1.26
35   469   15      0       16      0                1.42       1.32    1.38
40   481   8       0       11      0                1.59       1.46    1.51
45   492   3       0       5       0                1.72       1.49    1.65
Table 28 Results from simulated data, GLSR when σ² = 0.903 and average cross-correlation ρ̂(ŷi, ŷj) = 0.30

Frequencies of variables being selected and values of optimal MSEP (based on Eq. 3.69):

nv   x1    x1,x3   x2,x3   x1,x2   x2 or x3   LOO    MCCV   CMCCV
1    263   3       88      93      53         0.27   0.41   0.36
15   350   28      15      105     3          0.40   0.30   0.47
20   370   83      35      13      0          0.54   0.45   0.64
25   397   80      23      0       0          0.94   0.86   1.10
30   420   65      5       10      0          1.44   1.35   1.42
35   460   25      5       10      0          1.53   1.43   1.50
40   483   15      0       2       0          1.71   1.58   1.65
45   491   8       0       1       0          1.80   1.61   1.72
Table 29 Results from simulated data, GLSR when σ² = 0.063 and average cross-correlation ρ̂(ŷi, ŷj) = 0.70

Frequencies of variables being selected and values of optimal MSEP (based on Eq. 3.69):

nv   x1    x1,x3   x2,x3   x1,x2   x2 or x3   LOO     MCCV    CMCCV
1    318   68      35      45      35         0.050   0.11    0.095
15   470   0       0       10      20         0.081   0.065   0.088
20   475   10      2       5       8          0.12    0.10    0.136
25   480   10      0       10      0          0.12    0.10    0.132
30   480   15      0       5       0          0.21    0.13    0.18
35   488   5       5       3       0          0.22    0.13    0.20
40   491   3       3       3       0          0.32    0.22    0.27
45   499   1       0       0       0          0.63    0.52    0.54
6.2.3 APPLICATION WITH OBSERVED REGIONAL FLOOD DATA IN NSW
Of the 12 predictor variables shown in Table 24, some may have only minor effects
on the estimation of the 10-year and 100-year average recurrence interval (ARI) flood
quantiles (Q10, Q100). In order to select the best set of predictor variables for the regression
models, LOO and MCCV in the OLSR and GLSR frameworks were initially applied to the
calibration data set (60 sites were selected randomly out of the 96 as the calibration data
set). The results are listed in Tables 30 and 31. The optimal OLSR and GLSR LOO both
select three predictor variables. The obtained models along with some summary statistics
are provided in Table 32.
In the MCCV (considering nv = 15, 20, 25 and 30 catchments during the validation and
undertaking 500 simulations), the optimal OLSR and GLSR MCCV each select two
predictor variables as shown in Table 32.
From a goodness-of-fit perspective, there is no notable difference between the
models presented in Table 32: the coefficients of the regression equations are very
similar, and the summary statistics (i.e. R2/R2GLSR and the standard error of prediction,
SEP (%)) also show some resemblance between the OLSR and GLSR. However, when
comparing the performances of the four different models from Tables 30 and 31 on the
prediction data sets, the differences can be clearly illustrated. Initially from Tables 30 and
31, it can be clearly seen that the GLSR models provide the lowest MSEPs for the LOO
and the MCCV suggesting that the sampling errors have had a relatively notable impact in
the analysis. From Table 30, the OLSR LOO provides an MSEP of 0.11, which is
significantly larger than 0.042, the MSEP based on the OLSR MCCV (for nv = 25). From
Table 31, the GLSR LOO provides an MSEP of 0.092, which is also significantly larger
than 0.016, the MSEP based on the GLSR MCCV case (for nv = 20, 25 and 30).
Tables 30 and 31 clearly indicate that the LOO validation (for both the OLSR and GLSR)
has included one additional, unnecessary predictor variable in the Q10 model. What is also
striking is that the OLSR and GLSR at nv ≥ 20 both select the same predictor variables
even though there was considerable multicollinearity between the potential predictor
variables (as shown in Table 25). This shows that the MCCV is not adversely affected by
multicollinearity and that MCCV would most often provide the best model when significant
multicollinearity is present. From Tables 30 and 31, the MSEP values can be considered to
be relatively smaller for both the OLSR and GLSR; this is however more noticeable for the
GLSR, which again reiterates the fact that when the random errors are relatively smaller,
the MCCV is likely to provide the best results for both the OLSR and GLSR cases.
What is noteworthy is the relatively better correction given by the CMCCV to estimate the
MSEP when nv = 20 for both the OLSR and GLSR (see Tables 30 and 31). As nv increases,
the reliability of the CMCCV is also reasonable even though there are fewer sites for model
building and the error for the CMCCV to estimate MSEP may increase a little in this
situation. It is thus found that the MCCV selects a better model (with smaller number of
predictor variables) than the LOO for both the OLSR and GLSR cases. The results in Tables
30 and 31 are mostly in agreement with the results from the numerical experiments.
Figure 35 shows the graphical results of the prediction errors (i.e. predicted - observed) of
the predicted flood quantile obtained by the regression equations in Table 32 for the 36
validation catchments against at-site flood frequency estimates. Clearly the prediction
errors are smaller for the GLSR LOO and GLSR MCCV cases. The prediction performance
is better for the MCCV models in both cases. This shows the typical manifestation of
over-fitting often caused by the LOO validation approach. Typically, the results look good
for the LOO for the calibration data set; however, when one needs to predict future samples
(i.e. ungauged catchment prediction) MCCV should be used in selecting the optimal
hydrologic regression models. This would lead to less uncertainty in regional flood quantile
estimation. These results also suggest that the GLSR MCCV provides the
best model and validation procedure as compared to the OLSR.
Table 30 OLSR analysis, MSEP values for calibration and validation data sets (observed
data from NSW); log10 is used throughout

                         MSEP on calibration set    MSEP on validation set
nv   Model variables*    LOO     MCCV    CMCCV      Model by LOO   Model by MCCV
1    1, 5, 8             0.048                      0.11
15   1, 5, 7             0.050   0.045   0.048
20   1, 5                0.048   0.044   0.041
25   1, 5                0.049   0.045   0.042
30   1, 5                0.049   0.045   0.042

*Corresponding predictor variables: 1. log(area); 2. log(2I12); 3. log(2I1); 4. log(50I12); 5. log(Itc,ARI); 6. log(evap); 7. log(rain); 8. log(sden); 9. log(S1085); 10. log(forest).
Table 31 GLSR analysis, MSEP values for calibration and validation data sets (observed
data from NSW); log10 is used throughout

                         MSEP on calibration set    MSEP on validation set
nv   Model variables*    LOO     MCCV    CMCCV      Model by LOO   Model by MCCV
1    1, 5, 8             0.019                      0.092
15   1, 5, 6             0.020   0.017   0.021
20   1, 5                0.018   0.016   0.016
25   1, 5                0.018   0.016   0.016
30   1, 5                0.019   0.017   0.016

*Corresponding predictor variables: 1. log(area); 2. log(2I12); 3. log(2I1); 4. log(50I12); 5. log(Itc,ARI); 6. log(evap); 7. log(rain); 8. log(sden); 9. log(S1085); 10. log(forest).
Table 32 OLSR and GLSR analysis for LOO and MCCV for Q10, optimal models shown
along with summary statistics

Regression type/   Regression equation                                       R2 / R2GLSR   SEP (%)
validation
OLSR LOO           2.50 + 1.13log(area) + 1.85log(Itc-10) + 0.07log(sden)    79%           32%
GLSR LOO           2.51 + 1.13log(area) + 1.80log(Itc-10) + 0.05log(sden)    81%           29%
OLSR MCCV          2.49 + 1.14log(area) + 1.88log(Itc-10)                    79%           33%
GLSR MCCV          2.51 + 1.13log(area) + 1.82log(Itc-10)                    81%           30%
Figure 35 Prediction error plot for Q10 results (models selected by OLSR and GLSR LOO and models
selected by OLSR and GLSR MCCV); prediction error plotted against site number for the 36 validation catchments
The MSEP values for the quantile estimate for Q100 are listed in Table 33. Initially, LOO is
carried out on the calibration data set of 60 catchments. The optimal OLSR LOO selects 4
predictor variables in the model, which are log(area), log(Itc_100), log(rain) and log(S1085)
and the optimal GLSR LOO selects 3 predictor variables, which are log(area), log(Itc_100)
and log(rain). The obtained model along with the summary statistics is provided in Table
34.
MCCV was then carried out using both the OLSR and GLSR on the validation data set of
36 catchments. Leaving out 50% of the catchments at a time (i.e. nv = 18) for validation and
performing Monte Carlo simulation 500 times, it is found that the optimal OLSR MCCV
selects 3 predictor variables (log(area) , log(Itc_100) and log(rain)) while the optimal GLSR
MCCV selects 2 predictor variables (log(area) and log(Itc_100)). The obtained models along
with the summary statistics are provided in Table 34.
Table 33 MSEP for ARI = 100 years

        MSEP on calibration set      MSEP on test set
        LOO     MCCV    CMCCV        Model by LOO   Model by MCCV
OLSR    0.069   0.074   0.070        0.12           0.096
GLSR    0.045   0.060   0.055        0.090          0.083
Table 34 OLSR and GLSR analysis for LOO and MCCV for Q100, optimal models shown
along with summary statistics

Regression type/   Regression equation                                                       R2 / R2GLSR   SEP (%)
validation
OLSR LOO           2.97 + 1.07log(area) + 2.07log(Itc_100) - 0.88log(rain) - 0.15log(S1085)  71%           30%
GLSR LOO           3.01 + 1.04log(area) + 1.84log(Itc_100) - 0.59log(rain)                   70%           26%
OLSR MCCV          2.96 + 1.09log(area) + 2.02log(Itc_100) - 0.70log(rain)                   70%           32%
GLSR MCCV          3.02 + 1.02log(area) + 1.59log(Itc_100)                                   69%           25%
From the comparison of all the regression equations in Table 34 it is evident that the
performances of these models are very similar, i.e. they all have SEP values within
similar ranges; however, the GLSR SEP values are slightly better. In terms of R2 and R2GLSR,
it can be seen that the OLSR LOO has slightly higher values. What can also be observed is
the number of predictor variables in the OLSR LOO model. Referring to Table 25, it can be
seen that log(Itc_100) and log(rain) are moderately correlated; this may therefore introduce
the problem of over fitting. This result is similar to the result found in the simulation study
where the OLSR LOO tended to include more predictor variables for the true model (see
Tables 26 and 27). Therefore, in the case of prediction ability, the conclusion that OLSR
LOO is the best model due to a higher R2 may be deceptive. In order to confirm this all the
regression equations in Table 34 were finally used to make predictions on the validation
data set of 36 catchments.
Figure 36 shows the graphical results from this validation. It is observed that the prediction
performances of the OLSR MCCV, GLSR LOO and GLSR MCCV are all slightly better
than that of the OLSR LOO; in fact, the GLSR MCCV is the best performer even though it has
only 2 predictor variables and a slightly smaller R2GLSR. The fact that the GLSR has the
smaller prediction errors, and in turn the lower MSEPs (i.e. predictive uncertainties, see
Table 33), actually reduces the need to have more predictor variables in the model. This is
in line with the simulation results, where it was found that the GLSR tended to pick the true
model more frequently than the OLSR LOO and OLSR MCCV, and more so when the MSEPs
were relatively smaller. From Table 33, the MSEP value for the OLSR LOO is 0.12, which
is notably larger than those for the OLSR MCCV (0.096), GLSR LOO (0.090) and GLSR MCCV
(0.083).
Figure 36 Prediction error plot for Q100 results (models selected by OLSR and GLSR LOO and models
selected by OLSR and GLSR MCCV); prediction error plotted against site number for the 36 validation catchments
Clearly these results indicate that there might be problems with the OLSR LOO model,
which has four predictor variables. The fitting performance appears better at first glance;
however, the extra predictor variables are unnecessary for the model and can reduce its
prediction ability; in other words, additional uncertainty is introduced into the model by
over-fitting. An
important fact is that, when estimating the prediction ability of the model, the optimal
OLSR LOO and GLSR LOO on the calibration data set seem to both underestimate the
MSEP on the validation data set. This is evident in Table 33 where the MSEP values of the
optimal OLSR LOO and GLSR LOO for the calibration data set are smaller than that of the
validation data set (0.069 < 0.12 and 0.045 < 0.090 for the OLSR LOO and GLSR LOO,
respectively). For the OLSR MCCV and GLSR MCCV, the MSEP of the optimal MCCVs
on the calibration data set are 0.074 and 0.060 respectively, which are also greater than the
MSEPs by the OLSR LOO and GLSR LOO, respectively. This supports the notion that the
MCCV most often would report a better or more accurate estimate of MSEP for the
selected model as compared to the LOO approach. The results in Tables 33 and 34 are
mostly in agreement with the simulation results.
6.3 SUMMARY
Selection of the right regression model and estimation of its predictive ability are important
steps in regional hydrologic regression analysis, which are usually undertaken by some
kind of validation. This study assesses the performances of the most commonly adopted
LOO validation with the relatively new MCCV procedure. This analysis is carried out
under the frameworks of OLSR and GLSR for the estimation of flood quantiles. This study
uses a simulated data set and observed regional flood data set from the state of New South
Wales in Australia.
It has been found that when developing regional hydrologic regression models, application
of the GLSR MCCV is likely to result in a more parsimonious model than the OLSR LOO,
OLSR MCCV and GLSR LOO cases. The GLSR MCCV has been found to show the
smallest mean squared errors and fewer instances of problems with collinearity as
compared to the OLSR LOO and OLSR MCCV cases. It has also been found that the
MCCV and corrected Monte Carlo cross validation (CMCCV) can provide a more
reasonable estimate of a model’s predictive ability than the LOO. Furthermore, the
CMCCV has the potential to offer reasonable improvement over the MCCV in estimating
the predictive ability of a regional hydrologic regression model.
The findings of this study have major implications for the usual practice in hydrologic
regression analysis of estimating regression coefficients using automated statistical
packages, which rely solely on the statistical significance of the regression coefficients
in selecting an appropriate regression model. While in some cases the selected models,
developed using statistical packages, seem to perform well, they may not perform equally
well when applied to a real ungauged catchment, as these models have not been
extensively validated using a more powerful model validation technique such as MCCV.
CHAPTER 7
177
CHAPTER 7: BACKGROUND AND DEVELOPMENT OF THE
LARGE FLOOD REGIONALISATION MODEL AND ISSUES
RELATING TO SPATIAL DEPENDENCE
7.1 GENERAL
Firstly, this chapter provides an overview of inter-site dependence in annual maximum
flood series (AMFS) data for Australia. Secondly, the determination of homogenous
regions and the identification of an appropriate probability distribution are discussed in
some detail. A brief outline of the formulation of the heterogeneity measure by Hosking
and Wallis (1993) and the bootstrap Anderson-Darling (AD) test is then given. The
development and calibration of the large flood regionalisation model (LFRM) under an
assumption of spatial independence are then presented and discussed. The issues relating to
concurrent record lengths for the establishment of meaningful networks for the analysis of
spatial dependence are also presented. The theoretical aspects of inter-site dependence and the
estimation of the number of independent sites (Ne) in regional flood frequency analysis
(RFFA) using a simple model based on the generalised extreme value methodology are also
discussed. Finally, given the limitations of the real data set to give clearly meaningful
results in relation to the derivation of Ne because of issues with sampling variability and
homogeneity, this chapter discusses how synthetic datasets were generated for each of the
regions for use in the analysis.
7.1.1 PUBLICATIONS
A journal paper (ERA, rank B) has been published (details below and full paper in
Appendix A) regarding the initial pilot study undertaken on the LFRM for the states of
New South Wales (NSW) and Victoria (VIC). The work presented in this chapter and
Chapter 8 is an extension to the work presented in the published paper which is based on
the data from all over Australia and a new spatial dependence model for the AMFS data.
Haddad, K., Rahman, A. and Weinmann, P.E. (2011b). Estimation of major floods:
applicability of a simple probabilistic model, Australian Journal of Water Resources, 14
(2), 117-126.
7.2 LFRM CONCEPT
The LFRM concept is identical to the basic concept of station-year methods: observed data
from an assumed homogenous region are pooled and a non-parametric flood frequency
curve is fitted on a probability plot. The homogeneity assumption for the LFRM concept is
very similar to that used in the index flood approaches; however, the traditional index
flood approach achieves an acceptable degree of homogeneity within the region by
standardising by the at-site mean or median values. In the same spirit, the LFRM
standardises by taking into account not only the at-site mean but also the at-site CV of
the time series data. This form of standardisation allows the
section 2.8.1 for more details). Indeed, it is well known that any station-year method
suffers from problems associated with inter-site dependence (see section 2.2.3). These
issues have been minimised in the LFRM by using an effective number of independent
stations concept, similar to CRC-FORGE (Nandakumar et al., 1997, 2000), as described in
sections 7.8 and 7.9.
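The pooling step of the LFRM can be sketched as follows. The exact standardisation is given by the thesis equations; the form z = (q/mean - 1)/CV below is one common mean-and-CV standardisation and should be read as a hypothetical stand-in for it:

```python
import numpy as np

def lfrm_standardise(q):
    """Standardise an annual maximum flood series by its at-site mean
    and coefficient of variation (CV). The transformation below,
    z = (q/mean - 1)/CV, is an illustrative mean-and-CV scaling."""
    q = np.asarray(q, dtype=float)
    mean = q.mean()
    cv = q.std(ddof=1) / mean
    return (q / mean - 1.0) / cv

def lfrm_data_series(stations, top=5):
    """Pool the `top` largest standardised maxima from each station to
    form the 'LFRM data series' (the LFRM uses up to rank 5 data)."""
    pooled = []
    for q in stations:
        z = np.sort(lfrm_standardise(q))[-top:]  # top standardised maxima
        pooled.extend(z)
    return np.sort(pooled)[::-1]                 # ranked largest first
```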
7.3 INTER-SITE DEPENDENCE IN GENERAL FOR THE LFRM
The LFRM technique presented by Majone et al. (2007) (called the Probabilistic Model) and
its further enhanced version by Haddad et al. (2011b) ignore the inter-site dependence
structure of the pooled standardised data, where the highest data point from each station’s
annual maximum flood series (after standardisation) is combined with those from the other
stations in the region to form a database referred to as ‘LFRM data series’. It was assumed
that the individual values in the LFRM data series are independent. This assumption may
be valid if the data being pooled come from stations that are spread over a very large
region. However, examination shows (Figure 37) that values in the LFRM data series used
in this study tend to cluster in some years, with very few events in other years. This appears
to violate the assumption of independent distribution of the events in time and indicates
that some of the events occurring in the same year might have resulted from the same
hydro-meteorological event. For example, because Australia is quite large, the same
meteorological system may cause floods in different parts of the country up to a few weeks
apart, and such events cannot be treated as independent. However, if the starts of two events
are separated by a sufficient period they may be treated as independent; a separation of at
least one month may safely be taken as a criterion for meteorological independence.
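The one-month separation criterion can be applied mechanically by scanning the event start dates, as in the sketch below (the 30-day threshold and the retain-first rule are illustrative choices; in practice one might retain the largest event of a cluster instead):

```python
from datetime import date, timedelta

def independent_events(starts, min_gap_days=30):
    """Keep only events whose start dates are at least `min_gap_days`
    apart; within any cluster of closer events, the earliest is kept."""
    kept = []
    for start in sorted(starts):
        if not kept or (start - kept[-1]) >= timedelta(days=min_gap_days):
            kept.append(start)
    return kept
```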
Significant inter-site dependence between events in the pooled series of annual maxima
used in RFFA will result in the effective size of the sample being over-estimated, and the
annual exceedance probabilities of given flood magnitude being underestimated. The
testing of the LFRM by Haddad et al. (2011b) has demonstrated that if the Australian
LFRM data series is assumed to be independent, the LFRM tends to underestimate the at-
site flood frequency estimates. It was shown by Haddad et al. (2011b) that 17 out of the 18
test catchments gave an underestimation by 7% to 40%. This result clearly indicates that
the issue of inter-site dependence needs to be addressed for successful application of the
LFRM in Australia. It should be mentioned here that in estimating the inter-site correlation,
the concurrent record lengths were considered i.e. the start and end years were the same for
a pair of stations.
The dependence structure among the concurrent AMFS data of all the possible pairs of
sites, irrespective of their ranks, (these data have been prepared as a part of ARR Project 5)
was examined and it was found that the cross-correlation coefficients are quite high for the
nearby pairs of sites. An example is shown in Figure 38 where two nearby VIC stations
(Stations 221201 and 221207) show a dependence structure (i.e. cross-correlation
coefficient of 0.96). The correlation vs. distance between pairs of stations in VIC is shown
in Figure 39, which indicates that the AMFS data have cross-correlation close to 1 for some
nearby stations, but cross-correlation reduces with distance sharply. Also, high correlation
is a dominant issue only for a limited number of pairs of stations.
Figure 37 Occurrences of the highest floods (standardised discharge Q/mean plotted against year, 1910 to
2010) – data from NSW, QLD, VIC and TAS are combined (only the highest value from each station’s
AMFS data is taken to form the LFRM data series)
Figure 38 Cross-correlation between two nearby Victorian Stations 221201 and 221207 (considering all
concurrent AMFS data over the period of record – only 21 data points are concurrent for this pair of
stations); cross-correlation r = 0.96, fitted line y = 0.7965x - 826.47 (R2 = 0.9234), axes Q (ML/day) at
Station 221201 vs. Q (ML/day) at Station 221207
The cross-correlation between two stations based on all the concurrent AMFS data has limited
relevance to the LFRM, as this model uses only data up to rank 5, i.e. the five highest
flood values from the annual maximum series of each station. Also, the degree of
correlation for rare flood events might not be the same as for relatively frequent events. A
viable approach would be to use average cross-correlation considering all the concurrent
AMFS data from all the possible pairs of stations in the database and develop a spatial
dependence model similar to the CRC-FORGE method (Nandakumar et al., 1997). This
model can then be used to account for the spatial dependence in the LFRM data series in
flood quantile estimation using the LFRM.
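A spatial dependence model of this kind can be built by fitting a simple decay function to the pairwise correlation-distance data (as plotted in Figure 39). The sketch below fits rho(d) = rho0*exp(-d/d0) by linear least squares on the log-correlations; the exponential form and the parameter names are illustrative assumptions, not the CRC-FORGE formulation:

```python
import numpy as np

def fit_correlation_decay(dist_km, corr):
    """Fit rho(d) = rho0 * exp(-d / d0) to pairwise (distance,
    correlation) data by ordinary least squares on log-correlation.
    Pairs with non-positive correlation are dropped before the log."""
    d = np.asarray(dist_km, dtype=float)
    r = np.asarray(corr, dtype=float)
    ok = r > 0
    A = np.column_stack([np.ones(ok.sum()), d[ok]])       # [1, d] design
    intercept, slope = np.linalg.lstsq(A, np.log(r[ok]), rcond=None)[0]
    return float(np.exp(intercept)), float(-1.0 / slope)  # rho0, d0
```

The fitted rho(d) can then be evaluated at the distance of any pair of stations to weight down the effective information content of nearby, highly correlated sites.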
Another approach might be to examine the start dates of the individual events which
contain the annual maxima for all the sites plotted against the same year (e.g. as in Figure
37); if the starts of the events are a few months apart from each other they may be treated
as independent. If they have resulted from the events which have occurred on the same day
or week, only one data point from these can be retained to establish an independent series.
Here, if the stations are far apart (e.g. one station from VIC and another from Queensland
(QLD)), they may be treated as independent even though they plot against the same year, as
they are most likely to have resulted from different hydro-meteorological events. This
approach requires the examination of the distances between pairs of stations and the start
and end dates of the individual events. While it is quite possible to do this, it would
require considerably more effort (extra programming) and time.
Any significant degree of dependence between the events in a regional sample reduces the
effective sample size drastically, so the most productive approach might be to establish
essentially independent networks of stations (perhaps by using the concept of de-
correlation distance as an indicator) and then only pool the maxima from such a network of
stations. Some form of constrained random sampling will need to be used to establish a
number of alternative networks of independent stations (see sections 7.8 and 7.9 for further
details).
Figure 39 Relationship between the cross-correlations among AMFS data and distance between pairs
of stations in Victoria
7.4 ANNUAL MAXIMUM DATA SET USED IN THE LFRM
As mentioned in Chapter 4, 682 gauging stations are available in Australia that have
reasonable record lengths (19 to 96 years) and are suitable for RFFA analysis. One does
expect that the useful information for RFFA increases with the increasing number of
stations in the region; however, the net information does not increase proportionally with
the increasing number of stations within a given region, due to spatial dependence between
data at gauging stations. While the shorter record lengths in this study (< 25 years) would
introduce notable uncertainty in parameter estimation, they were included as they still
contain useful additional information for the pooled data set. However, uncertainty will be
introduced through errors in the standardisation of the parameters. From the 682 stations
shown in Figure 16, two datasets for the LFRM were established: (a) From the 682 stations,
626 stations were selected that had a reasonable concurrent record length. (b) From the
remaining 56 stations, 28 stations were randomly sampled and put aside for testing and
validation with the LFRM. The selected 28 sites are shown in Figure 40.
(Figure 39 plot details, VIC: inter-site correlation and estimated correlation plotted against distance
between stations i and j in km; fitted line: correlation = 0.98 (dij/(0.009*dij + 1)))
Figure 40 Geographical distribution of the 28 validation catchments for the LFRM
7.4.1 QUALITY CHECK OF THE LARGEST ANNUAL MAXIMA DATA
Any RFFA involves processing a large amount of data; hence there is a greater chance of
data errors going unnoticed. It must also be remembered that AMFS data carry large errors
in the highest recorded flows because of rating curve extrapolation. As discussed in
section 7.3, the LFRM uses the largest 1 to 5
observed maxima values from each station in the region. Therefore, any errors in these
observations can introduce significant error into the LFRM final quantile estimates. As
discussed in sections 4.4.3, 4.4.4 and 4.5.3, a rating ratio concept was introduced and used
to cull stations with significant rating curve error. It should be noted here that the adopted
number of data points (the five largest) to be selected from each station in the LFRM has
no solid theoretical justification; however, it is evident that the number should be large
enough to make use of the information from the highest flood events in the region, and
hence the choice of ‘the five largest’ seems to be acceptable. Detailed sensitivity analysis
would be required to allow the selection of an optimum number of data points.
7.5 IDENTIFICATION OF AN APPROPRIATE PROBABILITY
DISTRIBUTION AND TESTING FOR HOMOGENEITY OF ANNUAL
MAXIMA FLOOD DATA
In this section, the most appropriate flood frequency probability distribution and the
homogeneity for the Australian data set are examined in the context of the application of
the LFRM technique.
7.5.1 SEARCHING FOR AN APPROPRIATE PROBABILITY DISTRIBUTION
As shown by Majone et al. (2007) and Haddad et al. (2011b), the LFRM concept is
primarily non-parametric and therefore an assumption regarding a particular distribution is
not required. However, it will be shown in section 7.9 that a probability distribution is
fitted to the annual maxima in order to derive a generic relationship for the effective
number of stations (Ne, which is used to adjust the plotting position of the LFRM points).
It can be clearly seen in the literature that the generalised extreme value (GEV) has been
widely used and recommended to describe RFFA extreme data (e.g. see section 2.8.1). The
GEV distribution fitted using the regional L-moment approach has been shown to be
computationally simple. The L-moments are analogous to the conventional moments;
however, they have several theoretical advantages, e.g. being able to model a wider range
of distributions and when estimated from a sample they tend to be more robust to the
presence of outliers in the dataset (Hosking and Wallis, 1997).
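Fitting a GEV by L-moments follows the standard probability-weighted moment recipe of Hosking and Wallis (1997); a minimal single-site sketch is:

```python
import numpy as np
from math import gamma, log

def sample_lmoments(x):
    """First three sample L-moments via unbiased probability-weighted
    moments; returns (l1, l2, tau3) = (mean, L-scale, L-skewness)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    l1, l2, l3 = b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0
    return l1, l2, l3 / l2

def gev_from_lmoments(l1, l2, t3):
    """Hosking's approximation for GEV parameters (location xi, scale
    alpha, shape k) from L-moments, with distribution function
    F(x) = exp(-(1 - k*(x - xi)/alpha)**(1/k))."""
    c = 2.0 / (3.0 + t3) - log(2) / log(3)
    k = 7.8590 * c + 2.9554 * c * c
    alpha = l2 * k / ((1 - 2.0 ** (-k)) * gamma(1 + k))
    xi = l1 - alpha * (1 - gamma(1 + k)) / k
    return xi, alpha, k
```

For a regional fit, the at-site L-moment ratios are replaced by record-length-weighted regional averages, as in Hosking and Wallis (1997).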
In the literature there are many techniques to evaluate distributional assumptions (e.g.
Hosking, 1990; Chowdury et al., 1991; Laio et al., 2009; Haddad and Rahman, 2011; see
also section 2.2.4 for more references). By using a range of methods from this literature,
as follows, it was found that the GEV distribution is quite appropriate to approximate the
annual maximum floods in Australia on a state-by-state basis: (i) the L-moment diagram
and L-moment goodness-of-fit test (i.e. ZDIST), (ii) the AD goodness-of-fit Monte Carlo test
(the details relating to these goodness-of-fit tests are provided in Appendix D) and (iii)
frequency plots of the fitted and observed data based on L-moments.
With the ZDIST test a fit is declared adequate if ZDIST is sufficiently close to zero, a
reasonable criterion being |ZDIST| ≤ 1.64. The AD test results are reported as P-values at a
significance level of 5%; hence a value of P > 0.95 suggests that the hypothesis of the
particular distribution being the parent is not supported.
7.5.2 GOODNESS-OF-FIT TEST RESULTS
Figure 41 shows the L-skewness (LSK) vs. L-kurtosis (LKT) plots for the annual maximum
flood data for the states of NSW and QLD (how the average values for the states were
obtained is described in Appendix D). The plots also show the theoretical curves for some
common 2- and 3-parameter distributions: normal (NORM), 2-parameter log-normal (LN),
gamma (GAM), extreme value type 1 (EV1), uniform (UNIF), GEV, Pearson type 3 (P3),
generalised logistic (GLO), generalised Pareto (GPA) and 3-parameter log-normal (LN3).
The L-skewness vs. L-kurtosis plots for the other states can be seen in Appendix C.
From Figure 41 it is evident that the distributions of annual maximum flood series data for
NSW and QLD come from different parent distributions (the regional average LSK vs. LKT
points, shown in red, fall on different theoretical curves, i.e. GPA and P3, respectively). This
difference was seen for all the states where the annual maxima cannot be fully described by
one single distribution. The summarised results are shown in Table 35, which presents a
mixed picture. It can be seen that the L-moment diagrams sometimes provide different
outcomes from those of the ZDIST and AD tests, making it harder to determine a single
outstanding distribution. For example, for TAS, both the L-moment diagram and the ZDIST
test select the P3 distribution; in contrast, the AD test selects the GLO distribution. The
difference in results may be attributed to sampling
variability and the fact that different tests examine different aspects of the goodness-of-fit
of a candidate distribution. However, what does stand out from Table 35 is that the
distributions selected most often across all the tests are the GPA, GEV and P3. To reach a
more informed conclusion, these three distributions were fitted and superimposed on the
standardised data (data from individual sites standardised by the mean, as in the index
method) for each state and then visually inspected. Figure 42 illustrates these plots for WA
and TAS. Sample plots are also shown for NSW and VIC in Appendix C.
Figure 41 L-moment ratio diagrams of annual maximum flood data for NSW and QLD
Table 35 Summary of goodness-of-fit tests for determining parent distribution

        DISTZ statistic                  L-moment   AD test
State   GLO   GEV   LN3   P3    GPA      diagram    GLO   GEV   LN3   P3    GPA
NSW     9.4   7.2   3.9   -1.8  0.21     GPA        1.0   1.0   1.0   0.98  0.78
QLD     12.6  9.8   6.0   -0.4  0.94     P3         1.0   1.0   1.0   0.32  0.92
VIC     10.3  6.3   3.4   -1.5  4.3      P3         1.0   1.0   1.0   0.97  1.0
TAS     6.6   3.74  2.7   0.8   -3.0     P3         0.88  1.0   1.0   1.0   1.0
WA      2.4   0.22  -2.9  -8.2  -6.5     GEV        1.0   1.0   1.0   1.0   1.0
NT      3.0   0.6   -0.9  -3.6  5.5      GEV        1.0   1.0   1.0   1.0   1.0
SA      7.0   4.8   3.4   0.9   -0.9     GPA        1.0   1.0   1.0   0.90  0.89
Based on the visual inspection of Figure 42 and the figures in Appendix C, the GEV and P3 appear to be good candidates to describe the AMFS data for the different Australian states. While all the distributions fit the lower end quite well, the GEV and P3 distributions seem to capture the higher flows much better than the GPA distribution. Given that the LFRM uses the top 5 maxima, it is far better to adopt a distribution that can extrapolate relatively well into the higher flow range without showing too much bias in the extrapolation. Finally, based on the GEV and P3 distributions, the median relative error (MRE = (fitted - observed)/observed, expressed as a percentage) was calculated for each of the states to determine whether the fitted distribution under- or over-estimated the observed values. Table 36 summarises the MRE values for the different states and suggests that the GEV distribution provides the minimum bias compared to the P3 for most of the states. While the differences are not large (e.g. for NSW), the MRE provides some guidance, along with the other results, on choosing a distribution for use with the LFRM. Hence, based on all these results, it can be argued that the GEV distribution can be taken as the best-fit distribution in this application.
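The MRE comparison follows directly from its definition; a small sketch is given below (the quantile values are hypothetical):

```python
import statistics


def median_relative_error(fitted, observed):
    """MRE in percent: median of (fitted - observed) / observed * 100."""
    return statistics.median(
        (f - o) / o * 100.0 for f, o in zip(fitted, observed)
    )
```

A negative MRE indicates that the fitted distribution under-estimates the observed quantiles on balance, as for the P3 fit in TAS and WA in Table 36.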
Table 36 Summary of MRE associated with the GEV and P3 distributions

                      Median relative error (%)
State / Distribution  GEV     P3
NSW                   0       -1
QLD                   -1      0
VIC                   -1      -2
TAS                   0       -10
WA                    0       -8
NT                    -0.3    15
SA                    -1      0
[Figure 42 consists of two panels (WA and TAS) plotting the standardised data against ARI (years, log scale from 1 to 10000), with the observed data and the fitted GEV, GPA and P3 distributions superimposed.]
Figure 42 Visual inspection of distributional fit for GEV, GPA and P3 distributions for WA and TAS
7.6 HOMOGENEITY
7.6.1 HOMOGENEITY TEST OF HOSKING AND WALLIS
In identifying homogenous groups from a large number of sites, a balance needs to be maintained between selecting a reasonably sized group with more information but a lower degree of homogeneity, and a smaller group with less information but a greater degree of homogeneity. The aim of this balancing act is to make the best use of the information available given the trade-off between group size and the degree of homogeneity achieved. It remains the case, however, that a small group showing good homogeneity may not be appropriate for use in a RFFA study such as the LFRM, as small groups may not be able to provide statistically meaningful results.
Also, the heterogeneity measure (H statistic) of Hosking and Wallis (1993) used here (as explained later) to measure homogeneity has a tendency to give a false impression of homogeneity for small regions (further discussion can be found in Hosking and Wallis, 1993). Nonetheless, there do not appear to be any strict rules or guidelines on the minimum number of sites required to define a homogenous group or region. It is worth mentioning that the homogeneity assumption is often used explicitly with the index flood and similar methods. A number of sites forming a homogenous group means that the underlying probability distribution of the standardised flood variable is the same for all the sites, allowing for sampling variability, which implies that the standardised annual maximum flood series for the sites are samples from the same population. Given that this LFRM study is largely based on the station-year method and that the LFRM makes use of the top five maxima from each site in the region, homogeneity may not be a strict prerequisite here. However, having a homogenous region is advantageous, as this would certainly reduce the model error inherent in the regional model and would give more accurate flood estimates applicable to the region of interest. In this section, two homogeneity tests are applied: (i) the heterogeneity measure of Hosking and Wallis (1993) and (ii) the bootstrap AD test. A brief explanation of each of these tests is given below, followed by the results of each method applied to each state of Australia. The details relating to the homogeneity test of Hosking and Wallis (1993) are provided in Appendix D.
7.6.2 THE BOOTSTRAP ANDERSON-DARLING HOMOGENEITY TEST
A test that does not make any assumptions on the parent distribution is the AD rank test
(Scholz and Stephens, 1987). The AD test is the generalisation of the classical Anderson-
Darling goodness-of-fit test (e.g. D’Agostino and Stephens, 1986), and it is used to test the
hypothesis that k independent samples belong to the same population without specifying
their common distribution function. The details relating to the homogeneity test based on
the AD statistic are provided in Appendix D.
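The test can be sketched as follows: compute the k-sample AD statistic on the observed samples, then approximate its null distribution by repeatedly re-partitioning the pooled record (a permutation-style bootstrap). This is a minimal illustration assuming continuous data (no tie correction), not the nsRFA implementation used later in this chapter:

```python
import random


def ad_k_sample(samples):
    """k-sample Anderson-Darling statistic (after Scholz and Stephens, 1987),
    continuous version without the tie correction."""
    pooled = sorted(v for s in samples for v in s)
    n_tot = len(pooled)
    a2 = 0.0
    for s in samples:
        nj = len(s)
        ss = sorted(s)
        idx, inner = 0, 0.0
        for i in range(1, n_tot):  # i = 1 .. N-1
            z = pooled[i - 1]
            while idx < nj and ss[idx] <= z:
                idx += 1  # idx = number of obs in this sample <= Z_(i)
            inner += (n_tot * idx - i * nj) ** 2 / (i * (n_tot - i))
        a2 += inner / nj
    return a2 / n_tot


def bootstrap_ad_probability(samples, n_boot=200, seed=1):
    """Non-exceedance probability of the observed statistic under the
    bootstrap null (random re-partitions of the pooled record).
    Values close to 1.0, as reported in Table 37, indicate heterogeneity."""
    rng = random.Random(seed)
    obs = ad_k_sample(samples)
    pooled = [v for s in samples for v in s]
    sizes = [len(s) for s in samples]
    below = 0
    for _ in range(n_boot):
        rng.shuffle(pooled)
        parts, start = [], 0
        for n in sizes:
            parts.append(pooled[start:start + n])
            start += n
        if ad_k_sample(parts) < obs:
            below += 1
    return below / n_boot
```

Two completely separated samples give a probability near 1.0 (heterogeneous), while samples drawn from one population give a much smaller value.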
7.6.3 TESTING FOR HOMOGENEITY – RESULTS
The method proposed by Hosking and Wallis (1993) and the approach based on the bootstrap AD test (D'Agostino and Stephens, 1986; Laio, 2004), as discussed above, were used to measure the degree of heterogeneity in each Australian state. In applying the procedure, 1000 homogenous regions were simulated (i.e. Nsim = 1000), and the heterogeneity measures H(1), H(2) and H(3) were computed using a FORTRAN program developed by Hosking (1991a). The heterogeneity measure AD was calculated using the nsRFA package in the R statistical software environment.

It was hypothesised that the different states of Australia are separate regions, and the testing was carried out on this particular assumption. The obtained H and AD values are given in Table 37.
Table 37 Summary of heterogeneity measures for the Australia states
Heterogeneity measures
State H(1) H(2) H(3) AD
NSW 14 10 5.7 1.0
QLD 17 13 8.2 1.0
VIC 22 14 7.6 1.0
TAS 26 10.5 4.6 1.0
WA 21 12 5 1.0
NT 9 6.1 4.9 1.0
SA 11 8 3.5 1.0
From Table 37 it can be clearly seen that all the states are "definitely heterogeneous", as all the H statistics are much greater than 2. The AD test supports this result, with all the P-values being 1.0, indicating that homogeneity is not supported at a test significance level of 5%. While there were discordant sites in the analysis, they were not removed: the successful development of the LFRM depends on a large number of sites, and these sites may contain useful information capturing significant regional variability required for the LFRM. One important aspect to keep in mind with the results in Table 37 is that Australian hydrology is quite variable even from state to state, and catchments even in close proximity to each other can have quite different physical, topographical and meteorological features; hence, obtaining homogenous regions is quite difficult. Similar results have been found in previous studies of Australian flood data (e.g. Bates et al., 1998; Rahman et al., 1999; Haddad, 2008; Ishak et al., 2011).
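The H(1) measure underlying Table 37 compares the observed between-site dispersion of the L-CV values with that expected for a homogenous region. A condensed sketch of the final step is given below; in the full Hosking and Wallis (1993) procedure the simulated dispersions come from Nsim = 1000 regions generated from a fitted kappa distribution, which is replaced here by a user-supplied list:

```python
import math
import statistics


def lcv_dispersion(lcvs, record_lengths):
    """V statistic: record-length-weighted standard deviation of the
    at-site L-CV values in the region."""
    w_total = sum(record_lengths)
    w_mean = sum(n * t for n, t in zip(record_lengths, lcvs)) / w_total
    return math.sqrt(
        sum(n * (t - w_mean) ** 2 for n, t in zip(record_lengths, lcvs))
        / w_total
    )


def h_statistic(v_observed, v_simulated):
    """H = (V - mean(Vsim)) / stdev(Vsim); Hosking and Wallis suggest
    H >= 2 indicates a 'definitely heterogeneous' region."""
    return (v_observed - statistics.mean(v_simulated)) / statistics.stdev(v_simulated)
```

With widely scattered at-site L-CVs, the observed V sits far above the simulated values and H exceeds 2, as found for every state in Table 37.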
Unfortunately, the two homogeneity tests referred to above may be of limited relevance to the estimation of large to rare floods. A point should be highlighted to clarify this statement. The tests referred to above are based on the overall fit of a distribution at different sites (sample statistics tests). In contrast to other index flood methods, which use all the available AMFS data, the LFRM concept uses the 5 highest standardised values to derive the regional growth curve. Therefore, the tests used here give little direct information on the homogeneity of the data points used to fit the upper right-hand tail of a regional distribution. For the purpose of this analysis, it was found that there is insufficient evidence to reject the assumption of homogeneity of the largest values in the regional sample. The Lu and Stedinger (1992) test, which compares the 1-in-10 annual exceedance probability quantiles determined from a GEV distribution, may be used to assess homogeneity in the upper tail of the distribution; however, it was not applied in this study.
7.7 DEVELOPMENT OF THE LFRM MODEL FOR AUSTRALIAN FLOOD
DATA
The LFRM model allows for the estimation of large to rare flood quantiles for any site in a
region by exploiting flood data from other gauged sites in the region. The LFRM is based
on the assumption that the standardised maximum values of the annual maximum flood
series from a large number of individual sites in a region can be pooled (after standardising
to allow for the across-sites variations in the mean and CV values of the annual maximum
floods) (Majone et al., 2007). The particular advantage of the LFRM is that, in contrast to
the commonly applied “index flood method”, it does not assume a constant CV across the
sites. This feature, in particular, allows the LFRM to pool data more effectively over a very
large region to allow estimation of large floods. An advantage of the LFRM proposed here
is that it offers an alternative to traditional approaches of large flood estimation methods
based on rainfall runoff models, where time and resource constraints may not permit the development of detailed rainfall-based methods. Moreover, there is no guarantee that rainfall-based methods provide the best possible estimates.
The main focus of the next few sections is to further develop the LFRM by (i) coupling it
with a spatial dependence model that reflects the reduction in the net information available
in regional analysis using spatially dependent data (Nandakumar et al., 2000); (ii) pooling
more data by taking the top 3-5 maximum values in a region; and (iii) combining it with
BGLSR and the region of influence (ROI) approach to develop regional prediction
equations so that the LFRM can be applied to ungauged catchments. Points i, ii and iii are
in essence the main innovations of the LFRM model being presented in this chapter and
Chapter 8 of the thesis.
7.7.1 DEVELOPMENT AND CALIBRATION OF THE LFRM MODEL
The selected Qmax(1, 3 and 5) (i.e. the top 1, 3 and 5 maximum data points from each station's AMFS data, referred to as Qmax) are first standardised by the at-site average (mean) of the AMFS data, and then plotted in the (CV, Qmax/mean) plane. Figure 43 shows such a plot for the study data set, consisting of 626 data points (1 max), 1878 data points (3 max) and 3130 data points (5 max) from 626 sites, which suggests the following relationship:

Qmax/mean = c + a(CV)^b    (7.1)

The coefficients (c, a and b) of Equation 7.1 were estimated by the maximum likelihood approach for each of the plots in Figure 43. The estimated coefficients, along with their R2 values, are provided in Table 38.
Table 38 Coefficients of non-linear interpolation from Figure 43

Max (number of highest data
points from the AMFS)      c    a     b     R2 (%)
1                          1    3.25  1.37  87
3                          1    2.34  1.18  75
5                          1    1.85  1.03  71
The R2 values in Table 38 suggest that the estimated coefficients provide a reasonably good fit to the experimental data; this is most evident when pooling the top 1 AMFS values. When pooling the top 3 and 5 maxima, greater scatter is noticed, as can be seen in Figure 43; this is also reflected in the drop in R2 values. An important question is whether the weaker relationship with CV is compensated for later on by the additional data points available to define the lower end of the distribution. What can be observed from Table 38 is that the exponent b is appreciably greater than unity for 1 max and 3 max (as would be the case for a Gumbel distribution) and decreases notably with the pooling of more data.
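The coefficients in Table 38 were estimated by maximum likelihood; a simpler least-squares stand-in, with c fixed at 1 as in Table 38, linearises Equation 7.1 as log(Qmax/mean - 1) = log a + b log CV and fits an ordinary regression. This is a sketch for illustration, not the estimation code actually used:

```python
import math


def fit_power_law(cv_values, y_values, c=1.0):
    """Fit Q_max/mean = c + a * CV**b by OLS on the log-linearised form.

    Returns (a, b). Assumes every y exceeds c so the logarithm is defined.
    """
    xs = [math.log(v) for v in cv_values]
    ys = [math.log(y - c) for y in y_values]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = math.exp(y_bar - b * x_bar)
    return a, b
```

Data generated exactly from the 1-max coefficients of Table 38 (a = 3.25, b = 1.37) are recovered by this fit.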
[Figure 43 consists of three panels (max of 1, max of 3 and max of 5) plotting Qmax/mean against CV(Q) for CV(Q) ranging from 0 to 2.5, with the fitted non-linear interpolation function superimposed on the scatter.]
Figure 43 Scatter of Qmax/mean data in the (CV(Q), Qmax/mean) plane and non linear interpolation
function
Based on Figure 43, and assuming that a large part of the scatter can be explained by variations in the average recurrence interval (ARI) of the AMFS data, the best way to model the scatter is to search for a LFRM function of the form:

Qmax/mean = c + f(ARI)(CV)^b    (7.2)

where it is assumed that f(ARI) is a function of the ARI only and is substituted for the coefficient a. From Equation 7.2, the calibration procedure is based on the introduction of a new standardised variable, defined by:
Ymax = (Qmax/mean - c) / (CV)^b    (7.3)

where c and b are the coefficients corresponding to the number of annual maxima pooled (e.g. 1, 3 or 5).
This form of standardisation (Equation 7.3) takes into account not only differences in the mean values but also differences in the CV, raised to the power appropriate for the specific regional data set. As expected, as a result of this new standardisation, Ymax is practically uncorrelated with the coefficient of variation, as confirmed by the very small R2 values in the plots of Figure 44, which refer to the same set of data points using the top 1 and 5 annual maxima.
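In code form, the standardisation of Equation 7.3 is a one-liner; the default coefficients below are the 5-max values from Table 38 and are an assumption for illustration:

```python
def standardise_ymax(q_max, site_mean, site_cv, c=1.0, b=1.03):
    """Equation 7.3: Y_max = (Q_max/mean - c) / CV**b.

    Defaults c = 1 and b = 1.03 correspond to the 5-max row of Table 38.
    """
    return (q_max / site_mean - c) / site_cv ** b
```

Applying this to every pooled maximum removes the dependence on CV seen in Figure 43, leaving the near-zero slopes of Figure 44.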
[Figure 44 consists of two panels plotting Ymax against CV(Q). The fitted linear interpolation functions are: 1 max: Ymax = -0.037CV(Q) + 3.28 (R2 = 0.0003); 5 max: Ymax = 0.161CV(Q) + 1.67 (R2 = 0.0037).]
Figure 44 Scattering of Ymax data in the (CV(Q), Ymax) plane and linear interpolation function for the
pooling of 1 (1 max) and 5 (5 max) top maxima
The following plotting position formula (Equations 7.4, 7.5 and 7.6), proposed by Majone and Tomirotti (2004), was applied to estimate the ARI, or the empirical non-exceedance frequency, of each of the Ymax values in the pooled data sets (i.e. max of 1, 3 and 5) from the N = 626 sites.

In order to define the form of the distribution of the variable Ymax, the top 1, 3 and 5 annual maxima values of each site's data were used. The major assumption made here is that the i-th value of the series is independent of the others and that the normalised values obtained by applying Equation 7.3 (after standardising by the mean and coefficient of variation) belong to the same population. Hence the plotting position of Ymax can be provided by the following empirical equation (Majone and Tomirotti, 2004):

P(Ymax ≤ y) = [P(Y ≤ y)]^na    (7.4)

where Equation 7.4 gives the probability that the maximum Y in a set is at most y as the probability of an individual observation being at most y, raised to the power of the number of observations in the set, and na denotes the average sample size of all the at-site AMFS data utilised in the analysis (na ≈ 34 for this study).
Now, sorting the pooled normalised Ymax values (N = 626, 1878 or 3130 data points, based on the number of annual maxima pooled) in decreasing order, the value y corresponding to ARI (return period, T years) has the following position (or rank) m in the ordered sample:

m = N[1 - P(Ymax ≤ y)] = N[1 - P(Y ≤ y)^na] = N[1 - (1 - 1/T)^na]    (7.5)

For easier interpretation in terms of ARI, this can be rewritten as:

ARI = 1 / (1 - (1 - m/N)^(1/na))    (7.6)

where m is the rank of the observation in the pooled N, 3N or 5N Ymax data (i.e. 626, 1878 or 3130 data points), na is the average sample size and N the number of sites (assumed to be independent in terms of maximum observed floods). From this definition, the estimated ARI values may ideally be assumed to be representative of actual return periods. However, this may not be the case for the Australian flood data set, as many of the gauging sites are very close together spatially and temporally (see Figures 16 and 37), and hence there would be significant inter-site dependence within the observed AMFS data. Sections 7.8 and 7.9 look at this issue in more detail with the development of a spatial dependence model to correct for the effective number of sites (Ne), which is currently assumed to equal N in Equations 7.4, 7.5 and 7.6.
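Equations 7.5 and 7.6 are exact inverses of each other, which can be checked numerically; the sketch below uses this study's N = 626 and na ≈ 34:

```python
def rank_from_ari(ari, n_sites, n_avg):
    """Equation 7.5: m = N * (1 - (1 - 1/ARI)**na)."""
    return n_sites * (1.0 - (1.0 - 1.0 / ari) ** n_avg)


def ari_from_rank(m, n_sites, n_avg):
    """Equation 7.6: ARI = 1 / (1 - (1 - m/N)**(1/na))."""
    return 1.0 / (1.0 - (1.0 - m / n_sites) ** (1.0 / n_avg))
```

With N = 626 and na = 34, the largest pooled value (m = 1) plots at an ARI of roughly 21,000 years, which is why the spatial dependence correction of Sections 7.8 and 7.9 matters: treating dependent stations as independent overstates these ARIs.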
The plot of Ymax vs. YT, where YT is the Gumbel reduced variate used as a surrogate for ARI (YT = -ln[-ln(1 - 1/T)]; a table of Gumbel variate values corresponding to ARIs is given in Appendix D), is shown in Figure 45 for the N, 3N and 5N data sets. The plots for the 3N and 5N data sets in Figure 45 are in line with what would be expected from using the additional data points. Clearly, using a greater number of maxima, e.g. 5 maxima, provides a very smooth empirical distribution that is fitted closely by the distribution function. These plots also reveal that the experimental data can be approximated by a second degree polynomial function of YT, as given by Equation 7.7, whose model coefficients and R2 values can be seen in Table 39 for the different poolings of the annual maxima (i.e. top 1, 3 and 5 maxima):
Ymax = C1(YT)^2 + C2(YT) + C3    (7.7)

which in terms of Qmax/mean takes the following form:

Qmax/mean = c + (C1(YT)^2 + C2(YT) + C3)(CV)^b    (7.8)
Equations 7.7 and 7.8 yield the analytical expression of the LFRM model for the study data
set using the top 1, 3 and 5 annual maxima, where the appropriate values of the coefficients
in Table 39 are substituted into Equations 7.7 and 7.8. However, this formulation does not
allow for the effect of the inter-site dependence which in essence reduces the net
information available in any regional analysis (Nandakumar et al. 1997 and 2000). This can
be accounted for through the use of a spatial dependence model. The basic theory of inter-
site dependence and determining inter-site dependence are provided in this chapter (the
next few sections) while the development of a general spatial dependence model is
discussed and presented in Chapter 8.
Table 39 Coefficients and R2 values of the Ymax polynomial interpolation from Figure 45

Number of maxima
pooled (Ymax)    C1      C2    C3     R2
1                -0.027  0.80  0.49   0.997
3                -0.041  0.98  -0.18  0.998
5                -0.044  1.07  -0.59  0.999
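Equations 7.7 and 7.8 with Table 39 give a closed-form quantile estimator. The sketch below assumes the 5-max coefficients (C1, C2, C3 from Table 39; c = 1 and b = 1.03 from Table 38) and does not yet include the spatial dependence correction:

```python
import math


def gumbel_reduced_variate(ari):
    """Y_T = -ln(-ln(1 - 1/T))."""
    return -math.log(-math.log(1.0 - 1.0 / ari))


def lfrm_quantile(ari, site_mean, site_cv,
                  coeffs=(-0.044, 1.07, -0.59), c=1.0, b=1.03):
    """Equation 7.8: Q_max = mean * (c + (C1*YT^2 + C2*YT + C3) * CV**b)."""
    c1, c2, c3 = coeffs
    yt = gumbel_reduced_variate(ari)
    return site_mean * (c + (c1 * yt ** 2 + c2 * yt + c3) * site_cv ** b)
```

For example, a hypothetical site with mean 100 m3/s and CV = 1 gives a 100-year estimate of about 440 m3/s under these coefficients.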
Figure 45 Frequency distribution of the standardised Ymax values
7.8 EFFECTS OF INTER-SITE DEPENDENCE ON THE LFRM MODEL
This section presents the effects of inter-site dependence on RFFA in general; however, the
major aim is to develop a spatial dependence model to be used in application with the
LFRM concept being applied in Chapter 8 of the thesis.
As stated in sections 7.3 and 7.7.1, spatial dependence in AMFS data reduces the net
information available in any RFFA data set. Accordingly, the presence of spatial
dependence results in biased quantile estimates, because of the reduced number of
independent stations when the effects of inter-site dependence are considered.
This section begins with a brief introduction of the effective number of independent
stations concept (Ne). The estimation methods of Ne are then described. Finally, models for
the estimation of Ne are developed. The application of Ne with the LFRM using a
comprehensive Australian AMFS dataset is provided in Chapter 8.
7.8.1 EFFECTIVE NUMBER OF INDEPENDENT STATIONS
The effective number of independent stations concept was introduced to quantify the effects of inter-site dependence (also called spatial correlation) on regional estimates of flood frequency distribution parameters. The value of Ne depends on which specific distributional parameter is being estimated. The estimation of Ne is usually based on two broad approaches: (i) methods that use some form of regional average parameters; and (ii) methods that pool annual maxima data. In the following sections, approach (ii) is discussed in more detail; further information on approach (i) can be found in Alexander (1954), Stedinger (1983), Hosking and Wallis (1988) and Nandakumar et al. (1997).
In the RFFA approaches which consider pooling the standardised AMFS data from several sites, time sampling is assumed to be substituted by space sampling. If the spatial data were independent, each maximum value in the pooled data set could be assigned a plotting position computed from the aggregated period of record (the total record length L = N·na). This is often referred to as the "station-year method". However, the effective record length (Le) is invariably smaller than the total number of AMFS data points in the pooled database because of the presence of spatial correlation.
The effective record length of the pooled data set determines the position of the observed
annual maxima on a probability plot i.e. the associated frequency/ARI. Thus, the effective
number of stations for this approach can be defined such that Ne independent stations
should provide the same record length as N spatially dependent stations. Thus, Ne is defined as the ratio of Le and the average record length over all the stations (na):

Ne = Le / na    (7.9)
As Ne determines the position of a data point (in the pooled annual maxima data set) on a
probability plot, any error in this measure of spatial dependence in the AMFS data would
introduce a bias in the final flood quantile estimates.
7.8.2 REGIONAL MAXIMUM FLOOD AT A NETWORK OF SITES - REGIONAL
MAXIMUM AND TYPICAL CURVES
This section begins with the analysis of the AMFS data observed at one or more networks
of sites. A network corresponds to gauged sites; however, in application, the network can
be any group of sites for which a large flood estimate is sought.
Let us visualise a hypothetical network consisting of four sites, all with the same period of
records (i.e. satisfying the concurrent record length criterion). The maximum flood data
points from each of these four sites can be pooled to form a series of the “maximum of 4”.
The data points should be standardised (e.g. by Equation 7.3) before the largest values are
picked so as to give each of the sites an equal chance of providing the maximum values to
the new maxima series.
Once the annual maxima series is constructed, statistical techniques such as L-moments are
used to fit a “regional maximum of 4” flood growth curve. Dales and Reed (1989) state “it
is difficult to devise a terminology that encapsulates the general meaning of these growth
curves without being clumsy”. The typical curve is an average standardised point flood
growth curve for a particular geographical region, which is produced by averaging the
parameters of the distributions fitted to individual sites. The regional maximum curve is a
standardised flood growth curve associated with the maximum flood experienced at a
network of N sites located within a geographical region. The term “regional maximum” is
used to highlight that the ‘maximum data series used here’ is over space rather than time
(see section 7.8.1 for more details) and it can also be thought of as “network maximum”.
However, as will be shown in the later sections it is of interest to consider generalised
networks of sites within a given geographical region and it was for this reason that the
terminology “regional maximum” was finally adopted by Dales and Reed (1989). Further
information can be read in Dales and Reed (1989).
7.8.3 FACTORS INFLUENCING THE REGIONAL MAXIMUM
The regional maximum growth curve as defined above for N sites (N > 1) is expected to lie above the typical regional growth curve. An exception arises when sites are closely grouped together, as is the case when there is perfect correlation between the annual maxima of the individual sites. The position of the regional maximum growth curve in relation to the typical growth curve is influenced by the number of sites in the region in question, the scattering of the sites, the system inputs (e.g. rainfall, baseflow, evaporation and other meteorological factors) and outputs, and the physical nature of the catchments in the region. Many classifications can be used to gain an understanding of the regional maximum curve in relation to the typical curve; in this study the major influences are indexed by the number of sites N, the region being analysed and the average correlation coefficient between the sites in a region or network.
7.8.4 NUMBER OF SITES, N
For a given network within a region, the magnitude of the regional maximum growth curve would clearly depend on the number of sites, N, from which it is drawn. For example, 8 sites in a network of the Australian AMFS dataset capture a reasonable average concurrent record length (i.e. 18 years), as shown in Figure 46. As the network size N increases, e.g. to N = 32, the average concurrent record length decreases, so the required site-to-site variations in the network are not picked up, which makes such a network unsuitable for deriving a regional maximum growth curve. The "regional maximum of 8" growth curve would therefore lie above the "regional maximum of 4" growth curve for a given network within a specific region. Hence, the maximum network sizes used in this study are taken to be N = 2, 4 and 8 to define the regional maximum and typical growth curves. It should be noted that the above comments are relevant to the proposed methodology for deriving the regional maximum and typical curves. For application of the LFRM there
is no need for the series at different sites to be concurrent, as long as the assumption of
stationarity is satisfied.
7.8.5 CROSS CORRELATION
The position of the regional maximum growth curve in relation to the typical growth curve
is governed by the degree of cross correlation between the individual site’s AMFS data.
This cross correlation may be highly variable for different paired sites/gauges, and hence
dependence between sites can be seen in terms of an inter-site correlation-distance
relationship. Figure 39 shows this sort of relationship for 131 gauging sites in the state of
VIC. It can be seen that there are significant correlations even at greater distances for the
VIC data, which implies that there is indeed notable spatial dependence present. This was
observed for all the states of Australia.
While correlation is a useful index for measuring dependence between the AMFS data at two sites, it is also a relatively useful measure of dependence when looking at a group of N sites. It seems logical that correlation as a measure of dependence needs to be developed for a particular region, or a network of sites within a region. In this study, the mean value is selected as the representative correlation for a region.
definition of a typical region for analysis in this context is given below in section 7.8.7.
[Figure 46 is a bar chart of average concurrent record length (years) against network size: 29 years for N = 2, 23 for N = 4, 18 for N = 8, 16 for N = 16 and 11 for N = 32.]
Figure 46 Average concurrent record lengths for different network sizes
7.8.6 DEFINITION OF A REGION FOR ANALYSIS
Consider a region in which N = 2, 4 or 8 site/gauge networks could be picked. Obviously
there are many ways in which these networks could be selected. For this analysis, an
extensive experiment is required to establish a measure of the typical degree of dependence
in networks of size of N = 2, 4 and 8. In the experiment, each state was considered as a
single region, except NSW, VIC and QLD which were combined into one region as the
stations in these states form a contiguous region in geographical space.
7.8.7 METHODS OF SAMPLING REGIONAL MAXIMA
In this analysis it was necessary to adopt a flexible approach to sampling regional flood
maxima for different network sizes within a specified region. Here, three distinct methods
or experiments were adopted: (i) ROI network method, (ii) random ROI network method
and (iii) a totally random network method. It should be noted here that the main aim is to
establish a “regional maximum of N” growth curve which can be associated with a given
network size and region, and which can be considered representative of the flood region
under study. It is assumed in the following explanations that the floods for each gauged site
have been standardised according to Equation 7.3. A brief explanation of the experiments
undertaken is given in the following sections.
The real data (i.e. the Australian AMFS dataset used here) has issues relating to sampling variability and homogeneity; with this in mind, simulated data was also generated and used in the experiments, which provides control over sampling variability and homogeneity in the investigation. More detail about the generated dataset is given in section 7.10.
7.8.8 ROI AND RANDOM ROI NETWORK METHODS
In this case, a focal point (i.e. a streamflow gauging site) is established in a region. Once
this is selected, a network of N gauges is chosen based on the closest N sites (the distance
criteria used for the ROI is based on geographical distance) to the focal point (more detail
about the ROI approach can be seen in section 3.7). Once selected, the regional maxima are
formed for those years for which N gauges have valid annual maxima. The GEV
distribution is fitted to the regional maximum series. This procedure is repeated for every site in the region, yielding a different regional maximum curve each time. A regional average curve was determined for each network in the same way. For the random ROI network method, a focal point is established in the region; once selected, the closest 20 stations to the focal point are pooled and a network of N gauges is selected randomly from the 20 sites. The rest of the steps are as presented for the ROI network approach.
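The ROI selection step can be sketched as follows (the coordinates and site identifiers are hypothetical; as noted above, the study's ROI distance criterion is geographical distance):

```python
import math


def roi_network(focal_xy, gauges, n):
    """Return the IDs of the n gauges nearest the focal point.

    gauges maps a site ID to its (x, y) coordinates.
    """
    def distance(site_id):
        x, y = gauges[site_id]
        return math.hypot(x - focal_xy[0], y - focal_xy[1])

    return sorted(gauges, key=distance)[:n]
```

For the random ROI variant, the 20 nearest sites would first be taken with roi_network(focal, gauges, 20), and a random subset of size N drawn from them.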
7.8.9 THE TOTAL RANDOM NETWORK METHOD
The random method can be considered to be more flexible in that a different set of N sites
can be selected at each iteration for the region under consideration. If not all the sites in an
iteration have valid annual maximum flood data, a further random set of N sites is selected.
Because of the random nature of the method, it is desirable to carry out a number of
repetitions and to average the results, which is what was done in this study.
7.8.10 COMPARING SAMPLING METHODS
The main differences between the sampling approaches are that:

(i) The ROI and random ROI network methods give more information about the variability within a region and are more useful when investigating small networks which are highly correlated. It is also noted that the ROI networks would tend to bias the networks towards high correlation values.

(ii) The 'total random network' method is more likely to make use of longer record lengths; if one of the N sites does not have an annual maximum flood value for the years in question, another set of sites is selected instead. Importantly, the total random network approach averages the results over the region in a more statistically meaningful manner than the ROI network method and is likely to sample over a broader range of correlation values.
In all, 8,292 experiments were carried out on real and simulated datasets. The results
associated with the experiments above are discussed in detail in Chapter 8.
It should be remembered that the above approach is adopted with the aim of providing a reasonable inference on spatial dependence. Spatial dependence between annual maximum floods can be complicated by differing response characteristics of catchments. However, it
is generally accepted that physical differences between catchments become less influential
at higher return periods.
7.9 MEASURES OF Ne – EFFECTIVE NUMBER OF INDEPENDENT
STATIONS
The main objective of this study is to assess the degree of spatial dependence in annual
maximum floods, so that this can be taken into account with the LFRM model. In most
cases, some generalisation must be achieved so that these assessments can be made for
networks and ungauged sites. Generalising spatial dependence using a spatial dependence
model is discussed in Chapter 8. As a precursor to defining a spatial dependence model one
must explore ways in which the “regional maximum and typical growth curves” can be
compared for the flood data used (both real and simulated). Given the high number of
experiments carried out, the use of a summary index by which the regional maximum
curves can be related to their typical curve counterparts is also explained.
Three such indices that may be considered are the epicentrage coefficient (Galea et al., 1983), Buishand's dependence function method (Buishand, 1984) and the effective number of independent stations (Dales and Reed, 1989; Nandakumar et al., 1997, 2001). This study concentrates on the 'effective number of independent stations' concept.
7.9.1 EFFECTIVE NUMBER OF INDEPENDENT STATIONS, Ne
An alternative approach to indexing the position of the regional maximum curve relative to
the typical curve is to examine their horizontal separation on a Gumbel probability plot,
indexing this by an effective number of independent stations (Dales and Reed, 1989), Ne.
Consider the AMFS for N gauges (stations/sites) from a homogeneous region, so that these
are identically distributed as Ft(x). Ft(x) is the distribution function of the typical growth
curve. Thus:
F_t(x) = prob(X_1 ≤ x) = prob(X_2 ≤ x) = … = prob(X_N ≤ x) (7.10)
If there is spatial independence, i.e. if the AMFS data at the N gauges are entirely
independent, the distribution of the regional maximum floods of the N gauges is given
simply by:
F_r(x) = prob(max(X_1, X_2, …, X_N) ≤ x) = [F_t(x)]^N (7.11)
If, however, there is complete dependence (i.e. perfect correlation between the stations' AMFS), the distribution function for the regional maxima would be:

F_r(x) = F_t(x) (7.12)
In real-world problems there will always be partial dependence, and the degree of dependence will vary at different quantiles, x. This is recognised by defining an effective number of independent stations, Ne(x), such that:

F_r(x) = [F_t(x)]^Ne(x) (7.13)
Thus:
Ne(x) = ln F_r(x) / ln F_t(x) (7.14)
and
ln Ne(x) = ln(-ln F_r(x)) - ln(-ln F_t(x)) (7.15)
It is readily seen that ln Ne(x) is the horizontal separation of the regional maximum and typical growth curves on the Gumbel probability scale, as shown in the example plot in Figure 47 and expressed by Equation 7.16, i.e.:

ln Ne(x) = X_t - X_r (7.16)
If the assumption is made that the degree of spatial independence can be no less than total dependence (Ne = 1) and no greater than complete independence (Ne = N), the following is expected:

1 ≤ Ne(x) ≤ N for all x (7.17)
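The relationships in Equations 7.11 to 7.14 can be illustrated numerically. The sketch below is not from the thesis; the Gumbel typical curve and the quantile value are illustrative assumptions. It shows that complete independence recovers Ne = N and complete dependence recovers Ne = 1:

```python
import math

def ne_from_curves(F_r, F_t):
    """Effective number of independent stations (Equation 7.14):
    Ne(x) = ln F_r(x) / ln F_t(x)."""
    return math.log(F_r) / math.log(F_t)

def gumbel_cdf(x, loc=0.0, scale=1.0):
    # An illustrative typical growth curve F_t (Gumbel distribution)
    return math.exp(-math.exp(-(x - loc) / scale))

x = 2.0                       # an arbitrary quantile
N = 4                         # network size
F_t = gumbel_cdf(x)
F_r_indep = F_t ** N          # complete independence (Equation 7.11)
F_r_dep = F_t                 # complete dependence (Equation 7.12)

print(round(ne_from_curves(F_r_indep, F_t), 6))  # 4.0, i.e. Ne = N
print(ne_from_curves(F_r_dep, F_t))              # 1.0, i.e. Ne = 1
```

Partial dependence yields intermediate Fr values and hence 1 < Ne(x) < N, consistent with Equation 7.17.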
Figure 47 Example plot of regional maximum and typical growth curves and the effective number of
independent stations on a Gumbel plot for a random network of 2 and 4 gauging sites in Tasmania
7.9.2 A SIMPLE MODEL FOR Ne
In this study a relatively simple model of spatial dependence was obtained by ignoring the
possible variation of Ne with ARI. Hence the representation of spatial dependence reduces
to fitting a one-parameter model to relate the position of the regional maximum to the
typical growth curve; this single parameter is Ne.
As reported by Dales and Reed (1989), the maximum of Ne independent GEV distributions – where Ne is some constant – is a GEV with the following parameters:

ξ_r = ξ_t + α_t(1 - Ne^(-κ_t))/κ_t (7.18)

α_r = α_t Ne^(-κ_t) (7.19)

κ_r = κ_t (7.20)
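The parameter transformation in Equations 7.18 to 7.20 can be checked numerically: raising the typical-curve GEV distribution function to the power Ne should reproduce the GEV with the transformed parameters. The sketch below uses illustrative parameter values only (assumptions, not the thesis estimates):

```python
import math

def gev_cdf(x, xi, alpha, kappa):
    # GEV in the thesis convention: F(x) = exp(-[1 - kappa*(x - xi)/alpha]**(1/kappa))
    t = 1.0 - kappa * (x - xi) / alpha
    return math.exp(-t ** (1.0 / kappa))

# Illustrative typical-curve parameters (assumed values) and a network of Ne sites
xi_t, alpha_t, kappa_t = -0.5, 0.7, -0.15
Ne = 5.0

# Transformed parameters for the maximum of Ne independent GEV variates (Eqs 7.18-7.20)
xi_r = xi_t + alpha_t * (1.0 - Ne ** -kappa_t) / kappa_t
alpha_r = alpha_t * Ne ** -kappa_t
kappa_r = kappa_t

x = 2.0
lhs = gev_cdf(x, xi_t, alpha_t, kappa_t) ** Ne  # [F_t(x)]^Ne
rhs = gev_cdf(x, xi_r, alpha_r, kappa_r)        # GEV with transformed parameters
print(abs(lhs - rhs) < 1e-12)                   # True: the two agree
```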
Eliminating Ne, and setting κ_r = κ_t, we have:

ξ_r + α_r/κ_r = ξ_t + α_t/κ_t (7.21)

This condition implies that the lower or upper bound of the regional maximum of Ne independent sites coincides with that of the typical growth curve, i.e.

ξ_r + α_r/κ_r = ξ_t + α_t/κ_t = x_bound (7.22)
7.9.3 FITTING Ne BY THE MEAN
Since only one parameter is to be fitted, only the first probability-weighted moment, β_o, is required. This is simply the arithmetic mean of the AMFS data. For a GEV distribution with parameters ξ, α and κ, the theoretical (i.e. population) mean is:

β_o = ξ + α[1 - Γ(1 + κ)]/κ (7.23)
If we apply estimates derived from the regional maximum and typical data, we have:

β_o^r = ξ_r + α_r[1 - Γ(1 + κ)]/κ (7.24)

β_o^t = ξ_t + α_t[1 - Γ(1 + κ)]/κ (7.25)
Hence, applying Equation 7.22 and eliminating the Γ(1 + κ) term, we obtain:

α_r/α_t = (β_o^r - x_bound)/(β_o^t - x_bound) (7.26)
Finally, from Equation 7.26 the following expression is obtained:

Ne = [(β_o^r - x_bound)/(β_o^t - x_bound)]^(-1/κ) (7.27)

By standardisation we have β_o^t = 0, and β_o^r is simply the arithmetic mean of the regional maximum values.
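A minimal sketch of this fitting procedure follows (illustrative GEV parameters assumed, not the thesis values). The regional maximum of Ne independent sites is constructed analytically via Equations 7.18 and 7.19, and Ne is then recovered from the two means and the common bound using Equation 7.27:

```python
import math

def gev_mean(xi, alpha, kappa):
    # Population mean of the thesis-convention GEV (Equation 7.23)
    return xi + alpha * (1.0 - math.gamma(1.0 + kappa)) / kappa

def ne_by_mean(beta_r, beta_t, x_bound, kappa):
    # Equation 7.27
    return ((beta_r - x_bound) / (beta_t - x_bound)) ** (-1.0 / kappa)

# Illustrative typical-curve parameters; regional maximum of Ne = 5 independent sites
xi_t, alpha_t, kappa = -0.5, 0.7, -0.15
Ne_true = 5.0
xi_r = xi_t + alpha_t * (1.0 - Ne_true ** -kappa) / kappa   # Equation 7.18
alpha_r = alpha_t * Ne_true ** -kappa                       # Equation 7.19

x_bound = xi_t + alpha_t / kappa        # common bound (Equation 7.22)
beta_t = gev_mean(xi_t, alpha_t, kappa)
beta_r = gev_mean(xi_r, alpha_r, kappa)
print(round(ne_by_mean(beta_r, beta_t, x_bound, kappa), 6))  # 5.0
```

In practice β_o^t and β_o^r would be sample means of the standardised typical and regional-maximum data rather than population values.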
7.10 SIMULATED DATASETS
As discussed earlier, given the limitations of the real data set to give clearly meaningful
results because of issues with sampling variability and homogeneity, it was decided to
generate synthetic datasets for each of the regions with known population correlation
coefficients. There are two important aspects of this simulation exercise: (i) to compare the
effective number of stations Ne, with those of the real dataset, and to identify any major
differences by having some control over the issues of homogeneity and sampling
variability, and (ii) when deriving a spatial dependence model for practical use the
simulated data will provide insight into identifying a suitable model function (this is
discussed in more detail in Chapter 8).
7.10.1 SYNTHETIC DATA GENERATION
For the generation of AMFS data, it was assumed that:
(i) generated data come from the same population,
(ii) data from different years are independent (i.e. a particular year's data for a given site is not correlated with another year's data at any site) and
(iii) data from the same year at different sites are dependent with a given degree of cross-correlation.
To represent the region’s data, the regional average standardised GEV distribution
parameters (for each state/region of Australia) of the AMFS were used in data generation.
The multi-site maxima were generated according to the following steps.
(i) For a given correlation coefficient, a vector of random multivariate normal deviates, with zero mean and a covariance matrix formed from the constant cross-correlations and the standard deviation of the standardised regional data, is generated using the Matalas (1967) method.
(ii) The normal variates vector is transformed to a GEV distribution with the
regional average standardised GEV distribution parameters of the particular
state or region.
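These two generation steps can be sketched as below. This is a simplified stand-in for the Matalas (1967) procedure, using a constant-correlation covariance matrix and a probability-integral transform; the function names are illustrative, and the TAS parameters are taken from Table 40:

```python
import numpy as np
from math import erf, sqrt

def gev_quantile(p, xi, alpha, kappa):
    # Inverse of F(x) = exp(-[1 - kappa*(x - xi)/alpha]**(1/kappa))
    return xi + alpha * (1.0 - (-np.log(p)) ** kappa) / kappa

def generate_amfs(n_sites, n_years, rho, xi, alpha, kappa, seed=1):
    """Correlated standard normal deviates (constant cross-correlation rho)
    transformed to the regional average standardised GEV distribution."""
    rng = np.random.default_rng(seed)
    cov = np.full((n_sites, n_sites), rho)   # constant cross-correlation
    np.fill_diagonal(cov, 1.0)               # unit variance at each site
    z = rng.multivariate_normal(np.zeros(n_sites), cov, size=n_years)
    u = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))  # standard normal CDF
    return gev_quantile(u, xi, alpha, kappa)

# TAS regional parameters from Table 40, constant correlation 0.5
data = generate_amfs(51, 1000, rho=0.5, xi=-0.574, alpha=0.982, kappa=-0.0073)
print(data.shape)  # (1000, 51): 1000 years at 51 stations
```

As noted below, the transform does not strictly preserve ρ in the GEV domain; the realised average correlation of the generated data is what matters.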
In an effort to counteract sampling variability due to limited record lengths, sequences of annual maximum flood data for a region with 51 stations, each having a record length of 1000 years, were generated. The (constant) correlation coefficient between the AMFS data from different stations was varied from 0.0 to 0.5 in steps of 0.1. Figure 48 gives an example of a generated data set with constant correlation coefficients of 0.0 and 0.5 for the state of Tasmania (TAS). In all, 500 replicates of regional data (each replicate consisting of data for 51 stations) were generated for each constant correlation coefficient.
Figure 48 Example plot of generated data with different constant correlation coefficients for the state
of Tasmania
Table 40 gives the GEV distribution parameters for the parent distributions used in the data
generation and the mean parameters for the generated data for each of the regions. The
parent distribution parameters used to generate the data appear to be reasonably well preserved in the generated data. The correlation coefficients (ρ) were not as well preserved as the parameters, as ρ was not directly introduced in the GEV data generation (correlated standard normal deviates were generated and then transformed to a GEV distribution). In any case, the strict preservation of a particular ρ is not that important for this analysis; the essential requirement is to know the average ρ value, particularly when generalising the spatial dependence model (see Chapter 8). This ρ is then assumed to represent the population correlation coefficient.
Table 40 Comparison of the parameters of the parent distribution and the distribution for the generated data (distribution: F(x) = exp{-[1 - κ(x - ξ)/α]^(1/κ)}) and correlation coefficient, ρ.

Region         ρ Parent  ρ Gen.    ξ Parent  ξ Gen.    α Parent  α Gen.   κ Parent  κ Gen.
NSW+QLD+VIC    0.00      -0.0022   -0.488    -0.493    0.652     0.656    -0.149    -0.156
               0.10      0.086     -0.488    -0.491    0.652     0.655    -0.149    -0.152
               0.20      0.172     -0.488    -0.495    0.652     0.654    -0.149    -0.158
               0.30      0.267     -0.488    -0.492    0.652     0.665    -0.149    -0.151
               0.40      0.357     -0.488    -0.491    0.652     0.663    -0.149    -0.149
               0.50      0.451     -0.488    -0.492    0.652     0.663    -0.149    -0.153
TAS            0.00      0.00023   -0.574    -0.576    0.982     0.988    -0.0073   -0.0067
               0.10      0.094     -0.574    -0.5749   0.982     0.978    -0.0073   -0.0062
               0.20      0.175     -0.574    -0.579    0.982     0.982    -0.0073   -0.004
               0.30      0.287     -0.574    -0.571    0.982     0.979    -0.0073   -0.007
               0.40      0.385     -0.574    -0.562    0.982     0.978    -0.0073   -0.008
               0.50      0.481     -0.574    -0.575    0.982     0.986    -0.0073   -0.006
WA             0.00      0.0003    -0.500    -0.500    0.685     0.682    -0.158    -0.162
               0.10      0.082     -0.500    -0.495    0.685     0.683    -0.158    -0.151
               0.20      0.173     -0.500    -0.494    0.685     0.693    -0.158    -0.159
               0.30      0.264     -0.500    -0.508    0.685     0.689    -0.158    -0.160
               0.40      0.356     -0.500    -0.496    0.685     0.687    -0.158    -0.155
               0.50      0.461     -0.500    -0.510    0.685     0.685    -0.158    -0.160
NT             0.00      0.0089    -0.503    -0.505    0.755     0.748    -0.0831   -0.0836
               0.10      0.089     -0.503    -0.502    0.755     0.751    -0.0831   -0.0833
               0.20      0.169     -0.503    -0.499    0.755     0.759    -0.0831   -0.0841
               0.30      0.283     -0.503    -0.503    0.755     0.748    -0.0831   -0.0827
               0.40      0.375     -0.503    -0.502    0.755     0.752    -0.0831   -0.0867
               0.50      0.454     -0.503    -0.507    0.755     0.763    -0.0831   -0.0819
SA             0.00      0.0009    -0.496    -0.496    0.753     0.750    -0.0762   -0.0761
               0.10      0.083     -0.496    -0.493    0.753     0.755    -0.0762   -0.0762
               0.20      0.189     -0.496    -0.493    0.753     0.751    -0.0762   -0.0751
               0.30      0.253     -0.496    -0.491    0.753     0.753    -0.0762   -0.0776
               0.40      0.355     -0.496    -0.489    0.753     0.753    -0.0762   -0.0762
               0.50      0.484     -0.496    -0.497    0.753     0.753    -0.0762   -0.0752

*Gen. = generated data
7.11 SUMMARY
The main steps in this chapter can be summarised as follows. At the outset of this chapter the LFRM concept was discussed briefly, and the issue of inter-site dependence was introduced and discussed in the light of the application of the LFRM. The chapter also described the comprehensive Australian AMFS dataset and the quality checks undertaken to make the data suitable for use with such an application.
Identifying an appropriate probability distribution is an important step in deriving a general
spatial dependence model. In this chapter, different goodness-of-fit tests were used to
establish a suitable distribution to describe the AMFS, which included the L-moment ratio
diagram, the DISTZ statistic of Hosking and Wallis (1991), the Anderson-Darling (AD)
Monte Carlo simulation test and visual inspections. It was found that the GEV distribution
was the most appropriate to approximate the AMFS data. Testing for homogeneity was also
undertaken using the homogeneity test of Hosking and Wallis (1993) and the Bootstrap AD
test. Both tests showed that strict homogeneity could not be established for any of the
Australian states or for Australia as a whole. In relation to homogeneity, for the purpose of this analysis it was found that there is insufficient evidence to reject the assumption of homogeneity of the largest values in the regional sample.
This chapter then described the development of the LFRM for the Australian dataset
allowing for spatial dependence. The LFRM as outlined in this chapter has successfully
enhanced the method introduced by Majone et al. (2007) by using up to 5 maximum flood
values from each site (rather than just the largest value). The results and derived formulae
were given and discussed in some detail. Given that spatial dependence reduces the net
information for RFFA, the effects of inter-site dependence on the LFRM were discussed in
detail based on the “effective number of stations (Ne) concept”. Methods for pooling
recorded flood data, the issues regarding regional maximum floods at a network of sites,
influencing factors on the regional maxima and cross-correlation were introduced and
discussed in detail. This then provided the motivation to present the theory for the methods
for defining and determining a region for analysis which also included sampling regional
maxima. Furthermore, this chapter also discussed three network sampling methods which
were used in this study; they were the ROI, random ROI and total random networks. The
methodology for estimating Ne based on the GEV distribution was then described as
outlined in Dales and Reed (1989). Finally, given the limitations of the real data set to give
clearly meaningful results in relation to the derivation of Ne because of issues with
sampling variability and homogeneity, it was decided to generate synthetic datasets for
each of the regions for use in the analysis.
CHAPTER 8: APPLICATION OF LFRM IN THE LIGHT OF
SPATIAL DEPENDENCE – RESULTS AND DISCUSSION
8.1 GENERAL
This chapter begins by looking at the detailed results and the typical behaviour of the
number of independent sites (Ne) for both the real and simulated datasets. The chapter then
goes on to describe how a general model for spatial dependence was achieved. A detailed
discussion is also provided on the generalised spatial dependence model.
Finally, the large flood regionalisation model (LFRM) is revisited for the Australian
continent in the light of spatial dependence (i.e. LFRM combined with the developed
spatial dependence model). The LFRM is then coupled with Bayesian generalised least
squares regression (BGLSR - to estimate the mean and coefficient of variation (CV) of the
AMFS data) to estimate large to rare floods for gauged and ungauged catchments. A split-
sample validation is undertaken to compare the results of the LFRM with established
methods such as the parameter regression technique (see Chapters 3 and 5) and
international methods on large floods (i.e. World Model).
8.2 RESULTS FOR Ne
Following the procedures described in sections 7.8.7 to 7.8.10 the different network
methods were used to establish an indication of the typical degree of dependence in
network sizes of N = 2, 4 and 8. This was carried out on the real and simulated datasets
with the main purpose of describing the typical spatial dependence in each region/state
separately.
The Ne values were obtained for these different network sizes by fitting the mean as described in sections 7.9.2 and 7.9.3, and are detailed in Tables 41 and 42 for the real and
simulated datasets for the different networks and regions. It can be seen that the total
random network exhibits less spatial dependence than both the ROI and random ROI
networks. This finding is not surprising as sites that are closer together are more likely to
show more spatial dependence. This can be seen in all the regions when comparing the Ne
values for the different N-sized networks.
Table 41 Experimental values of Ne for different networks and regions using the real data
(average Ne over the experiment reported)
Real data set

ROI & RANDOM ROI networks     Number of gauges (sites), N
Region                        2       4       8
NSW+QLD+VIC                   1.74    3.03    5.36
TAS                           1.60    2.55    4.04
WA                            1.62    2.72    4.43
NT                            1.72    2.89    5.09
SA                            1.50    2.20    3.21

TOTAL RANDOM network          Number of gauges (sites), N
Region                        2       4       8
NSW+QLD+VIC                   1.90    3.66    6.80
TAS                           1.83    3.30    5.87
WA                            1.88    3.59    7.00
NT                            1.81    3.40    6.61
SA                            1.66    2.59    3.93
Importantly, the same features as above can be seen in the simulated data; however, the simulated data show less spatial dependence in the 'total random network' than the real dataset. What is worth noting here, in the case of the simulated data, is that for most of the regions the networks of size 8 show more of a tendency towards independence than the smaller network sizes; this is most evident in the total random network. From Tables 41 and 42 it can be seen that, across all the regions and the different N-sized networks, the spatial dependence in SA and TAS is more severe. This result coincides with these regions being much smaller than the other regions examined here, such that the sites are located in closer proximity to each other. Overall, the results of the simulated datasets are in agreement with the real data, which is pleasing.
Table 42 Experimental values of Ne for different networks and regions using the simulated
data (average Ne over the experiment reported)
Simulated data set

ROI & RANDOM ROI networks     Number of gauges (sites), N
Region                        2       4       8
NSW+QLD+VIC                   1.75    2.89    4.81
TAS                           1.71    2.88    4.75
WA                            1.73    2.94    4.91
NT                            1.73    2.93    4.88
SA                            1.60    2.55    4.18

TOTAL RANDOM network          Number of gauges (sites), N
Region                        2       4       8
NSW+QLD+VIC                   1.93    3.66    6.96
TAS                           1.93    3.71    7.08
WA                            1.94    3.74    7.20
NT                            1.94    3.73    7.18
SA                            1.74    3.01    4.72
8.3 A CLOSER LOOK AT THE BEHAVIOUR OF Ne
Continuing the discussion above, the Ne values were analysed more closely. It was noted throughout the experiments that violations of the constraint Ne ≤ N were a recurring feature, especially for the 2 and 4 gauge networks and less frequently for the 8 gauge networks, for all the regions. This was more noticeable with the real dataset than the simulated data. For the real dataset the worst of the violations occurred with the total random network. Figure 49 provides an example illustration of these violations for the NSW+QLD+VIC region (the results for the other states can be seen in Appendix C). The top three plots show the results associated with the real data, while the bottom three plots illustrate the simulated data. The first 400 experiments depict the results of the ROI and random ROI networks, while the last 400 experiments represent the total random sampling experiments. With the real dataset, it can be clearly seen that there is a distinct change in the pattern of Ne with experiment number, where the ROI and random ROI clearly show
that there is more spatial dependence between sites. Similar results were obtained for the other regions except SA. These results are also supported by the simulation results, where it can be seen that stations with a low average correlation coefficient are usually spatially independent and in some cases violate the Ne ≤ N condition as well. With the simulated datasets this usually occurred when the average correlation was negative. However, it can be seen from Figures 49 and 50 that the violations were less frequent as the network size was increased. Figure 50 provides the histogram of the frequency with which Ne falls in a particular class interval; the results for the real and simulated datasets are provided for the NSW+QLD+VIC region (the results for some of the other states can be seen in Appendix C). Indeed, it can be observed that there are many places where the Ne ≤ N condition is not satisfied. This is more noticeable for the real dataset (top three plots), as a wider range of cross-correlation is experienced compared to a controlled simulation. It was noticed that there was a reasonable number of very low and negative average correlation coefficients in the analysis; this was observed for all the regions analysed. This raises the question of possible negative dependence in the real dataset; while this has not been examined closely here, it would be worthwhile investigating this issue at a later stage. In any case, given the low concurrent record length between sites and the inherent assumptions in the modelling, some of these violations may be attributed to symptomatic limitations in this GEV-based method. It therefore seems possible that the violations are simply due to sampling effects and the fact that the data have been standardised by the mean and CV (Equation 7.3, Chapter 7) when estimating Ne. Dales and Reed (1989) arrived at similar conclusions; however, they standardised the data by the mean only, as per the index flood approach.
Figure 49 Variation of Ne with different network methods and experiment number for
NSW+QLD+VIC region (top panel for real data and bottom panel for simulated data)
Figure 50 Frequency of Ne with different network methods for NSW+QLD+VIC region (top panel for
real data and bottom panel for simulated data)
While a constant Ne model was assumed for use with the LFRM, further investigations
were carried out that looked at the possible variation of Ne with respect to ARI for the same
set of experiments but only focussing on the real dataset. Table 43 summarises the results
for the different sized networks and regions. It can be noticed that for the larger regions
(NSW+QLD+VIC and WA) the degree of spatial dependence is broadly similar and that
spatial independence is reached at relatively low ARIs. The smaller regions, or regions where stations are closely clustered (TAS, SA and, to a lesser extent, NT), show more dependency, with slightly higher ARIs required before independence is reached; this is the case for TAS and NT (see Table 43). However, it can be seen that SA never reaches independence at any ARI, which suggests that these stations are highly cross-correlated. If one looks at the location of the stations in SA (see Figure 16), they are found to be in very close proximity to each other.
Table 43 Experimental results in which Ne exceeds N at a particular ARI for different
regions using the real data set
Real data set

Networks / Region   ARI (years) at which Ne = N
                    N = 2   N = 4   N = 8
NSW+QLD+VIC         5.9     9.9     13.9
TAS                 7       23.2    37.6
WA                  5.7     8.2     9.3
NT                  11.9    15.5    28.4
SA                  *       *       *
*SA never reaches independence
Overall, in analysing the real and simulated data experiments, the available evidence suggests that spatial dependence in the Australian AMFS data reduces with larger regions, networks and ARIs, whereas for the smaller regions spatial dependence is more evident. Hence, it is noted that the overall modelled Ne values may be inherently uncertain (i.e. when applied to estimate large ARIs), which would in turn overestimate the ARI of interest. However, when put into perspective, at present a hydrologist is only able to estimate a large design flood with an associated ARI on the assumption that all the sites in a region are totally independent, which would indeed lead to underestimation of the ARI of interest. Therefore, the analysis undertaken in this study should only be seen as
providing a new framework of risk assessment for large to rare flood estimation rather than
a perfect answer. As such, this approach can be expected to provide reasonably accurate
risk assessments at the higher ARIs, which are of interest in the application of the LFRM
model.
8.4 GENERALISING THE Ne MODEL
Deriving a general model of spatial dependence is not straightforward. Placing too much emphasis on a particular aspect of the experimental results may produce many regional sub-models, which would introduce significant regional variations when the spatial dependence models are applied. In this case a regional approach is still warranted, where a suitable model is used to describe the spatial dependence in each region/state separately and the results are then combined to frame one relationship for use across all of Australia. In this study, a relatively simple model of spatial dependence was obtained by ignoring the possible variation of Ne with ARI.
Regression analysis using unweighted ordinary least squares is used to relate Ne to the
average correlation coefficient (ρ) of concurrent AMFS at pairs of stations for the different
networks and regions for each of the adopted 8,292 experiments (this includes the real data
and simulated data). To derive the regression equation it was determined to be more appropriate to build a general model that relates the ratio lnNe/lnN to the average correlation coefficient (ρ). Dales and Reed (1989) showed that the ratio lnNe/lnN provides a neat index of the degree of spatial independence in annual maximum data, the index ranging between 0 (total dependence) and 1 (total independence). The derived spatial dependence models and the regression analysis are provided below.
8.4.1 CONSTANT Ne MODEL – AN EMPIRICAL RELATIONSHIP FOR Ne
BASED ON AVERAGE CORRELATION COEFFICIENT (ρ)
The form of the constant Ne model is given by Equation 8.1, which was calibrated by combining the models for each of the Australian states into one generic equation. The final form of Equation 8.1 was identified by investigating the real and simulated data sets:

ln(Ne)/ln(N) = a + bρ (8.1)
In all the regions, the one variable model (see Equation 8.1) provided a relatively good fit
to the experimental data. The fitted parameters of the constant Ne model for all the states
individually and Australia (overall) are given in Table 44 for the real and simulated
datasets. The final values shown in Table 44 are the average coefficient values over the networks and experiments for each study region. The final parameter values for the general Australian spatial dependence model were found by combining the different network values of the ratio lnNe/lnN, developing a regression equation of the form of Equation 8.1, and then averaging the coefficient values of the resulting regression equations. Figures 51 to 53 show the typical results for each network in determining the final Australian spatial dependence model with the real dataset. A similar procedure was carried out for the simulated dataset.
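The fitting step can be sketched with unweighted ordinary least squares; the ratio values below are illustrative placeholders, not the thesis experiment results:

```python
import numpy as np

# Hypothetical experiment summaries: average correlation and lnNe/lnN ratios
# (illustrative values only, not the thesis experiment results)
rho = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
ratio = np.array([1.00, 0.94, 0.87, 0.81, 0.75, 0.68])

# Unweighted OLS fit of Equation 8.1: ln(Ne)/ln(N) = a + b*rho
b, a = np.polyfit(rho, ratio, 1)   # polyfit returns slope first for degree 1
print(round(a, 2), round(b, 2))    # intercept near 1, slope near -0.64

def ne_constant_model(N, rho, a=a, b=b):
    # Predicted effective number of independent stations for an N-station network
    return N ** (a + b * rho)
```

With the real experiments, the corresponding all-Australia coefficients were a = 1 and b = -0.66 (Table 44).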
[Figure 51 panels: ln(Ne)/ln(N) versus average correlation coefficient for N = 2, with fitted regression line and 95% CI (s = 0.073, R-Sq = 88.7%), together with residual plots: normal probability plot, residuals versus fits, histogram, residuals versus order]
Figure 51 Regression results of the N = 2 network combining the lnNe/lnN ratio values for all the
Australian states/regions and experiments
[Figure 52 panels: ln(Ne)/ln(N) versus average correlation coefficient for N = 4, with fitted regression line and 95% CI (s = 0.060, R-Sq = 88.7%), together with residual plots: normal probability plot, residuals versus fits, histogram, residuals versus order]
Figure 52 Regression results of the N = 4 network combining the lnNe/lnN ratio values for all the
Australian states/regions and experiments
[Figure 53 panels: ln(Ne)/ln(N) versus average correlation coefficient for N = 8, with fitted regression line and 95% CI (s = 0.068, R-Sq = 83.6%), together with residual plots: normal probability plot, residuals versus fits, histogram, residuals versus order]
Figure 53 Regression results of the N = 8 network combining the lnNe/lnN ratio values for all the
Australian states/regions and experiments
Figures 51 to 53 show that the standard error (s on the graphs) associated with the regression equations is quite modest, which suggests that most of the variability in the lnNe/lnN ratio can be well explained by the average correlation coefficient in a network of sites; this can also be seen in the narrow 95% confidence interval on the prediction. It is also observed that there are some outliers in the analysis, with standardised residuals approaching the ±5 limit. These values were further examined by removing them from the analysis; however, removing them did not provide any further benefit. Hence, the outliers were finally retained in the regression analysis.
The coefficient of determination (R2) values for the final models (see Table 44) fitted to the real and simulated data sets are quite high, suggesting that the use of the constant Ne model should result in improved Ne estimates compared to the values calculated directly from the AMFS data in each station network (this is more apparent, however, for the simulated spatial dependence model). The comparison of the fitted Ne values for the real and simulated data computed using Equation 8.1 with those from the spatial dependence Equation 7.27 (see Chapter 7) is shown in Figure 54. The figure (real data) illustrates that the scatter in the spatial dependence model estimates increases with increasing N. The scatter in the real dataset results may also be attributed to natural and sampling variability from site to site, given that the concurrent record length for analysis was very modest and that strict homogeneity was not established. Further scatter could be attributed to the overall limitation of the GEV methodology used here in estimating Ne. This introduces higher uncertainties in Ne estimates for larger N values, which really just reflects the larger number of data points in the larger networks; this would certainly have a detrimental effect on large flood quantile estimation. Figure 54 and Table 44 show the overall satisfactory performance of Equation 8.1, as the simulated and real dataset results are mostly quite similar.
Table 44 Properties of the constant Ne spatial dependence model

                 Real data              Simulated data
Region/State     a      b      R2 (%)   a      b      R2 (%)
NSW+QLD+VIC      0.99   -0.66  89       1      -0.63  99
TAS              0.98   -0.59  79       1      -0.63  99
WA               0.99   -0.61  83       1      -0.63  99
SA               1.02   -0.75  84       1      -0.63  99
NT               0.99   -0.59  64       1      -0.62  99
All AUSTRALIA    1      -0.66  88       1      -0.63  99
Figure 54 Comparison of directly computed Ne from the AMFS data and Ne by the constant Ne model
8.4.2 FURTHER DISCUSSION
The coefficients of Equation 8.1 given in Table 44 suggest that the estimated constant Ne ≈ N for independent stations (ρ = 0), as should be the case. However, for totally dependent flood data (ρ = 1), the estimated constant Ne ≠ 1, in contrast to the theoretical expectation. This could be a manifestation of the simple linear form adopted for the relationship between the estimated constant Ne, N and ρ. It is noted that a quadratic equation might improve the fit in regions with high correlation. Indeed, for ρ → 1, the errors in the estimated constant Ne are high for large N values. However, it should be kept in mind that this issue would have little effect on estimates from methods using the LFRM approach, as the average correlation coefficient between sites is normally much smaller than one.
The use of the constant Ne model in applications with the LFRM approach is quite general,
and it can be applied anywhere in Australia. The main difficulty that could arise is
calculating the correlation coefficient for pairs of stations where only a limited
concurrent flood record is available. In such a situation, the alternative is first to
compute the correlation coefficient from a regional relationship with distance (see
Figure 39, Chapter 7) and then to apply Equation 8.1.
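As a concrete illustration, the constant Ne model can be applied in a few lines of code. This is a sketch only: it assumes Equation 8.1 takes the form ln(Ne) = (a + b ρ) ln(N), which is consistent with the Table 44 coefficients and reproduces the all-Australia Ne of 207 reported later in the chapter (Table 46); the exact form should be checked against Equation 8.1 itself.

```python
import math

def constant_ne(n_stations, rho_avg, a, b):
    """Effective number of independent stations, assuming Equation 8.1
    has the form ln(Ne) = (a + b * rho_avg) * ln(N)."""
    return n_stations ** (a + b * rho_avg)

# All-Australia coefficients from Table 44 (real data) and the average
# inter-site correlation of 0.26 from Table 45
ne = constant_ne(626, 0.26, a=1.0, b=-0.66)
print(round(ne))  # reproduces the Ne of about 207 reported in Table 46
```

For a pair of stations with little concurrent record, rho_avg would first be estimated from the regional correlation-distance relationship (Figure 39, Chapter 7) before calling the function.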
8.5 COMPARISON OF THE EFFECTIVE RECORD LENGTH ESTIMATES
USING THE CONSTANT Ne MODEL FOR THE REAL AND SIMULATED
DATASETS
The effective record lengths were estimated using the real and simulated constant Ne
models and Equation 7.9 (see Chapter 7). Figure 55 shows the typical variation of the total
record lengths and the effective record lengths from the real and simulated constant Ne
models. As expected, the differences are modest, with the effective record length
estimates from the simulated Ne model slightly higher than those of the real Ne model. It
can also be observed that the average correlation coefficient consistently decreases with
an increasing number of stations. This may be attributed to the fact that the extreme
observations from a network tend to be more independent, regardless of the high degree of
correlation which more frequent flows may exhibit. Similar results were found by
Nandakumar et al. (1997) with rainfall data.
[Figure 55: average correlation coefficient (0 to 1, left axis) and total/effective record length in years (0 to 25,000, right axis) versus number of stations (1 to 1000, log scale), with series for Australia Ne, L and Le from the real and simulated models]
Figure 55 Variation with number of sites: effective record lengths estimated using real and simulated
Ne models as a function of average correlation coefficient
8.6 REVISITING THE LFRM IN THE LIGHT OF SPATIAL DEPENDENCE
The LFRM for the study data in its current form (see Equations 7.7 and 7.8 – Chapter 7)
does not allow for the effect of inter-site dependence which reduces the net information
available for regional analysis. In this section spatial dependence is accounted for through
the use of the spatial dependence model derived in the previous sections (see Equation 8.1),
which defines the effective number of independent stations in a region (Ne) as a function of
the average correlation coefficient in the region. For this study, the use and calculation of
Ne for application with the LFRM is illustrated. Firstly, the average correlation for each
pair of sites was calculated for each state/region. The average correlation coefficients are
shown in Table 45.
Table 45 Average correlation coefficient (ρ) for each pair of sites for the different states/regions
Secondly, using Equation 8.1 along with the coefficients for the Australian spatial
dependence model given in Table 44 (for the real and simulated data) and the average ρ of
0.26, Ne was estimated. The calculated Ne value along with the effective record length is
given in Table 46. One can see from Table 46 that the results from the real data match
reasonably well with the simulated data, which represent the result if the region were
truly homogeneous. Another way of estimating the number of effective sites would be to
use the individual coefficient results for each state/region from Table 44 with Equation
8.1 along with the average correlation coefficient from Table 45. However, as discussed
in section 8.4, significant regional variability from state to state may exist, and the
use of a general model (i.e. the model using the Australian model coefficients) is
preferred.
Table 46 Total record length (L) and effective record length (Le) for the all-Australian
dataset

Region        | N   | L     | Constant Ne model, real coefficients: Ne* / Le | Constant Ne model, simulated coefficients: Ne* / Le
All Australia | 626 | 21049 | 207 (33%) / 6969                               | 228 (36%) / 7654

* Ne values in parentheses are percentages of N
Region/State  | Average ρ
NSW+QLD+VIC   | 0.22
TAS           | 0.20
WA            | 0.21
SA            | 0.42
NT            | 0.25
Average of ρ  | 0.26

Using the calculated Ne value of 207 (from the real dataset) in Equation 7.6 (Chapter 7)
instead of the total number of stations (626) to estimate the new plotting position of the
pooled data points (1 max, 3 max and 5 max), the new interpolated curve for Equation 7.7
(Chapter 7) becomes:

Ymax = C1^Ne (YT)^2 + C2^Ne (YT) + C3^Ne    (8.2)
Equation 8.2 is then substituted into Equation 7.8, which yields the new definition of the
LFRM (Equation 8.3) that corrects for the spatial dependence in the dataset. Equations
8.2 and 8.3 give the analytical expression of the LFRM model for the study dataset using
the 1, 3 and 5 maxima. The appropriate values of the coefficients of Equations 8.2 and
8.3 are given in Tables 38 (Chapter 7) and 47. One can clearly see the difference in the
coefficients of the LFRM when comparing the results of the dataset using N and Ne sites;
this is due to the reduction in the total useful information (i.e. the effective number of
stations). The new interpolated frequency curves can be seen in Figure 56 (top curve).
Qmax/mean = 1 + (C1^Ne (YT)^2 + C2^Ne (YT) + C3^Ne) CV    (8.3)
Table 47 Coefficients and R2 values of Ymax polynomial interpolation from Figure 56 for N
and Ne sites

Ne sites - Ymax | C1^Ne  | C2^Ne | C3^Ne | R2
1               | -0.025 | 0.71  | 1.42  | 0.996
3               | -0.045 | 0.95  | 0.78  | 0.997
5               | -0.054 | 1.06  | 0.44  | 0.999

N sites - Ymax  | C1     | C2    | C3    | R2
1               | -0.027 | 0.80  | 0.49  | 0.997
3               | -0.041 | 0.98  | -0.18 | 0.998
5               | -0.044 | 1.07  | -0.59 | 0.999
Figure 56 Frequency distribution of standardised Ymax values using N and Ne stations
What is striking in Figure 56 is the upward shift in the frequency curve of the pooled
data. Taking the 5 max plot as an example, at a Ymax value of approximately 3 it can be
seen that ignoring spatial dependence may notably underestimate the flood magnitude risk
(for N sites Ymax = 3 corresponds to an ARI of 55 years; for Ne sites Ymax = 3 corresponds
to an ARI of 20 years). For the pooling of the 5 max and correcting for spatial
dependence (see the 5 max plot in Figure 56), the range of Ymax values for which the
fitted model (referred to as LFRM_Ne henceforth) might be considered reliable is
approximately 1.5 to 5, which corresponds to ARIs of 10 to approximately 3000 years.
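The ARI figures quoted above can be reproduced by inverting the fitted quadratic of Equation 8.2. This is a sketch only: it assumes YT is the Gumbel reduced variate, YT = -ln(-ln(1 - 1/T)), which is consistent with the plotting-position construction in Chapter 7 but is an assumption here; the coefficients are the 5 max rows of Table 47.

```python
import math

def ari_for_ymax(y_max, c1, c2, c3):
    """Solve c1*y^2 + c2*y + c3 = y_max for the reduced variate y (taking
    the root on the rising limb of the curve), then convert y to an ARI
    via the Gumbel relation y = -ln(-ln(1 - 1/T))."""
    disc = c2 ** 2 - 4.0 * c1 * (c3 - y_max)
    y = (-c2 + math.sqrt(disc)) / (2.0 * c1)  # physical (smaller) root for c1 < 0
    return 1.0 / (1.0 - math.exp(-math.exp(-y)))

# Table 47, 5 max coefficients
ari_n = ari_for_ymax(3.0, -0.044, 1.07, -0.59)   # N sites, dependence ignored
ari_ne = ari_for_ymax(3.0, -0.054, 1.06, 0.44)   # Ne sites, dependence corrected
print(round(ari_n), round(ari_ne))  # the N-site ARI is roughly triple the Ne-site ARI
```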
Figure 57 shows the behaviour of the dimensionless quantiles derived from Equations 7.8
(Chapter 7) and 8.3 for ARIs of 50, 200 and 1000 years for all the pooled data (i.e. 1
max, 3 max and 5 max), for the quantiles estimated using N and Ne. The dimensionless
quantiles for the World model (referred to as the PM (world) model, based on 8500 gauging
stations around the world) developed by Majone et al. (2007) are also superimposed for
comparison. The comparison with the PM (world) curves in Figure 57 indicates that the
LFRM_Ne (1 max (626 data points), 3 max (1878 data points) and 5 max (3130 data points))
can explain most of the scatter in these plots, as the set of curves (the 50- and 200-year
ARI curves) for this extended ARI range (including the 1000-year ARI) captures most of the
upper part of the points in the pooled dataset of Q/mean values. However, the PM (world)
model seems to overestimate the Q/mean values for the Australian dataset, as its growth
curve for Q1000/mean lies above the scatter. The flatter slopes for 3 max and 5 max
(bottom two panels of Figure 57) are consistent with what was shown in Figure 43 (Chapter
7) and seem to reflect a weaker relationship of Q/mean with CV. Comparison of the 1 max
curves for Ne and N indicates that the allowance for spatial dependence has a smaller
influence on slope. Figure 57 also indicates that the extra data with 3 to 5 max provide
a better definition of the left hand tail of the distribution (the top few points in the
right hand tail are largely common to all three datasets (1 max, 3 max and 5 max)).
Figure 57 Various Qmax/mean quantiles derived from the LFRM_Ne model and PM (World) model
Table 48 lists the CV values for the different states of Australia along with catchment
areas and the largest Ymax values of the pooled data. Figure 58 shows how the LFRM_N
(i.e. 1 max) without correction for spatial dependence and the LFRM_Ne (i.e. 1 max) fit
the at-site data for the different ranges of CV values. As can be seen from this figure,
the LFRM_N and LFRM_Ne can provide reasonably accurate growth curve estimation for the
ARI range of 10 to 1000 years for CV values in the ranges 0.50-0.59, 0.60-0.69,
0.70-0.79, 0.80-0.89, 0.90-0.99, 1.00-1.10, 1.11-1.20, 1.21-1.40 and 1.41-1.60, and they
perform best in the CV range of 0.60-1.60 (approximately 81% (505 out of 626) of the
study catchments fall in this range). However, the LFRM_N and LFRM_Ne perform quite
poorly for CV values ranging from 0.18 to 0.49 and from 1.62 to 2.52 over a range of
ARIs, as seen in the plots of Figure 58. One can also see that the average CV values
(i.e. CVav) in Table 48 all fall in the best performance range of 0.60-1.60.
Table 48 CV values for study catchments in Australia

State | No. of stations | Avg record length (years) | CVmin | CVav | CVmax | Amin (km2) | Aav (km2) | Amax (km2) | Ymax (1 max)
VIC   | 131 | 33 | 0.32 | 0.86 | 1.69 | 3   | 320 | 997  | 5.26
NSW   | 96  | 34 | 0.58 | 1.08 | 1.83 | 8   | 352 | 1010 | 5.37
QLD   | 172 | 35 | 0.51 | 1.06 | 2.08 | 7   | 325 | 963  | 4.84
TAS   | 53  | 30 | 0.23 | 0.64 | 2.02 | 1.3 | 323 | 1900 | 5.74
WA    | 146 | 30 | 0.28 | 0.96 | 2.52 | 0.2 | 156 | 7406 | 5.47
SA    | 29  | 35 | 0.42 | 0.91 | 1.71 | 0.6 | 170 | 708  | 4.33
NT    | 55  | 35 | 0.18 | 0.84 | 1.49 | 1.4 | 581 | 4325 | 5.26
[Figure 58, first six panels (1 max): Q/mean versus ARI (years, log scale) for the LFRM_N and LFRM_Ne models in the CV ranges 0.18-0.49, 0.50-0.59, 0.60-0.69, 0.70-0.79, 0.80-0.89 and 0.90-0.99]
Figure 58 Empirical frequency distributions of Q/mean quantiles derived from the LFRM_N and
LFRM_Ne for different ranges of CV
[Figure 58, remaining panels (1 max): CV ranges 1.00-1.10, 1.11-1.20, 1.21-1.40, 1.41-1.60 and 1.62-2.52]
8.7 APPLICATION OF THE LFRM MODEL TO UNGAUGED
CATCHMENTS
The main interest here is the application of Equation 8.3 to ungauged catchments, which
requires the estimation of the mean flood and CV for the ungauged catchment in question.
The BGLSR and ROI approaches, as discussed in Chapter 3 and applied in Chapter 5, were
used to develop the prediction equations for the mean flood and CV of the AMFS data as
functions of catchment and climatic characteristics (predictor variables). The prediction
equation for the mean flood used a ROI of 30-40 stations, while 65-80 stations were used
for the CV, based on the findings of past studies (e.g. Haddad and Rahman, 2012; Rahman
et al., 2012) and on which state was being analysed.
8.7.1 DERIVATION OF PRIORS FOR THE MEAN FLOOD AND CV
As discussed previously and in more detail in Chapter 3, in order to apply the Bayesian
approach to the regional regression problem, one needs to formulate and define prior
distributions for the β coefficients and for the model error variance. Following Reis et
al. (2005), as no previous information on the β coefficients is available (this is the
case for the mean flood and CV), an almost non-informative prior is used. It consists of a
multivariate normal distribution with mean zero and a large variance such that the prior
distribution is relatively flat in the region of interest.
The prior information for the model error variance σ² (for the mean flood and CV) is
represented by an informative one-parameter (λ) exponential distribution, where λ is the
reciprocal of the prior expected mean value of the model error variance:

π(σ²) = λ exp(−λσ²), where σ² > 0    (8.4)

For the regionalisation of the mean flood, λ was set to the reciprocal of the residual
error variance estimate from ordinary least squares regression, so that this estimate is
taken as the expected prior mean of the model error variance.
Previous studies show that the model error variance of a GLS regional regression model of
scale and/or shape parameters may be zero if the method of moments (MOM) estimator is
employed (Madsen and Rosbjerg, 1997; Madsen et al., 2002; Reis et al., 2005; Haddad et
al., 2011b). This actually implies that the regional regression model is perfect, which is
considered to be unrealistic. Here, the Bayesian approach is developed further for the
analysis of a GLSR regional model that is employed to estimate the CV of AMFS. The
BGLSR model should provide a more reasonable estimator of the regional CV and its
uncertainty than the alternative MOM approach. One may also regionalise the standard
deviation of floods as done in Chapter 5; however regionalising CV allows its use more
directly with the LFRM concept.
For the regionalisation of CV, λ was set equal to 10. The rationale is as follows.
Inspection of Figure 58 shows that the LFRM performs best in the range of CV values from
0.60 to 1.60. Hence, if the true CV values were uniformly distributed between 0.5 and 2,
the variance would be approximately 1/5, which means the model error variance should be
less than 1/5. However, in order to be more realistic, λ was set equal to 10; in this
case there is still a probability of about 14% that σ² is greater than 1/5.
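The 14% figure follows directly from the exponential prior: with rate λ, the probability of exceeding a threshold t is exp(−λt). A quick check:

```python
import math

# Exponential prior pi(sigma2) = lam * exp(-lam * sigma2): the prior mean is
# 1/lam and the probability of exceeding a threshold t is exp(-lam * t).
lam = 10.0
p_exceed = math.exp(-lam * 0.2)  # P(model error variance > 1/5)
print(round(p_exceed, 3))  # 0.135, i.e. the ~14% quoted above
```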
8.7.2 ESTIMATION OF THE ERROR COVARIANCE MATRIX – ESTIMATION
OF THE SAMPLING ERROR VARIANCE
In BGLSR modelling, one requires an estimate of the sampling error covariance matrix.
However, it is difficult to obtain an exact expression for the error covariance matrix,
and its estimate is generally based solely on the data, as adopted by Stedinger and Tasker
(1985) and Madsen et al. (2002). In general, approximate expressions of the sampling
error variances for the mean flood and CV of floods can be formulated in terms of
population parameters. It must be noted, though, that to solve the BGLSR equations the
error covariance estimator should be independent, or nearly so, of the AMFS parameter
estimate ŷi (Stedinger and Tasker, 1985). Following a similar approach to that outlined
by Madsen and Rosbjerg (1997) and Madsen et al. (2002), an estimation procedure for the
sampling error variance that is nearly independent of the two AMFS parameters is
described below.
For the mean flood estimation (the mean flood was derived as the average of the AMFS at a
site), the sampling error variance is given by σi²/ni, where σi² is the population
variance. A reasonable estimate of σi² can be obtained from:

σ̂i² = (1/(ni − 1)) Σ_{j=1..ni} (qij − q̄i)²,  q̄i = (1/ni) Σ_{j=1..ni} qij    (8.5)
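In code, the estimator of Equation 8.5 is simply the unbiased sample variance of the AMFS divided by the record length. The flood series below is invented for illustration.

```python
def mean_sampling_variance(amfs):
    """Sampling error variance of the at-site mean flood (Equation 8.5):
    the unbiased sample variance of the AMFS divided by the record length."""
    n = len(amfs)
    q_bar = sum(amfs) / n
    s2 = sum((q - q_bar) ** 2 for q in amfs) / (n - 1)
    return s2 / n

# hypothetical annual maximum flood series (m3/s)
series = [120.0, 95.0, 210.0, 160.0, 80.0, 300.0, 140.0]
print(mean_sampling_variance(series))
```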
For estimation of the sampling error variance that is nearly independent of the at-site CV
estimate, the approximation suggested by Madsen and Rosbjerg (1997) and Reis et al.
(2005) is used, which is given by:
Var(ŷi) = (na/ni) Var(ȳ | na),  ȳ = (1/n) Σ_{i=1..n} ŷi,  na = int[(1/n) Σ_{i=1..n} ni]    (8.6)

where Var(ȳ | na) is the sampling variance computed as a function of the mean ȳ of the
statistic of interest (CV in this case) in the region, and na is the average number of
observations in the region.
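The scaling in Equation 8.6 is straightforward to code: the regional variance function is evaluated once at the average record length na and rescaled to each site. The record lengths and the regional variance value below are invented for illustration.

```python
def cv_sampling_variance(n_i, n_a, var_at_na):
    """Equation 8.6: scale the regional sampling variance evaluated at the
    average record length n_a back to a site with record length n_i."""
    return (n_a / n_i) * var_at_na

record_lengths = [28, 41, 33, 25, 37, 30]
n_a = int(sum(record_lengths) / len(record_lengths))  # integer average record length
var_na = 0.004  # Var(y | n_a) from the regional relationship (invented value)
print(n_a, [round(cv_sampling_variance(n, n_a, var_na), 5) for n in record_lengths])
```

Sites with shorter records than the regional average receive a proportionally larger sampling error variance, as Equation 8.6 intends.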
8.7.3 ESTIMATION OF THE SAMPLING ERROR – INTER-SITE
CORRELATION
For the estimation of cross correlation of parameter estimates between sites, all
corresponding AMFS data with concurrent record lengths were considered. The cross
correlation between the sample mean values, ρmean,ij, is equal to the correlation
coefficient ρij between the concurrent AMFS themselves. However, the correlation between
higher order sample moments depends on the order of the moment (Stedinger, 1983). For
example, for the CV estimates, the cross correlation coefficient is given by
ρcv,ij = ρij². Therefore, the effect of cross correlation dependence becomes less severe
for the higher order moments. In reality, the estimated cross correlation coefficients
have reasonably large sampling uncertainties associated with them. Therefore, direct use
of the sample estimates may result in an error covariance matrix (see Chapter 3) that
cannot be inverted. To overcome this problem, the cross correlation coefficients are
smoothed by relating the sample estimates to the distance between stations. In this study
the following exponential
correlation function is used:

ρij = θ^( dij / (1 + α dij) )    (8.7)

where dij is the distance between stations i and j, and θ and α are parameters to be
estimated from the data.
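A sketch of the smoothing step, assuming Equation 8.7 takes the common form ρij = θ^(dij/(1 + α dij)) (an assumption about the exact functional form); the parameter values below are invented, and in practice θ and α would be fitted to the sample correlations.

```python
def smoothed_rho(d_ij, theta, alpha):
    """Smoothed inter-site correlation as a function of distance:
    rho_ij = theta ** (d_ij / (1 + alpha * d_ij))."""
    return theta ** (d_ij / (1.0 + alpha * d_ij))

theta, alpha = 0.97, 0.006  # invented parameter values
for d in (0.0, 50.0, 200.0, 800.0):  # distances in km
    print(d, round(smoothed_rho(d, theta, alpha), 3))
# correlation is 1 at zero distance and decays smoothly with separation
```

Using the smoothed values in place of the raw sample correlations keeps the error covariance matrix well conditioned so it can be inverted.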
8.7.4 SOME ISSUES ASSOCIATED WITH REGIONAL ESTIMATION OF CV
In the plot (Figure 59), the sample values of CV calculated for the considered sites were
initially plotted against the corresponding catchment areas (an initial assumption was made
that CV might show some relationship with catchment area). It can be seen that there is a
high scatter of the data and that the high CV values correspond to a range of catchment
areas. Due to the high scatter of the data, Figure 59 cannot be used directly for the
estimation of CV in practical cases. As such the use of regression equations or more
formally the BGLSR in terms of catchment and climatic characteristics is most appealing.
Sections 8.7.5 and 8.7.6 provide further details on this.
[Figure 59: scatter of CV(Q) (0 to 3) against catchment area (0.01 to 10,000 km2, log scale)]
Figure 59 Relationship between CV and catchment area
8.7.5 SELECTION OF PREDICTOR VARIABLES
All the predictor variables as outlined in Table 3 (Chapter 4) were used as potential
predictors. Predictor variables were selected according to the approach outlined in Chapter
3 and section 3.6. To identify the model form, a fixed region approach was used where all
the catchments were considered to have formed one region (each state separately) and the
final choice for the preferred regional BGLSR model for the mean flood and CV was the
combination that best satisfied all the statistical criteria as discussed in section 3.6.
8.7.6 BGLSR RESULTS FOR MEAN AND CV
The stepwise regression procedure for selecting the best set of catchment/climatic
characteristics resulted in the following equation forms (Equations 8.8 and 8.9) for the
mean flood (mean) and CV for each Australian state. The regression equations are
presented in general form below, while the coefficients, each expressed by its posterior
mean value (i.e. β), for the final selected equations are tabulated in Table 49 along
with the model error variance (MEV), pseudo coefficient of determination (R2 GLSR) and
standard error of prediction (SEP) in %.

mean = β0 + β1(area) + β2(2I12)    (8.8)

CV = β0    (8.9)
Table 49 Summary of the finally selected BGLSR models for all the Australian states used
in the validation of the LFRM

State | Mean flood β0 / β1 / β2 | CV β0 | Mean flood MEV | CV MEV | Mean flood R2 GLSR | CV R2 GLSR | Mean flood SEP (%) | CV SEP (%)
VIC   | 3.72 / 0.61 / 1.14      | 0.88  | 0.29           | 0.0047 | 0.62               | -          | 60                 | 15
NSW   | 4.62 / 0.69 / 2.05      | 1.14  | 0.29           | 0.0078 | 0.76               | -          | 60                 | 15
QLD   | 5.20 / 0.65 / 1.70      | 1.06  | 0.16           | 0.0041 | 0.81               | -          | 42                 | 12
TAS   | 4.77 / 0.79 / 2.11      | 0.56  | 0.39           | 0.016  | 0.80               | -          | 72                 | 22
WA    | 0.32 / 0.82 / 1.19      | 0.97  | 0.88           | 0.010  | 0.81               | -          | 122                | 19
Figures 60 and 61 show example plots of the statistics used in selecting the best set of
predictor variables for the CV model for the state of NSW. Sample figures for the other
states can be seen in Appendix C. Figure 60 shows the MEV, the standard error of the MEV
and the R2 GLSR values for the CV model. Combination 6, with a constant and the two
predictor variables area and 2I12, showed the lowest MEV, one of the highest R2 GLSR
values, and some of the lowest Akaike information criterion (AIC) and Bayesian
information criterion (BIC) values. However, the lowest average variance of prediction
old (AVPO) and average variance of prediction new (AVPN) were found for combination 1 (a
constant value; see Figure 61). The adopted combinations of the predictor variables are
as noted in Chapter 5, Table 4, column 2.
The Bayesian plausibility value (BPV) was used to carry out a hypothesis test (at the 5%
significance level) on the predictors of combination 6. The BPVs were found to be 79% and
11% for area and 2I12 respectively, which shows the predictors not to be significant for
the estimation of CV at ungauged sites. Both the posterior coefficients β1 and β2 were
less than two posterior standard deviations away from zero, supporting the BPV result
that these variables are not significant.
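The two posterior standard deviation screen described above is easy to express in code; the coefficient and standard deviation values below are invented for illustration.

```python
def well_defined(beta_posterior_mean, beta_posterior_sd):
    """A coefficient is treated as well defined when its posterior mean lies
    more than two posterior standard deviations away from zero."""
    return abs(beta_posterior_mean) > 2.0 * beta_posterior_sd

print(well_defined(0.08, 0.06))  # False: within two sd of zero, not significant
print(well_defined(0.61, 0.05))  # True: well defined in the prediction equation
```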
The result above suggests that it may be possible to adopt a regional average CV value for
NSW without using any prediction equation/predictor variable. This finding is consistent
with Chapter 5 where it was found that a constant model for a regional skewness was the
best model for NSW and other Australian states. The finding above is also supported by the
fact that there was only a modest difference in the MEV values where combination 6
showed an MEV of 0.0076 compared to an MEV of 0.0078 for combination 1.
A similar outcome was observed for the estimation of CV for all the Australian states
(see figures in Appendix C). While there were cases where the prediction equations showed
reasonably high R2 GLSR and low MEV and AVP values, the BPV results consistently showed
these variables to be not significant. For this study, the simplest model was always
preferred.
[Figure 60: R2 GLSR (left axis) and MEV with its standard error (right axis) for predictor variable combinations 1 to 16]
Figure 60 Selection of predictor variables for the BGLSR model for CV
[Figure 61: AVPO, AVPN, AIC and BIC values for predictor variable combinations 1 to 16]
Figure 61 Selection of predictor variables for the BGLSR model for CV using AVPO, AVPN, AIC and
BIC
Figures 62 and 63 show example plots of the statistics used in selecting the best set of
predictor variables for the mean flood model for NSW. According to the MEV, combinations
3, 4, 5, 6, 10 and 11 were potential sets of predictor variables for the mean flood
model. Combinations 5, 6 and 11 contained two predictor variables with similar MEV and
R2 GLSR values.

The AVPO, AVPN, AIC and BIC values favoured combination 6, and hence this was finally
selected as the best set of predictor variables for the mean flood model, which includes
area and the design rainfall intensity 2I12. Both posterior coefficients β1 and β2 were
found to be seven posterior standard deviations away from zero, suggesting these two
variables are well defined in the prediction equation. The BPVs for the regression
coefficients associated with area and 2I12 were smaller than 0.001%, confirming their
significance. Combination 6 was selected for all the mean flood models for all the
Australian states in the validation.
[Figure 62: MEV, standard error of MEV and R2 GLSR for predictor variable combinations 1 to 16]
Figure 62 Selection of predictor variables for the BGLSR model for the mean flood
[Figure 63: AVPO, AVPN, AIC and BIC values for predictor variable combinations 1 to 16]
Figure 63 Selection of predictor variables for the BGLSR model for the mean flood using AVPO,
AVPN, AIC and BIC
8.7.7 BGLSR RESULTS FOR MEAN AND CV MODELS USING ROI
The regression equations based on the sets of predictor variables selected were used in the
ROI approach. The results obtained in the ROI approach were then used in the validation of
the LFRM_Ne model (i.e. Equation 8.3 in section 8.6). In the ROI approach, an optimum
region was formed for each of the 28 test catchments (see Figure 40, Chapter 7). As stated
earlier, the prediction equation for the mean flood used a ROI of 30-40 stations, while 65-
80 stations were used for the CV model, based on the findings from past studies (e.g.
Haddad and Rahman, 2012 and Rahman et al. 2012) and the state in question. The
summary of the various regression diagnostics (as described in section 3.8 and Equation
3.41, Chapter 3) for each test catchment is provided in Table 50 for the different Australian
states.
Table 50 shows that for the mean flood model (for all the states), the MEV and average
SEP values are much higher than those of the CV models. This indicates that the mean
flood model exhibits a higher degree of uncertainty than the CV models (i.e. the mean
flood would introduce more uncertainty into the LFRM model than the CV). An important
point here is that sampling error dominates the total error in the estimation of CV,
whereas for the mean flood the total error is dominated by model error; therefore, in the
case of CV the spatial variation is a second-order effect that is not really detectable.
This is apparent in both the fixed region and ROI approaches.
Table 50 Regression diagnostics for the ROI approach for the various Australian states and
test catchments

State / Station No. | Mean flood MEV | CV MEV | Mean flood R2 GLSR | CV R2 GLSR | Mean flood SEP (%) | CV SEP (%)
VIC
221210 | 0.21 | 0.007 | 0.61 | - | 50 | 17
225218 | 0.13 | 0.007 | 0.63 | - | 38 | 18
227211 | 0.22 | 0.008 | 0.66 | - | 50 | 18
401210 | 0.23 | 0.007 | 0.50 | - | 53 | 18
403213 | 0.22 | 0.008 | 0.59 | - | 51 | 18
404206 | 0.23 | 0.008 | 0.60 | - | 52 | 18
NSW
203012 | 0.23 | 0.012 | 0.80 | - | 54 | 18
210014 | 0.29 | 0.012 | 0.80 | - | 60 | 17
215004 | 0.20 | 0.011 | 0.80 | - | 49 | 17
410057 | 0.30 | 0.011 | 0.81 | - | 61 | 16
412050 | 0.33 | 0.011 | 0.81 | - | 65 | 16
419029 | 0.22 | 0.012 | 0.79 | - | 51 | 17
QLD
108002 | 0.12 | 0.012 | 0.78 | - | 35 | 17
116015 | 0.13 | 0.011 | 0.77 | - | 37 | 17
140002 | 0.19 | 0.009 | 0.64 | - | 44 | 19
416410 | 0.15 | 0.011 | 0.77 | - | 40 | 20
422394 | 0.13 | 0.011 | 0.80 | - | 37 | 20
919013 | 0.12 | 0.012 | 0.78 | - | 35 | 17
WA
607012 | 0.52 | 0.017 | 0.85 | - | 88 | 22
608004 | 0.49 | 0.016 | 0.83 | - | 83 | 22
610001 | 0.41 | 0.014 | 0.82 | - | 74 | 22
610007 | 0.43 | 0.014 | 0.82 | - | 76 | 23
612008 | 0.59 | 0.016 | 0.79 | - | 96 | 23
612010 | 0.53 | 0.015 | 0.80 | - | 89 | 23
TAS
2204   | 0.42 | 0.024 | 0.75 | - | 68 | 20
4201   | 0.41 | 0.023 | 0.78 | - | 66 | 20
304040 | 0.34 | 0.023 | 0.80 | - | 60 | 20
308799 | 0.38 | 0.023 | 0.85 | - | 63 | 20
For the mean flood model (for all the states), the ROI approach generally gives a smaller
MEV than the fixed region approach (compare Tables 50 and 49), which in turn gives lower
SEP values. Also, the R2 GLSR values for the mean flood model (all the states) with the
ROI approach are in most cases higher than for the fixed region approach. These results
indicate that the ROI approach should be preferred over the fixed region approach for
developing the mean flood model for use with the LFRM_Ne model. The MEV and SEP values
for the CV model are very similar for the fixed region and ROI approaches for all the
states (see Tables 49 and 50), indicating that either approach is suitable for developing
the CV model for use with the LFRM_Ne model. For the validation of the LFRM_Ne model in
this study, the CV model based on ROI is used.
From the above analysis it is clear that if a MOM estimator were used to estimate the MEV
(σ̂²) for the CV model, the uncertainty would have been grossly underestimated, as the
sampling error has heavily dominated the regional analysis. This would lead to an
over-reliance on the regional model. A more reasonable estimate of the MEV has been
achieved in this study with the Bayesian MEV estimator, as it represents the value of σ̂²
by computing expectations over the entire posterior distribution. One can see that the
exponential prior used in the Bayesian analysis has some influence on the posterior
distribution for the CV model. In the case of the CV and the ROI approach for NSW (as
shown in Figure 64), the posterior density function for the MEV is non-zero at the
origin, as will always be the case when λ > 0.
Figure 64 Prior and posterior pdf's for the model error variance for CV (right) and the mean flood
(left) models for NSW state
8.8 VALIDATION
The prediction equations developed above using the ROI approach, and Equation 8.3 (the
LFRM_Ne model), were applied to the 28 test catchments, which were not used in developing
the prediction equations. To make the comparison more useful and to benchmark the
LFRM_Ne model, the developed prediction equations were also used to estimate the mean
flood and CV for the PM (world) model developed by Majone et al. (2007). It should be
pointed out, however, that the PM (world) model does not contain any of the data used to
develop the Australian LFRM. The validation analysis was undertaken for ARIs up to 1000
years. ARIs in the range of 50 to 100 years were compared with at-site flood frequency
analysis (FFA) estimates (obtained from the fitted LP3 distribution; see Chapter 3 for
more details). Validating beyond the 100-year ARI against at-site FFA estimates was not
viewed as reliable given the very large extrapolation errors involved; indeed, any
validation results beyond the 100-year ARI would be of little significance for most of
the stations.
For the larger ARIs (200, 500 and 1000 years), comparison was made against the results
obtained from another regional method where the parameters of the LP3 distribution (i.e.
mean, standard deviation and skew) were regressed against catchment characteristics
(known as the PRT - see Chapters 3 and 5 for more details) and flood quantiles were then
derived for the 200-, 500- and 1000-year ARIs. The extrapolation of these distributions to
the large ARIs also involves a large degree of uncertainty.
To assess how well the developed prediction equations approximate the observed flood
quantiles, two numerical measures were applied. Relative bias (BIASr, defined by Equation
8.10) was used to assess whether the flood quantiles predicted by the LFRM_Ne or PM
(world) models systematically under- or overestimated the at-site FFA or PRT estimates on
average over all 28 test catchments.
BIASr = (1/ntest) Σ_{i=1..ntest} [ (LFRM_Ne_i (PM(world)_i) − FFA_i (PRT_i)) / LFRM_Ne_i (PM(world)_i) ] × 100    (8.10)

where ntest is the number of test catchments (28) used in the validation.
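Equation 8.10 can be sketched as follows; the quantile values are invented, and a positive result indicates that the regional model overestimates the benchmark on average.

```python
def relative_bias(predicted, benchmark):
    """Relative bias (Equation 8.10), in percent, of the regional (LFRM_Ne or
    PM world) quantiles against the FFA/PRT benchmark quantiles."""
    n = len(predicted)
    return 100.0 / n * sum((p - b) / p for p, b in zip(predicted, benchmark))

# invented 1000-year quantiles (m3/s) for three test catchments
lfrm_ne = [420.0, 150.0, 980.0]
ffa_prt = [400.0, 160.0, 900.0]
print(round(relative_bias(lfrm_ne, ffa_prt), 1))
```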
The relative error values (REr, defined by Equation 3.44, Chapter 3) with respect to the
at-site FFA or regional PRT estimates were also obtained. These are by no means the true
errors of the LFRM_Ne or PM (world) models; the errors represented here by BIASr and REr
may be taken as a reasonable indication of the consistency of the LFRM_Ne or PM (world)
models relative to the FFA and PRT estimates, both of which are themselves associated
with a higher degree of uncertainty due to the considerable extrapolation involved. It is
worth noting that in calculating the median relative error (REr), the sign of the
relative errors was ignored.
Table 51 summarises the various error statistics for the LFRM_N (i.e. no spatial
dependence) and LFRM_Ne models (considering the pooling of 1 max, 3 max and 5 max) and
the PM (world) model based on the 28 test catchments. If the issue of spatial dependence
in the Australian dataset is ignored, estimation for the 1000-year ARI using the LFRM_N
model suffers from minor underestimation on average (e.g. BIASr of 1%) for the ungauged
catchment case. Moreover, Table 51 shows that for 1 max, and when the pooling of more
data is undertaken (i.e. 3 max and 5 max) and spatial dependence is compensated for
(LFRM_Ne), the BIASr is well corrected. For example, for the 1000-year ARI, the BIASr
values for 1 max, 3 max and 5 max with the LFRM_Ne are 5, 8 and 9% overestimation on
average, respectively.
Focusing on the discussion for the 5 max results, for the ARIs of 50 to 1000-years, the
BIASr values are positive for both the LFRM_Ne and PM (world) models suggesting an
overestimation (on average) by both models. When compared to the results of the
preliminary LFRM models (i.e. Haddad et al., 2011b), the results obtained in this chapter
present a significant improvement: in Haddad et al. (2011b) the underestimation was up to
40% on average. By pooling more data and also accounting for inter-site dependence in the
LFRM model, the underestimation problem has been rectified. The results, as benchmarked
against the PM (world) model, are reassuring; this places a higher degree of confidence in
the estimates given by the LFRM_Ne model developed here.
The REr values in Table 51 show acceptable results, which are comparable to similar
regional models for the smaller ARI ranges (see Chapter 5 and also Rahman et al., 2012).
Focusing on the 5 max results, the REr values range from 31% to 61% (very comparable to
the PM (world) model), which suggests that the LFRM_Ne model performs very well given
the higher uncertainty associated with the larger ARI estimation using FFA
and PRT. It should be noted that in the PM (world) dataset most of the stations were
sufficiently well separated to be effectively independent of each other, which is why
Majone et al. (2007) did not need to work out an effective number of sites. This may also
be why the PM (world) model performs quite well in the validation here. The LFRM_Ne
model in this study refines the approach of the PM (world) model, since significant inter-site
dependence exists between stations in the Australian dataset.
A confidence interval plot of the BIASr values is given in Figure 65, which displays the
central tendency and variability of the sample BIASr values: the mean value (circle
symbol) with a 95% confidence interval bar for the 100- to 1000-year ARI flood quantiles.
While the mean values appear to differ between the two methods (i.e. the LFRM_Ne and
PM (world) models), the difference is not significant because the interval bars overlap,
suggesting that the LFRM_Ne model is comparable to the PM (world) model. Moreover, it
shows that consistency is achieved for the 3 and 5 max pooling LFRM_Ne models, as the
mean values and the spread of the BIASr values are very similar to those of the PM (world)
model.
Overall, the results show good agreement between the estimates of the LFRM_Ne/PM
(world) models and the at-site FFA/PRT results. For the 1000-year ARI (5 max), the results
can be regarded as 'good' for 20 of the 28 test catchments, 'acceptable' for 2, and 'poor'
for the remaining 6. Such results are typical of Australian RFFA studies even in the range
of ordinary ARIs (e.g. 2 to 100 years).
It was also found that the catchments showing underestimation were common to both
methods. It is worth noting that the LFRM_Ne model, on average, shows overestimation
relative to the PRT quantile estimates for some of the test catchments at the 500- and
1000-year ARIs. This is a vast improvement compared to the preliminary LFRM model
presented by Haddad et al. (2011b), where 17 out of the 18 test catchments showed
underestimation. The improvement in the results for the LFRM_Ne model developed here
may be attributed to the fact that the model pools more data and corrects for the spatial
dependence of the pooled standardised data. Indeed, taking into account the degree of
inter-station correlation has clearly reduced the negative bias of the flood quantile
estimates. It is envisaged that, as part of the future assessment of the LFRM_Ne model,
comparisons will be made against design flood estimates obtained by alternative methods
(e.g. spillway design and dam safety studies based on design-rainfall-based approaches).
Table 51 Summary of error statistics obtained from independent testing associated with the LFRM model

1 max LFRM_N
                     BIASr (%)                  REr (%)
ARI (years)    LFRM_N    World Model    LFRM_N    World Model
50             30        39             53        60
100            12        23             54        61
200            12        26             29        33
500            6         24             34        30
1000           -1        19             38        32

1 max LFRM_Ne
                     BIASr (%)                  REr (%)
ARI (years)    LFRM_Ne   World Model    LFRM_Ne   World Model
50             47        39             57        60
100            25        23             62        61
200            23        26             29        33
500            14        24             31        30
1000           5         19             34        32

3 max LFRM_Ne
                     BIASr (%)                  REr (%)
ARI (years)    LFRM_Ne   World Model    LFRM_Ne   World Model
50             50        39             57        60
100            29        23             61        61
200            26        26             31        33
500            18        24             31        30
1000           8         19             35        32

5 max LFRM_Ne
                     BIASr (%)                  REr (%)
ARI (years)    LFRM_Ne   World Model    LFRM_Ne   World Model
50             51        39             57        60
100            30        23             61        61
200            28        26             31        33
500            19        24             32        29
1000           9         19             35        32
Figure 65 Confidence interval plot of BIASr values with the LFRM_Ne and PM (world) models for the 28 test catchments
8.9 SUMMARY
This chapter has developed and tested the performance of a new LFRM that also accounts
for spatial dependence in the AMFS data. The model uses a comprehensive Australian
AMFS dataset consisting of 654 stations.
To estimate the equivalent number of independent sites (Ne), a simple model was derived
that ignored possible variation with ARI. To be able to establish meaningful results
regarding spatial dependence, the analysis was also carried out on simulated datasets to
check the sampling and homogeneity issues. Overall, the experimental results showed that
spatial dependence decreased with larger network sizes generally and that some Australian
states exhibited a greater degree of spatial dependence than others. While there were
limitations with this analysis, a reasonable indication of the behaviour of Ne was established.
The spatial dependence model was then generalised by developing an empirical
relationship between Ne and the average correlation coefficient in a network of the AMFS
data. To avoid inter-regional variation between the states, a general Australian spatial
dependence model was established. To be able to determine the functional form of the
spatial dependence model the analysis was carried out for the real and simulated datasets. It
was shown that both the real and simulated model coefficients were quite similar. It was
also illustrated that the scatter in the generalised spatial dependence model estimates
increased with increasing number of stations (N).
The LFRM was then revisited in light of spatial dependence, as established with the
derived generalised spatial dependence model. By pooling the top 5 maxima and
correcting the plotting position points, the regional growth curves showed a shift upwards,
and the new LFRM (termed the LFRM_Ne model henceforth) was considered reliable up
to the 3000-year ARI.
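The fitted Ne model itself is not reproduced in this summary. As a hedged illustration, the sketch below uses the standard equicorrelation closed form Ne = N / (1 + (N − 1)ρ̄) as a stand-in for the thesis's empirical spatial dependence model, and shows how replacing raw station-years with Ne-based effective station-years changes the plotting position (here via Cunnane's formula, which may differ from the formula actually used) and so shifts the growth curve upwards.

```python
def effective_sites(n_sites, rho_bar):
    # Effective number of independent sites for a network of n_sites
    # stations sharing a constant average inter-site correlation rho_bar.
    # This equicorrelation closed form is an illustrative stand-in for
    # the empirical Ne model fitted in the thesis.
    return n_sites / (1.0 + (n_sites - 1.0) * rho_bar)

n_sites, years, rho_bar = 100, 30, 0.1
ne = effective_sites(n_sites, rho_bar)

# Station-years of pooled record, uncorrected and corrected for dependence.
m_raw = n_sites * years   # 3000 station-years if all sites were independent
m_eff = ne * years        # far fewer effective station-years

# ARI assigned to the largest pooled value (rank i = 1) by the Cunnane
# plotting position, ARI = (M + 0.2) / (i - 0.4).
ari_raw = (m_raw + 0.2) / (1 - 0.4)
ari_eff = (m_eff + 0.2) / (1 - 0.4)
# ari_eff < ari_raw: the corrected curve assigns a smaller ARI to the same
# flood magnitude, i.e. the regional growth curve shifts upwards.
```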
In the last few sections of this chapter the LFRM_Ne model was applied to the ungauged
catchment case, where 28 test catchments not used in the development of the LFRM model
were used in the validation. This was achieved by developing regional regression equations
for the mean flood and CV of the AMFS data as functions of catchment/climatic
characteristics; BGLSR (see Chapters 3 and 5 for details) and the ROI framework were
used to achieve this. It was found that the mean flood can be described by two predictors,
catchment area and a representative design rainfall intensity. The CV showed no real
dependence on any predictor, and as such a regional average value was adopted for all the
states.
Finally, this chapter presented a validation which was undertaken to compare the flood
estimates from the LFRM_Ne model to those from established methods. For the estimation
up to the 100-year ARI the LFRM_Ne model results were compared to at-site flood
frequency analysis (FFA) results. For the larger ARIs (i.e. greater than 100-year ARI) they
were compared to estimates from the parameter regression technique. The LFRM_Ne model
was also benchmarked against the world model (i.e. PM (world)) as established by Majone
et al. (2007). It was found that the LFRM_Ne models that pool 3 and 5 maxima were able
to estimate the 1000-year ARI flood quantile with only small positive bias on average, with
very acceptable median relative errors. When compared with the PM (world) model, the
LFRM_Ne model produces consistent results. A note is made here that the dataset used to
establish the LFRM is totally independent of the PM (world) dataset. Overall the results
from the LFRM_Ne model are considered to be an improvement over the results of the
preliminary LFRM model by Haddad et al. (2011b). This indeed presents a notable
improvement to the way large floods can be estimated by regional methods for ungauged
catchments in Australia and the world.
Slight underestimation still exists with the developed LFRM_Ne model for some of the test
catchments. This is to be expected, as any RFFA model generally cannot explain all the
variability found in the data, given the simplicity of the RFFA approaches and the
associated data errors. It is envisaged that further improvements and refinements, outlined
in more detail in Chapter 9, can be made in the future.
CHAPTER 9: CONCLUSIONS
9.1 INTRODUCTION
This thesis focuses on the design flood estimation problem in ungauged catchments using
regional flood frequency analysis (RFFA). In particular, it investigates the research
question of how flood quantile estimation in ungauged catchments can be improved by
adopting an ensemble of advanced statistical techniques. These techniques include
Bayesian generalised least squares regression (BGLSR), the region of influence (ROI)
approach, and leave-one-out (LOO) and Monte Carlo cross validation (MCCV) procedures.
A large flood regionalisation model (LFRM), which explicitly accounts for the spatial
dependence in the annual maximum flood series (AMFS) data in the regional flood
modelling, is also proposed and investigated. The thesis also emphasises the importance of
collating a quality-controlled flood database and of uncertainty estimation in RFFA
methods.
Design flood estimation in the range of frequent to medium (2 – 100 years) and large to
rare (>100 to 2000 years) average recurrence intervals (ARI) is frequently required in the
design of many engineering works such as design of canals, spillways, dams, bridges, water
intakes, land use planning and flood insurance studies. These sorts of infrastructure works
and investigations are of notable economic significance, as highlighted in Chapter 1.
Traditionally, there have been several methods that are frequently adopted for these tasks.
For the frequent to medium floods, the most commonly adopted RFFA methods for small
to medium sized ungauged catchments include the probabilistic rational method (PRM), the
index flood method (IFM) and the quantile regression technique (QRT). In south–east
Australia, the PRM was recommended for general use in Australian Rainfall and Runoff
(ARR), mainly due to its simplicity and ease of application (I.E. Aust., 1987).
This thesis advocates the use of regression-based RFFA methods under the BGLSR
framework rather than PRM. The BGLSR has been developed and tested with the QRT and
the parameter regression technique (PRT). In forming the regions, both the fixed region and
ROI approaches have been examined in the range of frequent to medium ARI floods. The
detailed validation of the regional hydrological regression models has also been undertaken
using the popular LOO validation and the relatively new MCCV procedures.
In addition, a simple LFRM that accounts for spatial dependence in the AMFS data for
estimating large to rare floods at both gauged and ungauged sites has been developed. The
new LFRM is easy to use and offers an alternative to the traditional rainfall-based methods.
While summaries of the various modelling, development and testing tasks have been
provided at the end of each chapter of the thesis, an overview and the major findings of the
thesis are presented below.
9.2 OVERVIEW OF THE STUDY
9.2.1 DATA SELECTION (CHAPTER 4)
Initially, over 1000 stations across the Australian continent were selected for the study
based on a number of criteria, such as catchment size, streamflow record length,
streamflow data quality, degree of regulation, urbanisation and land use change. Further
examination indicated that many of these stations did not satisfy the criteria of
homogeneity and representativeness for the purpose of RFFA. Moreover, to reduce the
potential effects of inter-decadal variability, the minimum length of records (after infilling
of missing records) was increased up to 25 years where possible. This was necessary due to
the presence of a long drought that affected many stations after the late 1980s. The stations
that suffered from excessive error, due to rating curve extrapolation, were excluded.
Finally, a total of 682 catchments were selected for the study. These catchments are mainly
rural with no known major land use changes over the periods of streamflow records.
An outlier test was conducted for each of the selected stations. The influence of errors on
flood frequency curves from the extrapolation of rating curves was minimised by placing
limits on the degree of extrapolation involved in estimating the largest observed flood
events using the in-built tool in the FLIKE software (which implements the principles
outlined in Kuczera, 1999a, b). A total of 8 catchment characteristics that are perceived to
mainly govern the flood generation process and are relatively easy to obtain were selected
for this study. These catchment characteristics data were extracted for each of the selected
catchment (refer to Chapter 4 for more details).
9.2.2 RFFA IN THE FREQUENT TO MEDIUM ARI RANGE (CHAPTER 5)
Flood prediction equations were developed and compared for the states of New South
Wales (NSW), Victoria, Queensland and Tasmania (for ARIs of 2, 5, 10, 20, 50 and 100
years). Both the fixed region and ROI approaches in the QRT and PRT frameworks were
adopted, where the quantiles and parameters (i.e. mean, standard deviation and skew) of the
log Pearson Type 3 (LP3) distribution were regressed against the selected set of climatic
and catchment characteristics variables. The BGLSR procedure was adopted for the
estimation of the regression coefficients. The developed prediction equations (i.e. the
regression coefficients) were assessed for the ungauged catchment case by adopting a LOO
validation procedure.
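Prediction equations of this kind are log-linear in the catchment characteristics, i.e. of the form Q_T = a · Area^b · I^c. The sketch below fits such an equation to synthetic data by ordinary least squares in log space, as a simplified stand-in for the BGLSR actually used in the thesis; the catchment values and "true" coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50

# Synthetic catchments: area (km^2) and design rainfall intensity (mm/h).
area = rng.uniform(10, 1000, n)
intensity = rng.uniform(20, 80, n)

# Hypothetical underlying relation log10(Q_T) = 0.5 + 0.7*log10(A) +
# 1.2*log10(I) plus noise, standing in for real regional flood data.
log_q = 0.5 + 0.7 * np.log10(area) + 1.2 * np.log10(intensity) \
        + rng.normal(0, 0.1, n)

# Ordinary least squares in log space (the thesis uses BGLSR, which
# additionally weights sites by sampling and model error covariance).
X = np.column_stack([np.ones(n), np.log10(area), np.log10(intensity)])
coef, *_ = np.linalg.lstsq(X, log_q, rcond=None)

# Predicted quantile for a hypothetical ungauged catchment
# (area 250 km^2, intensity 45 mm/h).
q_hat = 10 ** (coef[0] + coef[1] * np.log10(250.0) + coef[2] * np.log10(45.0))
```

The recovered exponents approximate the assumed 0.7 and 1.2, illustrating why catchment area and design rainfall intensity alone can carry most of the predictive power in the QRT.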
9.2.3 MCCV VS LOO (CHAPTER 6)
Selecting the right regression model and ascertaining its predictive power are important
steps in any regional hydrologic regression analysis, which are usually undertaken by some
kind of validation e.g. split sample validation. This thesis assessed the performances of the
most commonly adopted LOO validation against the relatively new MCCV procedure.
Both validation procedures (i.e. LOO and MCCV) were carried out under the ordinary
least squares regression (OLSR) and GLSR frameworks for the estimation of flood
quantiles using simulated and regional flood data from the state of NSW in Australia.
9.2.4 LARGE TO RARE FLOOD ESTIMATION (CHAPTERS 7 and 8)
An overview of inter-site dependence in the Australian AMFS data was discussed.
Determination of homogenous regions and the identification of an appropriate probability
distribution were investigated and discussed in the context of the LFRM. The issues
relating to concurrent record lengths for the establishment of meaningful networks to carry
out the analysis of spatial dependence were presented. The theory of inter-site dependence
was outlined, and a simple model for estimating the effective number of independent sites
was derived.
Finally, the methodology underpinning the LFRM was developed for the Australian
continent and was applied with the developed spatial dependence model coupled with the
BGLSR. Here, the BGLSR was used to develop the prediction equations for the mean and
coefficient of variation (CV) of the annual maximum flood series data. The LFRM was
developed and tested to estimate large to rare floods for both the gauged and ungauged
catchment case. A split-sample validation was also carried out to compare the results of the
LFRM with the established methods such as the PRT (refer to Chapters 3 and 5 for more
details) and the international method (i.e. World Model).
9.3 CONCLUSIONS
9.3.1 DESIGN FLOOD ESTIMATION IN THE FREQUENT TO MEDIUM ARI
RANGE
It has been found that the ROI performs better than the fixed region approach in
RFFA. Hence, the ROI approach should be used where there are enough
geographically contiguous gauged catchments in a state/region.
It has been found that the Bayesian GLSR is preferable to OLSR in developing the
prediction equations for flood quantiles and flood statistics.
It has been found that the QRT-ROI and PRT-ROI perform very similarly. Hence,
the PRT is a viable alternative to QRT for design flood estimation in ungauged
catchments. The developed RFFA methods based on the QRT-ROI and PRT-ROI
allow design flood estimation along with its associated uncertainty (in the form of
confidence limits) given the relevant catchment characteristics data for the gauged
or ungauged catchment of interest.
It has been found that catchment area and design rainfall intensity are adequate for
the estimation of the flood quantiles with the QRT. Furthermore, catchment area,
design rainfall intensity, mean annual evaporation, mean annual rainfall, main
stream slope and forest cover are needed in the PRT for the estimation of the second
and third parameters of the LP3 distribution.
LOO validation indicates that the ROI based on the minimisation of the predictive
uncertainty leads to more efficient and accurate flood quantiles estimates by both
the QRT and PRT. The regression diagnostics reveal that the catchment
characteristics variables alone may not pick up all the heterogeneity in the regional
model. Both the BGLSR based QRT-ROI and PRT-ROI methods show
improvements in regional heterogeneity with an increase in the average pseudo R2_GLS
and a decrease in the model error variance, average variance of prediction and
the average standard error of prediction.
Both the standardised residual and quantile-quantile plots of the ROI analysis
satisfied the underlying model assumptions better than the fixed region regression.
It has been found that both BGLSR QRT-ROI and PRT-ROI produce smaller
average relative root mean square errors and median relative errors when compared
to the fixed region regression approach. Based on the evaluation statistics overall it
has been found that there are only modest differences between the BGLSR QRT-
ROI and PRT-ROI.
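To illustrate how the PRT turns regressed LP3 parameters into a flood quantile, the sketch below applies the Wilson-Hilferty approximation to the Pearson Type 3 frequency factor K_T, so that log10(Q_T) = M + K_T·S for regressed mean M, standard deviation S and skew g of log10 flows. The parameter values are hypothetical, not taken from the thesis.

```python
from statistics import NormalDist

def lp3_quantile(mean_log, std_log, skew, aep):
    # Flood quantile from LP3 parameters of log10(Q), for a given annual
    # exceedance probability `aep` (e.g. 0.01 for the 100-year event).
    z = NormalDist().inv_cdf(1.0 - aep)  # standard normal deviate
    if abs(skew) < 1e-6:
        k = z  # zero skew: LP3 reduces to a log-normal quantile
    else:
        g = skew
        # Wilson-Hilferty approximation to the Pearson III frequency
        # factor, adequate for moderate skew values.
        k = (2.0 / g) * ((1.0 + g * z / 6.0 - g * g / 36.0) ** 3 - 1.0)
    return 10 ** (mean_log + k * std_log)

# Illustrative regressed parameters for a hypothetical ungauged catchment:
# mean 2.5, standard deviation 0.3 and skew -0.2 of log10 flows.
q100 = lp3_quantile(mean_log=2.5, std_log=0.3, skew=-0.2, aep=0.01)
```

Note that a negative regressed skew pulls the upper-tail quantile below the log-normal value, which is why the PRT needs the second and third LP3 parameters and not just the mean.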
9.3.2 VALIDATION OF REGIONAL HYDROLOGICAL REGRESSION MODELS
From the simulation and real data examples, it has been found that when developing
regional hydrologic regression models, application of GLSR based MCCV
validation procedure is likely to result in the most parsimonious model as opposed
to the OLSR based LOO, OLSR based MCCV and GLSR based LOO validation
procedures.
The GLSR based MCCV has been found to exhibit the smallest mean squared
errors of prediction and also has fewer instances of problems with collinearity of
predictor variables as compared to the OLSR LOO and OLSR MCCV validation
procedures.
It has also been found that the MCCV and corrected MCCV (CMCCV) can
provide a more reasonable estimate of a model's predictive ability than LOO, and that
the CMCCV has the potential to offer a reasonable improvement over the MCCV in
estimating the predictive ability of a regional hydrologic regression model.
9.3.3 LARGE TO RARE FLOOD ESTIMATION
The development and application of a simplified LFRM that pools the top 3 and 5
annual maximum flood values from member sites in a region, coupled with the
BGLSR and a newly developed spatial dependence model have been established for
Australia.
A simple model for the effective number of independent stations (Ne) has been
developed that ignores possible variation with ARI. Meaningful results regarding
spatial dependence are established by undertaking the analysis on simulated
datasets to counteract sampling and homogeneity issues.
Overall, the experimental results of the analysis show that, in general, spatial
dependence decreases with larger network size and that some Australian states have
more spatial dependence than others. While there are some limitations with this
analysis, a reasonable indication of the behaviour of Ne has been established.
Using the derived generalised spatial dependence model, the LFRM has been
corrected for spatial dependence by correcting the plotting position points of the
LFRM frequency distribution curve and as such the regional growth curves all show
a shift upwards.
Finally, the LFRM has been applied to the ungauged catchment case. An
independent validation shows that the developed LFRM is able to estimate design
floods for 100 to 1000 years ARIs with reasonable confidence as compared to the
world model.
Overall, the newly developed LFRM coupled with BGLSR and a
spatial dependence model offers a powerful yet simple method of regional flood
estimation for floods of large to rare ARIs.
9.4 LIMITATIONS AND SUGGESTIONS FOR FUTURE RESEARCH
The RFFA methods for the frequent ARIs developed in this study were based on the flood
database available in eastern Australia up to the years 2004/2005. It is expected that
availability of a more comprehensive database (in terms of both quality and quantity) will
further improve the predictive performance of both the fixed and ROI based RFFA
methods presented in this study; this, however, needs to be investigated in the future when
such a database is available. Furthermore, with the availability of a more comprehensive
database, further research should be directed at incorporating the effects of climate change
into the developed RFFA models.
In the case of the BGLSR–QRT or PRT approaches, most of the uncertainty can be
accounted for on the left-hand side of the equation, i.e. in the dependent variable. In most
cases the predictor variables (e.g. design rainfall) are also subject to various errors
(sampling, measurement and model errors). There has been no study on the effects of these
errors on regional flood estimates. Therefore, the design flood estimates obtained in this
thesis may be biased, in the sense of overestimating the model error variance, leading to
uncertain regression coefficients and to uncertainty in the statistical diagnostics that rely
on the model error variance, such as the standard error of prediction and the average
variance of prediction.
In the conventional approach of RFFA using regression based procedures such as the
BGLSR-QRT or PRT approaches the predictor variables (for example design rainfall
intensity) that are statistically significant are chosen according to some goodness-of-fit
measure. The resulting regression relationship, along with the chosen predictor variables is
believed to be the "true" form of the model. In principle, this assumption is imperfect and
not satisfied in two respects:
(i) the predictor variables in the analysis are treated as fixed (i.e. non-random,
assumed not to follow a probability distribution); and
(ii) the predictor variables (e.g. design rainfall intensity) have underlying errors
(sampling, model and measurement errors), which are often ignored in the
analysis.
Firstly, the assumption of fixed predictor variables may not be satisfied in a hydrological
context. For example, to estimate the 10-year ARI flood quantile (Q10), we first estimate
values for the predictor variables (e.g. area, rainfall intensity and slope) and then estimate
Q10. In this case, the analysis treats the values of the predictor variables as fixed, which is
not considered a random outcome, as outlined in Koop (2008).
Secondly, it is assumed that the predictor variables are error free. However, the predictor
variables used in our RFFA study, such as the design rainfall intensity values published in
Australian Rainfall and Runoff (ARR) (I.E. Aust., 1987), are likely to suffer from a great
deal of uncertainty/error; for example, they were estimated from a limited rainfall dataset,
with many stations having very short records. The rainfall intensity estimates were fitted
with the LP3 distribution using the method of moments estimator. Thus, the estimates were
subject to a variety of errors (e.g. sampling variability, model error), which may contribute
to the overall errors in the final flood quantile estimates.
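A classical consequence of ignoring such predictor error is attenuation of the regression slope toward zero. The simulation below is purely illustrative (synthetic data, not the thesis dataset): it compares the OLS slope obtained with an error-free predictor against the slope obtained when the same predictor is contaminated with measurement error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# "Error-free" predictor (think of it as the true design rainfall
# intensity in log space) and the response it generates.
x_true = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x_true + rng.normal(0.0, 0.3, n)

# Observed predictor contaminated with measurement/sampling error.
x_obs = x_true + rng.normal(0.0, 0.5, n)

def slope(x, y):
    # OLS slope = cov(x, y) / var(x), both with ddof = 0 for consistency.
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

b_true = slope(x_true, y)  # close to the true value of 2.0
b_obs = slope(x_obs, y)    # attenuated by roughly
                           # var(x) / (var(x) + var(err)) = 1 / 1.25 = 0.8
```

In a RFFA setting this bias would propagate into the regression coefficients and, in turn, into the model error variance and the prediction diagnostics discussed above.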
To this end, some specific questions arise: What improvements in the final flood quantiles
can be gained from including all the possible errors (in both the dependent and predictor
variables) in the analyses? How can this uncertainty in RFFA be quantified and used to
develop confidence limits for flood quantile estimates? This is a very involved area of
research and, as such, was beyond the scope of this thesis. However, it is recommended
that this research be undertaken, as it will provide a new dimension to our understanding
of the uncertainty and errors in design flood estimation using RFFA.
The above future research suggestions may also be implemented in a hierarchical ROI
framework that includes dependence on exogenous covariates.
The following recommendations are made to further improve the LFRM method for large
to rare flood estimation:
A sensitivity analysis of the LFRM estimates to the number of selected highest
floods in each region or network should be investigated further based on a more
theoretical basis.
As a precursor to any analysis such as the LFRM, further data simulations should be
undertaken to determine the effects on estimated large flood values of any
violations of the basic assumptions on homogeneity and distribution (very useful in
this case as strict homogeneity was not satisfied and only the GEV distribution was
used (i.e. for derivation of the spatial dependence model)).
The search for a more appropriate form of the constant Ne model, or the
introduction of a variable Ne model (e.g. varying with ARI) for estimating the
effective number of independent stations, should be pursued on a more theoretical
basis.
The influence of the constant inter-site correlation assumption in the simulated data
which was also used to identify the functional form of the generalised spatial
dependence ‘constant Ne model’ should be examined more closely by using a wider
range of constant correlations. A ‘variable Ne model’ should also be examined in
this framework. As such, the use of Multivariate Copulas is recommended (e.g.
Favre et al., 2004).
The uncertainties in the LFRM should also be investigated using a Monte Carlo
simulation method. The main sources of error in the LFRM estimation are
introduced through parameter estimation errors in the constant Ne model and in the
fitting of the LFRM distribution by the mean flood and CV of annual maximum
flood series.
Further validation, analysis and testing should include deriving uncertainty
limits and comparing the LFRM estimates to those obtained from rainfall-runoff
modelling.
The steps outlined for future research on the uncertainty in regional flood estimates in the
range of 2 – 100 years ARI and the LFRM involve considerable time and effort and were
considered to be beyond the scope of this thesis.
REFERENCES
Acreman, M.C., Sinclair, C.D., 1986. Classification of drainage basins according to their physical characteristics: an application for flood frequency analysis in Scotland. J. Hydrol. 84, 365-380. Acreman, M.C., 1987. Regional flood frequency analysis in the UK: Recent research-new ideas. Rep. Inst. of Hydro. Wallingford, UK. Acreman, M.C., Wiltshire, S.E., 1987. Identification of regions for regional flood frequency analysis. EOS. 68 (44), 1262 (Abstract). Ahmad, M.I., Sinclair, C.D., Werrity, A., 1988. Log – logistic flood frequency analysis. J. Hydrol. 98, 205-224. Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. Autom. Cont. 19 (6), 716-722. Alexander, G.N., 1954. Some aspects of time series in hydrology. J. Inst. Eng. Aust. 26, 188-198. Alila, Y.P., Adamowski, K., Pilon, J., 1992. Regional homogeneity testing of low-flows using L moments. In: Proceedings of 12th conference on probability and statistics in the Atmospheric sciences, 5th International Meeting on Statistical Climatology, Toronto, Ont., 22-26 June, 1992. Anderson, H.W., 1957. Relating sediment yield to watershed variables. Trans. Am. Geophys. Union. 38, 921-924. Ashkanasy, N.M., 1985. To Bayes or not to Bayes – The future direction of statistical approaches in hydrology. Hydrology and Water Resources Symposium, 1985, Sydney, 14-16 May. Baratti, E., Montanari, A., Castellarin, A., Salinas, J.L., Viglione, A., Bezzi, A., 2012. Estimating the flood frequency distribution at seasonal and annual time scale. Hydrol. Earth Syst. Sci. 9, 7947-7967. Bates, B.C., 1994. Regionalisation of hydrological data: A review. Report 94/5. CRC for Catchment Hydrology, Monash University, Australia, pp 61. Bates, B.C., Rahman, A., Mein, R.G., Weinmann, P.E., 1998. Climatic and physical factors that influence the homogeneity of regional floods in south-eastern Australia. Water Resour. Res. 34 (12), 3369-3382. Benson, M.A., 1959. Channel slope factor in flood frequency analysis. J. Hydraul. 
Div. ASCE, 85, (HY4), 1-19. Benson, M.A., 1962. Evolution of methods for evaluating the occurrence of floods. U.S. Geol Surv. Water Supply Paper, 1580-A, 30pp.
REFERENCES
269
Benson, M.A., 1968. Uniform flood frequency estimating methods for federal agencies. Water Resour. Res. 4 (5), 981-908. Benson, M. A., Matalas, N.C., 1967. Synthetic hydrology based on regional statistical parameters. Water Resour. Res. 3 (4), 931-935. Bernier, J., 1967. Sur la thėorie du renouvellement et son application en hydrologie. Electricitė de France, Hyd 67 (10), 32. (in French) Bobėe, B., Cavidas, G., Ashkar, F., Bernier, J., Rasmussen, P., 1993. Towards a systematic approach to comparing distributions used in flood frequency analysis. J. Hydrol. 142, 121-136. Bocchiola, D., De Michele, C., Rosso, R., 2003. Review of recent advances in index flood estimation. Hydrol. Earth Syst. Sci. 7(3), 283-296. Brath, A., Castellarin, A., Montanari., A., 2003. Assessing the reliability of regional depth-duration-frequency equations for gauged and ungauged sites. Water Resour. Res. 39 (12), 1367, doi:10.1029/2003WR002399. Breiman, L., Freidman, J.H., Olsen, R.A., Stone, C., 1984. Classification and regression trees. Wadsworth: Belmont, CA. Buishand, T.A., 1984. Bivariate extreme-value data and the station-year method. J. Hydrol. 69, 77-95. Bunke, O., Droge. B., 1984. Bootstrap and cross-validation estimates of the prediction error for linear regression models. Annal. Statist. 12 (4), 1400-1424. Bureau of Meteorology, 2012. State of the Climate 2012. Australian Bureau of Meteorology and CSIRO. Burman, P.A., 1989. A comparative study of ordinary cross validation, v-fold cross-validation and repeated learning-tested methods. Biometrika. 76, 503-514. Burn, D.H., 1990a. An appraisal of the “region of influence” approach to flood frequency analysis. Hydrol. Sci. J. 35 (2), 149-165. Burn, D.H., 1990b. Evaluation of regional flood frequency analysis with a region of influence approach. Water Resour. Res. 26 (10), 2257-2265. Calenda G., Mancini C.P., Volpi, E., 2009. Selection of the probabilistic model of extreme floods: The case of the River Tiber in Rome. J Hydrol. 27, 1-11. 
Casella, G., George, E.I., 1992. Explaining the Gibbs sampler. Amer. Statist. Assoc. 46 (3), 167-174. Castellarin, A., 2007. Probabilistic envelope curves for design flood estimation at ungauged sites, Water Resour. Res. 43, W04406, doi:10.1029/2005WR004384. Castellarin, A., Vogel, R.M., Matalas, N. C., 2005. Probabilistic behaviour of a regional envelope curve. Water Resour. Res. 41, W06018, doi:10.1029/2004WR003042.
REFERENCES
Castellarin, A., Vogel, R.M., Matalas, N.C., 2007. Multivariate probabilistic regional envelopes of extreme floods. J. Hydrol. 336, 376-390.
Castellarin, A., Merz, R., Blöschl, G., 2009. Probabilistic envelope curves for extreme rainfall events. J. Hydrol. 378, 263-271.
Castiglioni, S., Castellarin, A., Montanari, A., 2009. Prediction of low-flow indices in ungauged basins through physiographical space-based interpolation. J. Hydrol. 378, 272-280, doi:10.1016/j.jhydrol.2009.09.032.
Chebana, F., Ouarda, T.B.M.J., 2008. Depth and homogeneity in regional flood frequency analysis. Water Resour. Res. 44 (11), W11422, doi:10.1029/2007WR006771.
Chow, V.T., Maidment, D.R., Mays, L.W., 1988. Applied Hydrology. McGraw-Hill, USA.
Chowdhury, J.U., Stedinger, J.R., Lu, L.-H., 1991. Goodness of fit tests for regional flood distributions. Water Resour. Res. 27 (7), 1765-1776.
Chowdhury, S., Sharma, A., 2009. Multi-site seasonal forecast of arid river flows using a dynamic model combination approach. Water Resour. Res. 45, W10428, doi:10.1029/2008WR007510.
Cohn, T.A., Lane, W.L., Baier, W.G., 1997. An algorithm for computing moments based flood quantile estimates when historical flood information is available. Water Resour. Res. 33 (9), 2089-2096.
Coles, S., 2001. An introduction to statistical modelling of extreme values. Springer, London.
Congdon, P., 2001. Bayesian statistical modelling. John Wiley & Sons, West Sussex.
Cooley, D., Naveau, P., Poncet, P., 2006. Variograms for spatial max-stable random fields. In: Statistics for Dependent Data, Lecture Notes in Statistics, Springer, doi:10.1007/0-387-36062-X_17.
Cooley, D., Davis, R., Naveau, P., 2010. The pairwise beta distribution: A flexible parametric multivariate model for extremes. J. Multivar. Anal. 101, 2103-2117.
Cunderlik, J.M., Burn, D.H., 2003. Non-stationary pooled flood frequency analysis. J. Hydrol. 276, 210-223.
Cunnane, C., 1988. Methods and merits of regional flood frequency analysis. J. Hydrol. 100, 269-290.
Cunnane, C., 1989. Statistical Distributions for Flood Frequency Analysis. World Meteorological Organisation, Operational Hydrology Report No. 33.
D'Agostino, R.B., Stephens, M.A., 1986. Goodness-of-Fit Techniques. Marcel Dekker, New York.
Dales, M.Y., Reed, D.W., 1989. Regional flood and storm hazard assessment. Rep. No. 2, Institute of Hydrology, Wallingford, Oxon, UK.
Dalrymple, T., 1960. Flood frequency analysis. Water Supply Paper 1543-A, U.S. Geological Survey, Reston, VA.
Dawdy, D.R., 1961. Variation of flood ratios with size of drainage area. U.S. Geol. Surv. Prof. Pap. 424-C, Paper C36.
Dawdy, D.R., Griffis, V.W., Gupta, V.K., 2012. Regional flood-frequency analysis: How we got here and where we are going. J. Hydrol. Eng. 17, 953-959.
Di Baldassarre, G., Castellarin, A., Brath, A., 2006. Relationships between statistics of rainfall extremes and mean annual precipitation: an application for design-storm estimation in northern Italy. Hydrol. Earth Syst. Sci. 10, 589-601.
Douglas, E.M., Vogel, R.M., 2006. The probabilistic behaviour of the flood of record in the United States. J. Hydrol. Eng. 11 (5), 482-488.
Draper, N.R., Smith, H., 1981. Applied regression analysis, 2nd ed. John Wiley, New York.
Dymond, J.R., Christian, R., 1982. Accuracy of discharge determined from a rating curve. Hydrol. Sci. J. 27, 493-504.
Efron, B., 1983. Estimating the error rate of a prediction rule: Some improvements on cross-validation. J. Amer. Stat. Assoc. 78, 316-331.
Efron, B., 1986. How biased is the apparent error rate of the prediction rule? J. Amer. Stat. Assoc. 81, 461-470.
El Adlouni, S., Bobée, B., Ouarda, T.B.M.J., 2008. On the tails of extreme event distributions in hydrology. J. Hydrol. 355, 16-33.
Eng, K., Tasker, G.D., Milly, P.C.D., 2005. An analysis of Region-of-Influence methods for flood frequency regionalisation in the Gulf-Atlantic rolling plains. J. Am. Water Resour. Assoc. 41 (1), 135-143.
Eng, K., Milly, P.C.D., Tasker, G.D., 2007a. Flood regionalization: A hybrid geographic and predictor-variable Region-of-Influence regression method. J. Hydrol. Eng. 12 (6), 585-591.
Eng, K., Stedinger, J.R., Gruber, A.M., 2007b. Regionalisation of streamflow characteristics for the Gulf-Atlantic Rolling Plains using Leverage-Guided Region-of-Influence regression. In: EWRI World Water & Environmental Resources Congress, American Society of Civil Engineers, 2007.
Faber, K., Kowalski, B.R., 1997. Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares. J. Chemom. 11, 181-238.
Favre, A.C., El Adlouni, S., Perreault, L., Thiémonge, N., Bobée, B., 2004. Multivariate hydrological frequency analysis using copulas. Water Resour. Res. 40 (1), W01101.
Feaster, T.D., Tasker, G.D., 2002. Techniques for estimating the magnitude and frequency of floods in rural basins of South Carolina, 1999. Water Resources Investigations Report 02-4140, U.S. Geological Survey, Columbia, South Carolina.
Ferrari, E., Gabriele, S., Villani, P., 1993. Combined regional frequency analysis of extreme rainfalls and floods. In: Extreme Hydrological Events: Precipitation, Floods and Droughts (Proc. Yokohama Symposium, July 1993), IAHS Publ. No. 213.
Fill, H.D., Stedinger, J.R., 1995a. L moment and PPCC goodness-of-fit tests for the Gumbel distribution and effect of autocorrelation. Water Resour. Res. 31 (1), 225-229.
Fill, H.D., Stedinger, J.R., 1995b. Homogeneity tests based upon Gumbel distribution and a critical appraisal of Dalrymple's test. J. Hydrol. 166, 81-105.
Fill, H.D., Stedinger, J.R., 1998. Using regional regression within index flood procedures and an empirical Bayesian estimator. J. Hydrol. 210, 128-145.
Flavell, D.J., 1982. The rational method applied to small rural catchments in the south west of Western Australia. Hydrology and Water Resour. Symp., 49-53.
Flavell, D.J., 1985. Australian Rainfall and Runoff revision. Civil College Tech. Report, Engineers Australia, 6 Sep 1985, pp. 1-4.
Flavell, D.J., Belstead, B.S., 1986. Losses for design flood estimation in Western Australia. Hydrology and Water Resour. Symp.
Fortin, V., Bernier, J., Bobée, B., 1997. Simulation, Bayes, and bootstrap in statistical hydrology. Water Resour. Res. 33 (3), 439-448, doi:10.1029/96WR03355.
Franks, S.W., Kuczera, G., 2002. Flood frequency analysis: evidence and implications of secular climate variability, New South Wales. Water Resour. Res. 38 (5), 1062, doi:10.1029/2001WR000232.
French, R., 2002. Flaws in the rational method. 27th National Hydrology and Water Resources Symp., 20-23 May, Melbourne.
Gaál, L., Kyselý, J., Szolgay, J., 2008. Region-of-influence approach to a frequency analysis of heavy precipitation in Slovakia. Hydrol. Earth Syst. Sci. 12, 825-839.
Galea, G., Michel, C., Oberlin, G., 1983. Maximal rainfall on a surface – the epicentre coefficient of 1 to 48-hour rainfall. J. Hydrol. 66, 159-167.
Gamble, S.K., Turner, K., Smythe, C., 1998. Application of the focussed rainfall growth estimation technique in Tasmania. Hydro Electric Corporation, Tasmania, Internal Report.
Gaume, E., Gaál, L., Viglione, A., Szolgay, J., Kohnová, S., Blöschl, G., 2010. Bayesian MCMC approach to regional flood frequency analyses involving extraordinary flood events at ungauged sites. J. Hydrol. 394 (1-2), 101-117, doi:10.1016/j.jhydrol.2010.01.008.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721-741.
Greene, W.H., 2003. Econometric Analysis. Prentice Hall, NJ.
Griffis, V.W., Stedinger, J.R., 2004. LP3 flood quantile estimators using at-site and regional information. In: Critical Transitions in Water and Environmental Resources Management, Proc. World Water & Environmental Resources Congress, Salt Lake City, Utah, June 27 - July 1, 2004, edited by G. Sehlke, D.F. Hayes and D.K. Stevens, ASCE, Reston, Virginia.
Griffis, V.W., Stedinger, J.R., 2007. The use of GLS regression in regional hydrologic analyses. J. Hydrol. 344, 82-95.
Grubbs, F.E., Beck, G., 1972. Extension of sample sizes and percentage points for significance tests of outlying observations. Technometrics 14 (4), 847-853.
Gruber, A.M., Stedinger, J.R., 2007. Models of regional skew based on Bayesian GLS regression. In: World Environmental & Water Resources Congress, Tampa, Florida, May 15-18, 2007.
Gruber, A.M., Stedinger, J.R., 2008. Models of LP3 regional skew, data selection, and Bayesian GLS regression. In: EWRI World Water & Environmental Resources Congress, American Society of Civil Engineers, Honolulu, HI, May 13-16, 2008.
Guse, B., Castellarin, A., Thieken, A.H., Merz, B., 2009. Effects of intersite dependence of nested catchment structures on probabilistic regional envelope curves. Hydrol. Earth Syst. Sci. 6, 2845-2892.
Guttman, N.B., 1993. The use of L-moments in the determination of regional precipitation climates. J. Clim. 6, 2309-2325.
Hackelbusch, A., Micevski, T., Kuczera, G., Rahman, A., Haddad, K., 2009. Regional flood frequency analysis for eastern NSW: A region of influence approach using generalised least squares log-Pearson 3 parameter regression. In: 32nd Hydrology and Water Resources Symp., Newcastle, 30 Nov - 3 Dec 2009.
Haddad, K., 2008. Design Flood Estimation in Ungauged Catchments Using a Quantile Regression Technique: Ordinary and Generalised Least Squares Methods Compared for Victoria. Masters (Honours) thesis, School of Engineering, University of Western Sydney, New South Wales.
Haddad, K., Rahman, A., 2008. Investigation on at-site flood frequency analysis in south-east Australia. IEM Journal, The J. Inst. Eng., Malaysia, 69 (3), 59-64.
Haddad, K., Rahman, A., Weinmann, P.E., Kuczera, G., Ball, J.E., 2010a. Streamflow data preparation for regional flood frequency analysis: Lessons from south-east Australia. Aust. J. Water Resour. 14 (1), 17-32.
Haddad, K., Zaman, M., Rahman, A., 2010b. Regionalisation of skew for flood frequency analysis: a case study for eastern NSW. Aust. J. Water Resour. 14 (1), 33-41.
Haddad, K., Rahman, A., 2011. Selection of the best fit flood frequency distribution and parameter estimation procedure: a case study for Tasmania in Australia. Stoch. Env. Res. Risk A. 25 (3), 415-428, doi:10.1007/s00477-010-0412-1.
Haddad, K., Rahman, A., Green, J., 2011a. Design rainfall estimation in Australia: A case study using L moments and generalized least squares regression. Stoch. Env. Res. Risk A. 25 (6), 815-825, doi:10.1007/s00477-010-0443-7.
Haddad, K., Rahman, A., Weinmann, P.E., 2011b. Estimation of major floods: applicability of a simple probabilistic model. Aust. J. Water Resour. 14 (2), 117-126.
Haddad, K., Rahman, A., Kuczera, G., 2011c. Comparison of ordinary and generalised least squares regression models in regional flood frequency analysis: A case study for New South Wales. Aust. J. Water Resour. 15 (2), 1-12.
Haddad, K., Rahman, A., 2012. Regional flood frequency analysis in eastern Australia: Bayesian GLS regression-based methods within fixed region and ROI framework: Quantile regression vs. parameter regression technique. J. Hydrol. 430-431, 142-161.
Haddad, K., Rahman, A., Stedinger, J.R., 2012. Regional flood frequency analysis using Bayesian generalized least squares: a comparison between quantile and parameter regression techniques. Hydrol. Process. 26, 1008-1021, doi:10.1002/hyp.8189.
Hardison, C.H., 1971. Prediction error of regression estimates of streamflow characteristics at ungauged sites. U.S. Geol. Surv. Prof. Pap. 750-C, C228-C236.
Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.
Hewa, G.A., McMahon, T.A., Peel, M.C., Nathan, R.J., 2003. Identification of the most appropriate regression procedure to regionalise extreme low flows. 28th Intl. Hydrology and Water Resour. Symp., 10-13 Nov 2003.
Hosking, J.R.M., 1990. L moments: analysis and estimation of distributions using linear combinations of order statistics. J. R. Statist. Soc. Ser. B 52 (1), 105-124.
Hosking, J.R.M., Wallis, J.R., 1988. The effect of intersite dependence on regional flood frequency analysis. Water Resour. Res. 24 (4), 588-600, doi:10.1029/WR024i004p00588.
Hosking, J.R.M., Wallis, J.R., 1991. Some statistics useful in regional frequency analysis. IBM Math. Res. Rep. RC 17096, IBM T.J. Watson Research Center, Yorktown Heights, N.Y., 23 pp.
Hosking, J.R.M., Wallis, J.R., 1993. Some statistics useful in regional frequency analysis. Water Resour. Res. 29 (2), 271-281.
Hosking, J.R.M., Wallis, J.R., 1997. Regional frequency analysis: an approach based on L-moments. Cambridge University Press, New York.
Hosking, J.R.M., Wallis, J.R., Wood, E.F., 1985. An appraisal of the regional flood frequency procedure in the UK Flood Studies Report. Hydrol. Sci. J. 30, 85-109.
Houghton, J.C., 1978. Birth of a parent: The Wakeby distribution for modelling flood flows. Water Resour. Res. 14 (6), 1105-1115.
Iacobellis, V., Gioia, A., Manfreda, S., Fiorentino, M., 2011. Flood quantiles estimation based on theoretically derived distributions: regional analysis in Southern Italy. Nat. Hazards Earth Syst. Sci. 11, 673-695, doi:10.5194/nhess-11-673-2011.
Institution of Engineers Australia (I.E. Aust.), 1987, 2001. Australian Rainfall and Runoff: A Guide to Flood Estimation. Edited by D.H. Pilgrim, Vol. 1, I.E. Aust., Canberra.
Interagency Advisory Committee on Water Data (IACWD), 1982. Guidelines for Determining Flood Flow Frequency: Bulletin 17-B (revised and corrected). Hydrol. Subcomm., Washington, DC, March 1982, 28 pp.
IPCC, 2007. The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC).
Ishak, E., Haddad, K., Zaman, M., Rahman, A., 2011. Scaling property of regional floods in New South Wales, Australia. Nat. Haz. J., doi:10.1007/s11069-011-9719-6.
Ishak, E., Rahman, A., Westra, S., Sharma, A., Kuczera, G., 2013. Evaluating the non-stationarity of Australian annual maximum flood. J. Hydrol. Accepted.
Jennings, M.E., Thomas, W.O., Jr., Riggs, H.C., 1994. Nationwide summary of U.S. Geological Survey regional regression estimates for estimating magnitude and frequency of floods for ungaged sites. Water Resources Investigations Report 94-4002, U.S. Geological Survey, Reston, Virginia.
Jin, M., Stedinger, J.R., 1989. Flood frequency analysis with regional and historical information. Water Resour. Res. 25 (5), 925-936.
Johnston, J., 1972. Econometric Methods. McGraw-Hill, New York.
Jothityangkoon, C., Sivapalan, M., 2003. Towards estimation of extreme floods: examination of the roles of runoff process changes and floodplain flows. J. Hydrol. 281, 206-229.
Juckem, P.F., Hunt, R.J., Anderson, M.P., Robertson, D.M., 2008. Effects of climate and land management change on streamflow in the driftless area of Wisconsin. J. Hydrol. 355, 123-130.
Juraj, M., Ouarda, T.B.M.J., 2007. Regional flood-rainfall duration-frequency modeling of small ungaged sites. J. Hydrol. 345, 61-69, doi:10.1016/j.jhydrol.2007.07.011.
Katz, R.W., Parlange, M.B., Naveau, P., 2002. Statistics of extremes in hydrology. Adv. Water Resour. 25, 1287-1304.
Kendall, M.G., 1970. Rank correlation methods, 4th ed. Griffin, London, 202 pp.
Khaliq, M.N., Ouarda, T.B.M.J., Ondo, J.-C., Gachon, P., Bobée, B., 2006. Frequency analysis of a sequence of dependent and/or non-stationary hydro-meteorological observations: A review. J. Hydrol. 329 (3-4), 534-552.
Kidson, R., Richards, K.S., 2005. Flood frequency analysis: assumptions and alternatives. Prog. Phys. Geog. 29 (3), 392-410, doi:10.1191/0309133305pp454ra.
Kirby, W., 1972. Computer oriented Wilson-Hilferty transformation that preserves the first 3 moments and lower bound of the Pearson Type 3 distribution. Water Resour. Res. 10 (2), 220-222.
Kitanidis, P.K., 1986. Parameter uncertainty in estimation of spatial functions: Bayesian analysis. Water Resour. Res. 22 (4), 499-507.
Kjeldsen, T.R., Rosbjerg, D., 2002. Comparison of regional index flood estimation procedures based on the extreme value type I distribution. Stoch. Env. Res. Risk A. 16, 358-373.
Kjeldsen, T.R., Jones, D.A., Bayliss, A.C., 2008. Improving the FEH Statistical Procedures for Flood Frequency Estimation. Final Research Report to the Environment Agency, R&D Project SC050050, CEH Wallingford, UK.
Kjeldsen, T.R., Jones, D.A., 2009a. An exploratory analysis of error components in hydrological regression modelling. Water Resour. Res. 45, W02407, doi:10.1029/2007WR006283.
Kjeldsen, T.R., Jones, D.A., 2009b. A formal statistical model for pooled analysis of extreme floods. J. Hydrol. Res. 40 (5), 465-480, doi:10.2166/nh.2009.055.
Kjeldsen, T.R., 2010. Modelling the impact of urbanization on flood frequency relationships in the UK. J. Hydrol. Res. 41 (5), 391-405, doi:10.2166/nh.2010.056.
Koop, G., 2008. Introduction to Econometrics. John Wiley & Sons, West Sussex, England.
Kroll, C.N., Stedinger, J.R., 1999. Development of regional regression relationships with censored data. Water Resour. Res. 35 (3), 775-784.
Kuczera, G., 1982. Combining site-specific and regional information: An empirical Bayes approach. Water Resour. Res. 18 (2), 306-314.
Kuczera, G., 1982a. Robust flood frequency models. Water Resour. Res. 18 (2), 315-324.
Kuczera, G., 1983a. A Bayesian surrogate for regional skew in flood frequency analysis. Water Resour. Res. 19 (3), 821-832.
Kuczera, G., 1983b. Effect of sampling uncertainty and spatial correlation on an empirical Bayes procedure for combining site and regional information. J. Hydrol. 65, 373-398.
Kuczera, G., 1992. Uncorrelated measurement error in flood frequency inference. Water Resour. Res. 28, 183-189.
Kuczera, G., 1996. Correlated measurement error in flood frequency inference. Water Resour. Res. 32, 2119-2128.
Kuczera, G., 1999a. Comprehensive at-site flood frequency analysis using Monte Carlo Bayesian inference. Water Resour. Res. 35 (5), 1551-1557.
Kuczera, G., 1999b. FLIKE HELP, Chapter 2 FLIKE Notes, University of Newcastle.
Kuczera, G., Parent, E., 1998. Monte Carlo assessment of parameter uncertainty in conceptual catchment models: the Metropolis algorithm. J. Hydrol. 211 (1-4), 69-85.
Kuczera, G., Franks, S., 2005. At-site flood frequency analysis. Australian Rainfall and Runoff, Book IV, Draft Chapter 2.
Kundzewicz, Z.W., Rosbjerg, D., Simonovic, S.P., Takeuchi, K., 1993. Extreme hydrological events in perspective. In: Extreme Hydrological Events: Precipitation, Floods and Droughts (Proc. Yokohama Symposium, July 1993), IAHS Publ. No. 213.
Laaha, G., Blöschl, G., 2007. A national low flow estimation procedure for Austria. Hydrol. Sci. J. 52 (4), 625-644, doi:10.1623/hysj.52.4.625.
Laio, F., 2004. Cramer-von Mises and Anderson-Darling goodness of fit tests for extreme value distributions with unknown parameters. Water Resour. Res. 40, W09308, doi:10.1029/2004WR003204.
Laio, F., Di Baldassarre, G., Montanari, A., 2009. Model selection techniques for the frequency analysis of hydrological extremes. Water Resour. Res. 45, W07416, doi:10.1029/2007WR006666.
Lamontagne, J., Stedinger, J.R., Ferris, J., Knifong, D., Veilleux, A., Curry, D., 2011. Regional skews for 1-day, 3-day, 7-day, 15-day, and 30-day duration discharge for the Central Valley region of California. Report Series XXXX-XXXX, U.S. Geological Survey (in press).
Law, G., Tasker, G.D., 2003. Flood-frequency prediction methods for unregulated streams of Tennessee, 2000. U.S. Geological Survey Water-Resources Investigations Report 03-4176.
Leadbetter, M.R., Lindgren, G., Rootzén, H., 1983. Extremes and related properties of random sequences and processes. Springer, New York.
Leclerc, M., Ouarda, T.B.M.J., 2007. Non-stationary regional flood frequency analysis at ungauged sites. J. Hydrol. 343, 254-265.
Lim, Y.H., Voeller, D.L., 2009. Regional flood estimations in Red River using L-moment-based index-flood and Bulletin 17B procedures. J. Hydrol. Eng. 14, 1002-1016.
Lu, L.-H., Stedinger, J.R., 1992. Sampling variance of normalized GEV/PWM quantile estimators and a regional homogeneity test. J. Hydrol. 138 (1-2), 223-245.
Ludwig, A.H., Tasker, G.D., 1993. Regionalization of low-flow characteristics of Arkansas streams. U.S. Geological Survey Water-Resources Investigations Report 93-4013.
Madsen, H., Rosbjerg, D., Harremoes, P., 1995. Application of the Bayesian approach in regional analysis of extreme rainfalls. Stoch. Hydrol. Hydraul. 9, 77-88.
Madsen, H., Rosbjerg, D., 1997. Generalised least squares and empirical Bayes estimation in regional partial duration series index-flood modelling. Water Resour. Res. 33 (4), 771-782.
Madsen, H., Pearson, C.P., Rosbjerg, D., 1997. Comparison of annual maximum series and partial duration series for modelling extreme hydrologic events, 2, Regional modelling. Water Resour. Res. 33 (4), 759-769.
Madsen, H., Mikkelsen, P.S., Rosbjerg, D., Harremoes, P., 2002. Regional estimation of rainfall intensity duration curves using generalised least squares regression of partial duration series statistics. Water Resour. Res. 38 (11), 1-11.
Madsen, H., Arnbjerg-Nielsen, K., Mikkelsen, P.S., 2009. Update of regional intensity-duration-frequency curves in Denmark: Tendency towards increased storm intensities. Atmos. Res. 92, 343-349.
Majone, U., Tomirotti, M., 2004. A trans-national regional frequency analysis of peak flood flows. L'Acqua, 2/2004, 9-17.
Majone, U., Tomirotti, M., Galimberti, G., 2007. A probabilistic model for the estimation of peak flood flows. Special Session 10, 32nd Congress of IAHR, Venice, Italy, July 1-6.
Marin, C., 1983. Uncertainty in water resources planning. PhD thesis, Harvard Univ., Cambridge, Mass.
Martens, H., Martens, M., 2001. Multivariate analysis of quality: An introduction. John Wiley & Sons, Chichester.
Martins, E.S., Stedinger, J.R., 2000. Generalized maximum likelihood GEV quantile estimators for hydrologic data. Water Resour. Res. 36 (3), 737-744.
Martins, E.S., Stedinger, J.R., 2001. Historical information in a GMLE-GEV framework with partial duration and annual maximum series. Water Resour. Res. 37 (10), 2551-2557.
Martins, E.S., Stedinger, J.R., 2002a. Cross-correlation among estimators of shape. Water Resour. Res. 38 (11), doi:10.1029/2002WR001589.
Martins, E.S., Stedinger, J.R., 2002b. Efficient regional estimates of LP3 skew using GLS regression. In: Proc. ASCE Conference on Water Resources Planning and Management, May 19-22.
Matalas, N.C., 1967. Mathematical assessment of synthetic hydrology. Water Resour. Res. 3 (4), 937-945.
Matalas, N.C., Benson, M.A., 1961. Effect of interstation correlation on regression analysis. J. Geophys. Res. 66 (10), 3285-3293.
Matalas, N.C., Gilroy, E.J., 1968. Some comments on regionalization in hydrologic studies. Water Resour. Res. 4 (6), 1361-1369.
McConachy, F.L.N., Xuereb, K., Smythe, C.J., Gamble, S.K., 2003. Homogeneity of rare to extreme rainfalls over Tasmania. 28th International Hydrology and Water Resour. Symp., Wollongong, The Institution of Engineers, Australia.
McCuen, R.H., 1979. Map skew??? J. Water Resour. Plan. Manage. Div., ASCE, 105 (WR2), 265-277 [with Closure, 107 (WR2), 582, 1981].
McCuen, R., Hromadka, T., 1988. Flood skew in hydrologic design on ungaged watersheds. J. Irrig. Drain. Eng. 114 (2).
McGilchrist, C.A., Woodyer, K.D., 1975. Note on a distribution-free CUSUM technique. Technometrics 17 (3), 321-325.
Merz, R., Blöschl, G., 2005. Flood frequency regionalisation – spatial proximity vs. catchment attributes. J. Hydrol. 302, 283-306.
Metropolis, N., Rosenbluth, A.W., Teller, A.H., Teller, E., 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1092.
Micevski, T., Franks, S.W., Kuczera, G., 2006. Multidecadal variability in coastal eastern Australian flood data. J. Hydrol. 327, 219-225.
Micevski, T., Kuczera, G., 2009. Combining site and regional flood information using a Bayesian Monte Carlo approach. Water Resour. Res. 45, W04405, doi:10.1029/2008WR007173.
Michaelsen, J., 1987. Cross-validation in statistical climate forecast models. J. Climate Appl. Meteor. 26, 1589-1600.
Moisello, U., 2007. On the use of partial probability weighted moments in the analysis of hydrological extremes. Hydrol. Process. 21, 1265-1279, doi:10.1002/hyp.6310.
Moss, M.E., Karlinger, M.R., 1974. Surface water network design by regression simulation. Water Resour. Res. 10 (3), 427-433.
Moss, M.E., Tasker, G.D., 1991. An intercomparison of hydrological network-design technologies. Hydrol. Sci. J. 36 (3), 209.
Mosteller, F., Tukey, J.W., 1977. Data analysis and regression: A second course in statistics. Addison-Wesley, Reading, Mass.
Mulvany, T.J., 1851. On the use of self registering rain and flood gauges in making observations of the relation of rainfall and of flood discharge in a given catchment. Trans. ICE Ire. 4, 18-31.
Nandakumar, N., Weinmann, P.E., Mein, R.G., Nathan, R.J., 1997. Estimation of extreme rainfalls for Victoria using the CRC-FORGE method. Report 97/4, Monash University.
Nandakumar, N., Weinmann, P.E., Mein, R.G., Nathan, R.J., 2000. Estimation of spatial dependence for the CRC-FORGE method. In: Proc. 'Hydro 2000' – 3rd International Hydrology and Water Resources Symposium, Perth, Inst. of Engineers, Australia, pp. 553-557.
Nathan, R.J., McMahon, T.A., 1990. Identification of homogeneous regions for the purpose of regionalisation. J. Hydrol. 121 (4), 217-238.
Nathan, R.J., Weinmann, P.E., 2001. The estimation of extreme floods – the need and scope for revision of our national guidelines. Aust. J. Water Eng. 1 (1), 40-50.
National Research Council, 1988. Estimating Probabilities of Extreme Floods: Methods and Recommended Research. National Academy Press, Washington, D.C., 141 pp.
Natural Environment Research Council (NERC), 1975. Flood Studies Report. NERC, London.
Ng, W.W., Panu, U.S., Lennox, W.C., 2007. Chaos based analytical techniques for daily extreme hydrological observations. J. Hydrol. 342, 17-41.
Novotny, E.V., Stefan, H.G., 2007. Stream flow in Minnesota: Indicator of climate change. J. Hydrol. 334, 319-333.
O'Connell, D.R.H., Ostenaa, D.A., Levish, D.R., Klinger, R.E., 2002. Bayesian flood frequency analysis with paleohydrologic bound data. Water Resour. Res. 38 (5), 16-1 to 16-4.
Olsen, J.R., Lambert, J.H., Haimes, Y.Y., 1999. Risk of extreme events under nonstationary conditions. Risk Analysis 18 (4), 497-510.
Oncirculation, 2011. http://oncirculation.com/2012/05/22/20102011
Overeem, A., Buishand, A., Holleman, I., 2009. Rainfall depth-duration frequency curves and their uncertainties. J. Hydrol. 348, 124-134.
Pandey, G.R., Nguyen, V.T.V., 1999. A comparative study of regression based methods in regional flood frequency analysis. J. Hydrol. 225, 92-101.
Parrett, C., Veilleux, A., Stedinger, J.R., Barth, N.A., Knifong, D., Ferris, J.C., 2010. Regional skew for California and flood frequency for selected sites in the Sacramento-San Joaquin River Basin based on data through water year 2006. OFR XXXX, U.S. Geological Survey (in press).
Pearson, C.P., 1991. New Zealand regional flood frequency analysis using L moments. J. Hydrol. New Zealand 30 (2), 53-64.
Pasquini, A.I., Depetris, P.J., 2007. Discharge trends and flow dynamics of South American rivers draining the southern Atlantic seaboard: An overview. J. Hydrol. 333, 385-399.
Pegram, G., 2002. Rainfall, rational formula and regional maximum flood – some scaling links. 27th National Hydrology and Water Resources Symp., 20-23 May, Melbourne.
Pericchi, L.R., Rodriguez-Iturbe, I., 1983. On some problems in Bayesian model choice in hydrology. The Statistician 32, 273-278.
Petersen-Øverleir, A., Reitan, T., 2009. Accounting for rating curve imprecision in flood frequency analysis using likelihood-based methods. J. Hydrol. 366, 89-100.
Picard, R.R., Cook, R.D., 1984. Cross-validation of regression models. J. Amer. Stat. Assoc. 79, 575-583.
Pilgrim, D.H., 1986. Bridging the gap between flood research and design practice. Water Resour. Res. 22 (9), 165S-176S.
Pilgrim, D.H., 1986. Estimation of large and extreme floods. Civil Eng. Trans., Institute of Engineers Australia, CE28, 62-73.
Pilgrim, D.H., Rowbottom, I.A., 1987. Chapter 13 – Estimation of large and extreme floods. In: Pilgrim, D.H. (ed.), Australian Rainfall and Runoff: A Guide to Flood Estimation, I.E. Aust., Canberra.
Pilgrim, D.H., Cordery, I., 1993. Flood runoff. In: Handbook of Hydrology, Chapter 9, edited by D.R. Maidment, McGraw-Hill, N.Y.
Pilon, P.J., Adamowski, K., 1991. Asymptotic variance of flood quantile in log Pearson Type III distribution with historical information. J. Hydrol. 143, 481-503.
Pilon, P.J., Adamowski, K., 1992. The value of regional information to flood frequency analysis using the method of L-moments. Can. J. Civ. Eng. 19 (1), 137-147.
Potter, K.W., Walker, J.F., 1981. A model of discontinuous measurement error and its effects on the probability distribution of flood discharge measurements. Water Resour. Res. 17 (5), 1505-1509.
Potter, K.W., Lettenmaier, D.P., 1990. A comparison of regional flood frequency estimation methods using a resampling method. Water Resour. Res. 26 (3), 415-424.
Prudhomme, C., Jakob, D., Svensson, C., 2003. Uncertainty and climate change impact on the flood regime of small UK catchments. J. Hydrol. 277, 1-23.
Pui, A., Lal, A., Sharma, A., 2011. How does the Interdecadal Pacific Oscillation affect design floods in Australia? Water Resour. Res. 47 (5), doi:10.1029/2010WR009420.
Racine, J., 2000. Consistent cross-validatory model selection for dependent data: hv-block cross-validation. J. Econ. 99, 39-61.
Rahman, A., 1997. Flood estimation for ungauged catchments: A regional approach using flood and catchment characteristics. PhD thesis, Department of Civil Engineering, Monash University.
Rahman, A., Bates, B.C., Mein, R.G., Weinmann, P.E., 1999a. Regional flood frequency analysis for ungauged basins in south-eastern Australia. Aust. J. Water Resour. 3 (2), 199-207.
Rahman, A., Weinmann, P.E., Mein, R.G., 1999b. At-site flood frequency analysis: LP3-product moment, GEV-L moment and GEV-LH moment procedures compared. In: Proc. 2nd Intl. Conference on Water Resour. and Env. Research, I.E. Aust., 6-8 July 1999, 2, pp. 715-720.
Rahman, A., Hollerbach, D., 2003. Study of runoff coefficients associated with the Probabilistic Rational Method for flood estimation in South-east Australia. In: Proc. 28th Hydrology and Water Resources Symp., 10-13 Nov, Wollongong, pp. 199-203.
Rahman, A., Haddad, K., Kuczera, G., Weinmann, P.E., 2009. Regional flood methods for Australia: data preparation and exploratory analysis. Australian Rainfall and Runoff Revision Projects, Project 5 Regional Flood Methods, Stage I Report No. P5/S1/003, Nov 2009, Engineers Australia, Water Engineering, 181 pp.
Rahman, A., Haddad, K., Zaman, M., Ishak, E., Kuczera, G., Weinmann, P.E., 2011a. Regional flood methods, Stage II, Project 5 Report, School of Engineering, University of Western Sydney, Australia.
Rahman, A., Haddad, K., Zaman, M., Kuczera, G., Weinmann, P.E., 2011b. Design flood estimation in ungauged catchments: A comparison between the Probabilistic Rational Method and Quantile Regression Technique for NSW. Aust. J. Water Resour. 14 (2), 127-140.
Rao, C.R., Toutenburg, H., 1999. Linear models: Least squares and alternatives. Springer-Verlag, New York.
Rao, R.A., Hamed, K., 2000. Flood frequency analysis. CRC Press LLC, 2000 NW Corporate Blvd., Boca Raton, Florida.
Reich, B.J., Shaby, B.A., 2012. A hierarchical max-stable spatial model for extreme precipitation. Ann. Appl. Stat. Accepted.
Reis Jr., D.S., Stedinger, J.R., Martins, E.S., 2003. Bayesian GLS regression with application to LP3 regional skew estimation. In: Proc. World Water & Environmental Resources Congress 2003, edited by P. Bizier and P. DeBarry, Philadelphia, PA, American Society of Civil Engineers, June 23-26, 2003.
Reis Jr., D.S., 2005. Flood frequency analysis employing Bayesian regional regression and imperfect historical information. PhD thesis, Cornell University, 210 pp.
Reis Jr., D.S., Stedinger, J.R., 2005. Bayesian MCMC flood frequency analysis with historical information. J. Hydrol. 313, 97-116.
Reis Jr., D.S., Stedinger, J.R., Martins, E.S., 2005. Bayesian GLS regression with application to LP3 regional skew estimation. Water Resour. Res. 41, W10419, doi:10.1029/2004WR00344.
Reitan, T., Petersen-Øverleir, A., 2008. Bayesian power-law regression with a location parameter, with applications for construction of discharge rating curves. Stoch. Env. Res. Risk A. 22, 351-365.
Rencher, A.C., 2000. Linear models in statistics. Wiley Series in Probability and Statistics, John Wiley & Sons, Inc.
Riggs, H.C., 1973. Regional analyses of streamflow techniques. Techniques of Water Resources Investigations of the U.S. Geol. Surv., Book 4, Chapter B3, U.S. Geol. Surv., Washington D.C.
Robson, A.J., Reed, D.W., 1999. Flood estimation handbook Vol 3: Statistical procedures for flood frequency estimation. Institute of Hydrology, Wallingford, United Kingdom.
Rosbjerg, D., Madsen, H., 1994. Uncertainty measures of regional flood frequency analysis estimators. J. Hydrol. 167, 209-224.
Rosbjerg, D., 2007. Regional flood frequency analysis. In: Hydrological events: New concepts for security. Springer Netherlands, doi:10.1007/978-1-4020-5741-0_12.
Rossi, F., Fiorentino, M., Versace, P., 1984. Two-component extreme value distribution for flood frequency analysis. Water Resour. Res. 20 (7), 847-856.
Rosso, R., 1985. A linear approach to the influence of discharge measurement error on flood estimates. Hydrol. Sci. J. 30, 137-254.
Rowbottom, I.A., Pilgrim, D.H., Wright, G.L., 1986. Estimation of rare floods (between the probable maximum flood and the 1 in 100 flood). Civil Eng. Trans., Institute of Engineers Australia, CE28, 92-105.
Salas, J.D., Wold, E.E., Jarrett, R.D., 1994. Determination of flood characteristics using systematic, historical and paleoflood data. In: G. Rossi et al. (eds), Coping with Floods, 111-134, Kluwer Academic Publishers, Netherlands.
Sankarasubramanian, A., Lall, U., 2003. Flood quantiles in a changing climate: Seasonal forecasts and causal relations. Water Resour. Res. 39, 51134, doi:10.1029/2002WR001593.
Scholz, F.W., Stephens, M.A., 1987. K-sample Anderson-Darling tests. J. Am. Statist. Assoc. 82, 918-924.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Stat. 6 (2), 461-464.
Shao, J., 1993. Linear model selection by cross validation. J. Amer. Stat. Assoc. 88, 486-494.
Shuzheng, C., Yinbo, X., 1987. The effect of discharge measurement error in flood frequency analysis. J. Hydrol. 96, 237-254.
Sivapalan, M., Takeuchi, K., Franks, S.W., Gupta, V.K., Karambiri, H., Lakshmi, V., Liang, X., McDonnell, J.J., Mendiondo, E.M., O’Connell, P.E., Oki, T., Pomeroy, J.W., Schertzer, D., Uhlenbrook, S., Zehe, E., 2003. IAHS Decade on Predictions in Ungauged Basins (PUB), 2003-2012: Shaping an exciting future for the hydrological sciences. Hydrol. Sci. J. 48 (6), 857-880.
Smith, J.A., 1987. Estimating the upper tail of flood frequency distributions. Water Resour. Res. 23 (18), 1657-1666.
Smith, J.A., 1992. Representation of basin scale in flood peak distributions. Water Resour. Res. 28 (11), 2993-2999.
Song Xu, Q., Zeng Liang, Y., 2001. Monte Carlo cross validation. Chemo. Int. Lab. Sys. 56, 1-11.
Song Xu, Q., Zeng Liang, Y., Ping Du, Y., 2005. Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration. J. Chemo. 18, 112-120, doi:10.1002/cem.858.
Stedinger, J.R., 1983a. Estimating a regional flood frequency distribution. Water Resour. Res. 19 (2), 503-510.
Stedinger, J.R., Cohn, T.A., 1986. Flood frequency analysis with historical and paleoflood information. Water Resour. Res. 22 (5), 785-793.
Stedinger, J.R., Lu, L.H., 1995. Appraisal of regional and index flood quantile estimators. Stoch. Hydrol. Hydraul. 9 (1), 49-75.
Stedinger, J.R., Tasker, G.D., 1985. Regional hydrologic analysis, 1. Ordinary, weighted, and generalised least squares compared. Water Resour. Res. 21 (9), 1421-1432.
Stedinger, J.R., Tasker, G.D., 1986. Correction to “Regional hydrologic analysis, 1. Ordinary, weighted, and generalised least squares compared”. Water Resour. Res. 22 (5), 844.
Stedinger, J.R., Tasker, G.D., 1986. Regional hydrologic analysis, 2. Model error estimators, estimation of sigma and log-Pearson type 3 distributions. Water Resour. Res. 22 (10), 1487-1499.
Stedinger, J.R., Vogel, R.M., Foufoula-Georgiou, E., 1993. Frequency analysis of extreme events. In: Handbook of Hydrology, McGraw Hill Book Co., NY, pp. 18.1-18.66 (Chapter 18).
Stewart, E.J., Reed, D.W., Faulkner, D.S., Reynard, N.S., 1999. The FORGEX method of rainfall growth estimation I: Review of requirement. Hydrol. Earth Syst. Sci. 3 (2), 187-195.
Stone, M., 1974. Cross validatory choice and assessment of statistical predictions. J. Royal Stat. Soc. 36 (2), 111-147.
Strahler, A.N., 1950. Equilibrium theory of erosional slopes approached by frequency distribution analysis. Amer. J. Sci. 248, 673-696, 800-814.
Sun, R., Chen, L., Bojie, F., 2011. Predicting monthly precipitation with multivariate regression methods using geographic and topographic information. J. Phys. Geo. 32 (3), 269-285, doi:10.2747/0272-3646.32.3.269.
Svensson, C., Jones, D.A., 2010. Review of rainfall frequency estimation methods. J. Flood Risk Manag. 3, 296-313, doi:10.1111/j.1753-318X.2010.01079.x.
Tasker, G.D., 1980. Hydrologic regression and weighted least squares. Water Resour. Res. 16 (6), 1107-1113.
Tasker, G.D., 1989. Regionalization of low flow characteristics using logistic and GLS regression. In: Proceedings of Symposium on New Directions for Surface Water Modeling, Baltimore, IAHS Publ. No. 181, 323-331.
Tasker, G.D., Driver, N.E., 1988. Nationwide regression model for predicting urban runoff water quality at unmonitored sites. Water Resour. Bul. 24 (5), 1091-1101.
Tasker, G.D., Eychaner, J.H., Stedinger, J.R., 1986. Application of generalised least squares in regional hydrologic regression analysis. US Geol. Survey Water Supply Paper 2310, 107-115.
Tasker, G.D., Hodge, S.A., Barks, C.S., 1996. Region of influence regression for estimating the 50-year flood at ungauged sites. Water Resour. Bull. 32 (1), 163-170.
Tasker, G.D., Moss, M.E., 1979. Analysis of Arizona flood data network for regional information. Water Resour. Res. 15 (6), 1791-1796.
Tasker, G.D., Stedinger, J.R., 1986. Estimating generalised skew with weighted least squares regression. J. Water Resour. Plan. and Manage. 112 (2), 225-237.
Tasker, G.D., Stedinger, J.R., 1987. Regional regression of flood characteristics employing historical information. In: W.H. Kirby, S.Q. Hua and L.R. Beard (eds), Analysis of Extra-ordinary Flood Events. J. Hydrol. 96, 255-264.
Tasker, G.D., Stedinger, J.R., 1989. An operational GLS model for hydrologic regression. J. Hydrol. 111, 361-375.
Thomas, D.M., Benson, M.A., 1970. Generalization of streamflow characteristics from drainage basin characteristics. US Geological Survey Water Supply Paper 1975, 55 pp.
Thomas, Jr., W.O., Olsen, S.A., 1992. Regional analysis of minimum streamflow. In: Proceedings of 12th Conference on Probability and Statistics in the Atmospheric Sciences, 5th International Meeting on Statistical Climatology, Toronto, Ont., 22-26 June, 1992, pp. 261-266.
Tsakiris, G., Nalbantis, I., Cavadias, G., 2011. Regionalization of low flows based on Canonical Correlation Analysis. Adv. Water Resour. 34, 865-872, doi:10.1016/j.advwatres.2011.04.007.
Tung, Y., Mays, L., 1981a. Generalized skew coefficients for flood frequency analysis. Water Resour. Bul. 17 (2).
Tung, Y., Mays, L., 1981b. Reducing hydrologic parameter uncertainty. J. Water Resour. Plan. and Manage. Div. 107, No. WR1.
Van Gelder, P.H.A.J.M., Wang, W., Vrijling, J.K., 2007. Statistical estimation methods for extreme hydrological events. In: O.F. Vasiliev et al. (eds), Extreme hydrological events: New concepts for security, 199-252, Springer.
Vannitsem, S., Naveau, P., 2007. Spatial dependences among precipitation maxima over Belgium. Nonlin. Processes. Geophys. 14, 621-630.
Veilleux, A.G., Stedinger, J.R., Lamontagne, J.R., 2011. Bayesian WLS/GLS regression for regional skewness analysis for regions with large cross-correlations among flood flows. In: EWRI World Environmental and Water Resources Congress, Palm Springs, California, United States, May 22-26, 2011.
Venetis, C., 1970. A note on the estimation of the parameters in a logarithmic stage-discharge relationship with estimation of their error. Bulletin IASH 15, 105-111.
Vogel, R.M., Kroll, C.N., 1989. Low-flow frequency analysis using probability-plot correlation coefficients. J. Water Resour. Plann. Mgmt. ASCE 115 (3), 338-357.
Vogel, R.M., Kroll, C.N., 1990. Generalised low-flow frequency relationships for ungauged sites in Massachusetts. Water Resour. Bul. 26 (2), 241-253.
Vogel, R.M., McMahon, T.A., Chiew, F.H.S., 1993. Flood flow frequency model selection in Australia. J. Hydrol. 146, 421-449.
Vogel, R.M., Matalas, N.C., England, J.F., Castellarin, A., 2007. An assessment of exceedance probabilities of envelope curves. Water Resour. Res. 43, W07403, doi:10.1029/2006WR005586.
Vrac, M., Naveau, P., Drobinski, P., 2007. Modeling pairwise dependencies in precipitation intensities. Nonlin. Processes. Geophys. 14, 789-797.
Wallis, J.R., Wood, E.F., 1985. Relative accuracy of Log Pearson 3 procedures. J. Hydrol. 111, 1043-1057.
Williamson, D.R., Van Der Wel, B., 1991. Quantification of the impact of dryland salinity on the Mount Lofty Ranges, SA. Intl. Hydrology and Water Resour. Symp., 48-52.
Wiltshire, S.E., 1986a. Identification of homogeneous regions for flood frequency analysis. J. Hydrol. 84 (3-4), 287-302.
Wiltshire, S.E., 1986b. Regional flood frequency analysis I: Homogeneity statistics. Hydrol. Sci. J. 31 (3), 321-333.
WMO, 1994. Guide to hydrological practices: data acquisition and processing, analysis, forecasting and other applications. WMO-No. 168, Geneva.
Wood, E.F., Rodriguez-Iturbe, I., 1975. Bayesian inference and decision making for extreme hydrological events. Water Resour. Res. 11 (4), 533-542.
Xuereb, K.C., Moore, G.J., Taylor, B.F., 2001. Development of the method of storm transposition and maximisation for the West Coast of Tasmania. Bureau of Meteorology, Australia, Hydrology Report Series, HRS Report No. 7, 2001.
Zaman, M., Rahman, A., Haddad, K., Hagare, D., 2012. Identification of best-fit probability distribution for at-site flood frequency analysis: A case study for Australia. In: Hydrology and Water Resources Symposium, Engineers Australia, 19-22 Nov 2012, Sydney, Australia.
Zellner, A., 1971. An Introduction to Bayesian Inference in Econometrics. John Wiley and Sons, Inc., New York.
Zhang, P., 1993. Model selection via multifold cross validation. Ann. Stat. 21, 299-313.
Zhu, Y., Day, R.L., 2005. Analysis of streamflow trends and the effects of climate in Pennsylvania, 1971 to 2001. J. American Water Resour. Assoc. 41 (6), 1393-1405.
Zrinji, Z., Burn, D.H., 1996. Regional flood frequency with hierarchical region of influence. J. Water Resour. Plann. Mgmt. ASCE 122 (4), 245-252.
APPENDIX A
A.1 PUBLISHED PAPERS FROM THIS RESEARCH
Haddad, K., Rahman, A., Zaman, M. and Shrestha, S. (2012). Applicability of Monte Carlo Cross Validation Technique for Model Development and Validation in Hydrologic Regression Analysis Using Ordinary and Generalised Least Squares Regression. Journal of Hydrology (ERA, Rank A*, Accepted with minor revision).
Haddad, K. and Rahman, A. (2012). Regional flood frequency analysis in eastern Australia: Bayesian GLS regression-based methods within fixed region and ROI framework: Quantile Regression vs. Parameter Regression Technique. Journal of Hydrology, DOI:10.1016/j.jhydrol.2012.02.012 (ERA, Rank A*).
Haddad, K., Rahman, A. and Stedinger, J.R. (2012). Regional Flood Frequency Analysis using Bayesian Generalized Least Squares: A Comparison between Quantile and Parameter Regression Techniques. Hydrological Processes, 26(7), 1008-1021, DOI:10.1002/hyp.8189 (ERA, Rank A).
Haddad, K., Rahman, A. and Kuczera, G. (2011). Comparison of Ordinary and Generalised Least Squares Regression Models in Regional Flood Frequency Analysis: A Case Study for New South Wales. Australian Journal of Water Resources, 15(2), 1-12 (ERA, Rank B).
Rahman, A., Haddad, K., Zaman, M., Kuczera, G. and Weinmann, P.E. (2011). Design flood estimation in ungauged catchments: A comparison between the Probabilistic Rational Method and Quantile Regression Technique for NSW. Australian Journal of Water Resources, 14(2), 127-140 (ERA, Rank B).
Haddad, K., Rahman, A. and Weinmann, P.E. (2011). Estimation of major floods: applicability of a simple probabilistic model. Australian Journal of Water Resources, 14(2), 117-126 (ERA, Rank B).
Haddad, K., Rahman, A., Weinmann, P.E., Kuczera, G. and Ball, J.E. (2010). Streamflow data preparation for regional flood frequency analysis: Lessons from south-east Australia. Australian Journal of Water Resources, 14(1), 17-32 (ERA, Rank B).
Haddad, K., Zaman, M. and Rahman, A. (2010). Regionalisation of skew for flood frequency analysis: a case study for eastern NSW. Australian Journal of Water Resources, 14(1), 33-41 (ERA, Rank B).
Haddad, K. and Rahman, A. (2010). Selection of the best fit flood frequency distribution and parameter estimation procedure – A case study for Tasmania in Australia, Stochastic Environmental Research & Risk Assessment, DOI: 10.1007/s00477-010-0412-1 (ERA, Rank B).
APPENDIX B
B.1 FURTHER RESULTS ASSOCIATED WITH VICTORIA AND
QUEENSLAND (FROM CHAPTER 5)
Table 52 Summary of the final BGLSR results for VIC
GLSR model (VIC)      Regression coefficient   Mean    St Dev   AVPO    AVPN    AIC     BIC     BPV %   R2 GLSR

Mean µ                σ²                       0.29    0.042
                      β0 (constant)            3.22    0.10                                     0
                      β1 (LN area)             0.61    0.040    0.31    0.29    0.31    0.32    0       63%
                      β2 (LN 2I12)             1.50    0.28                                     0
Standard deviation σ  σ²                       0.043   0.012
                      β0 (constant)            1.16    0.10                                     0
                      β1 (LN rain)             -0.83   0.10     0.048   0.046   0.074   0.077   1       65%
                      β2 (LN evap)             1.49    0.65                                     2
Skewness γ            σ²                       0.034   0.027
                      β0 (constant)            -0.65   0.051                                    0
                      β1 (LN rain)             0.74    0.15     0.042   0.040   0.113   0.118   1       70%
                      β2 (LN evap)             -3.25   1.26                                     1
Flood quantiles
QARI=2                σ²                       0.27    0.039
                      β0 (constant)            3.38    0.099                                    0
                      β1 (LN area)             0.90    0.089    0.28    0.27    0.29    0.30    0       63%
                      β2 (LN Itc,ARI=2)        1.35    0.32                                     0
QARI=5                σ²                       0.29    0.043
                      β0 (constant)            4.17    0.10                                    0
                      β1 (LN area)             0.92    0.098    0.31    0.30    0.32    0.33    0       61%
                      β2 (LN Itc,ARI=5)        1.32    0.35                                     0
QARI=10               σ²                       0.35    0.039
                      β0 (constant)            4.55    0.11                                     0
                      β1 (LN area)             0.94    0.055    0.37    0.35    0.38    0.39    0       57%
                      β2 (LN Itc,ARI=10)       1.42    0.35                                     0
QARI=20               σ²                       0.35    0.036
                      β0 (constant)            4.82    0.12                                     0
                      β1 (LN area)             0.97    0.066    0.37    0.35    0.40    0.41    0       57%
                      β2 (LN Itc,ARI=20)       1.50    0.36                                     0
QARI=50               σ²                       0.47    0.050
                      β0 (constant)            5.17    0.14                                     0
                      β1 (LN area)             0.99    0.073    0.49    0.47    0.53    0.56    0       49%
                      β2 (LN Itc,ARI=50)       1.62    0.42                                     4
QARI=100              σ²                       0.59    0.067
                      β0 (constant)            5.24    0.17                                     0
                      β1 (LN area)             0.98    0.075    0.60    0.60    0.60    0.64    0       45%
                      β2 (LN Itc,ARI=100)      1.63    0.46                                     5
Table 53 Summary of the final BGLSR results for QLD
GLSR model (QLD)      Regression coefficient   Mean    St Dev   AVPO    AVPN    AIC     BIC     BPV %   R2 GLSR

Mean µ                σ²                       0.23    0.032
                      β0 (constant)            4.71    0.074                                    0
                      β1 (LN area)             0.74    0.043    0.24    0.23    0.27    0.28    0       77%
                      β2 (LN 2I12)             1.97    0.15                                     0
Standard deviation σ  σ²                       0.13    0.015
                      β0 (constant)            1.37    0.10                                     0
                      β1 (LN area)             -0.025  0.032    0.13    0.13    0.20    0.20    42      35%
                      β2 (LN 2I12)             -1.41   0.13                                     2
Skewness γ            σ²                       0.015   0.014
                      β0 (constant)            -0.63   0.066                                    0
                      β1 (LN 50I72)            -0.32   0.19     0.026   0.025   0.18    0.18    8       46%
                      β2 (LN rain)             0.36    0.18                                     4
Flood quantiles
QARI=2                σ²                       0.26    0.036
                      β0 (constant)            4.80    0.079                                    0
                      β1 (LN area)             1.35    0.078    0.27    0.26    0.28    0.29    0       75%
                      β2 (LN Itc,ARI=2)        2.57    0.19                                     0
QARI=5                σ²                       0.17    0.026
                      β0 (constant)            5.77    0.080                                    0
                      β1 (LN area)             1.16    0.075    0.18    0.17    0.17    0.18    0       79%
                      β2 (LN Itc,ARI=5)        1.95    0.17                                     0
QARI=10               σ²                       0.18    0.028
                      β0 (constant)            6.25    0.079                                    0
                      β1 (LN area)             1.00    0.058    0.19    0.18    0.19    0.20    0       74%
                      β2 (LN Itc,ARI=10)       1.67    0.13                                     0
QARI=20               σ²                       0.14    0.025
                      β0 (constant)            6.59    0.10                                     0
                      β1 (LN area)             0.99    0.065    0.16    0.15    0.18    0.19    0       77%
                      β2 (LN Itc,ARI=20)       1.42    0.17                                     0
QARI=50               σ²                       0.17    0.029
                      β0 (constant)            6.97    0.094                                    0
                      β1 (LN area)             0.91    0.073    0.19    0.18    0.21    0.22    0       72%
                      β2 (LN Itc,ARI=50)       1.19    0.19                                     0
QARI=100              σ²                       0.20    0.033
                      β0 (constant)            7.23    0.099                                    0
                      β1 (LN area)             0.86    0.078    0.22    0.21    0.25    0.26    0       72%
                      β2 (LN Itc,ARI=100)      1.01    0.20                                     0
Figure 66 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, fixed region, VIC)
[Figure: standardised residual vs. fitted LN(Q20); series BGLSR-QRT and BGLSR-PRT, fixed region]

Figure 67 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, ROI, VIC)
[Figure: standardised residual vs. fitted LN(Q20); series BGLSR-QRT and BGLSR-PRT, ROI]
Figure 68 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, fixed region, VIC)
[Figure: normal score vs. standardised residual; series BGLSR-QRT and BGLSR-PRT, fixed region]

Figure 69 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, ROI, VIC)
[Figure: normal score vs. standardised residual; series BGLSR-QRT and BGLSR-PRT, ROI]
Figure 70 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, fixed region, QLD)
[Figure: standardised residual vs. fitted LN(Q20); series BGLSR-QRT and BGLSR-PRT, fixed region]

Figure 71 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, ROI, QLD)
[Figure: standardised residual vs. fitted LN(Q20); series BGLSR-QRT and BGLSR-PRT, ROI]
Figure 72 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, fixed region, QLD)
[Figure: normal score vs. standardised residual; series BGLSR-QRT and BGLSR-PRT, fixed region]

Figure 73 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, ROI, QLD)
[Figure: normal score vs. standardised residual; series BGLSR-QRT and BGLSR-PRT, ROI]
APPENDIX C
C.1 FURTHER RESULTS ASSOCIATED WITH THE LFRM (FROM
CHAPTERS 7 AND 8)
Figure 74 L-moment ratio diagram of annual maximum flood series data for VIC
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 75 L-moment ratio diagram of annual maximum flood series data for WA
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 76 L-moment ratio diagram of annual maximum flood series data for SA
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 77 L-moment ratio diagram of annual maximum flood series data for TAS
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 78 L-moment ratio diagram of annual maximum flood series data for NT
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 79 Visual inspection of distributional fit for GEV, GPA and P3 distributions for NSW
[Figure: standardised data vs. ARI (years, log scale, 1-10000); observed data with fitted GEV, GPA and P3 curves]
Figure 80 Visual inspection of distributional fit for GEV, GPA and P3 distributions for VIC
[Figure: standardised data vs. ARI (years, log scale, 1-10000); observed data with fitted GEV, GPA and P3 curves]
Figure 81 Variation of Ne with different network methods and experiment number for TAS region (top panel for real data and bottom panel for simulated data)
[Figure: Ne vs. experiment number for network methods N = 2, 4 and 8; simulated panels vary the correlation coefficient from 0 to 0.5]
Figure 82 Frequency of Ne with different network methods for TAS region (top panel for real data and bottom panel for simulated data)
[Figure: histograms of Ne for network methods N = 2, 4 and 8, for real and simulated data]
Figure 83 Variation of Ne with different network methods and experiment number for NT region (top panel for real data and bottom panel for simulated data)
[Figure: Ne vs. experiment number for network methods N = 2, 4 and 8; simulated panels vary the correlation coefficient from 0 to 0.5]
Figure 84 Frequency of Ne with different network methods for NT region (top panel for real data and bottom panel for simulated data)
[Figure: histograms of Ne for network methods N = 2, 4 and 8, for real and simulated data]
Figure 85 Variation of Ne with different network methods and experiment number for WA region (top panel for real data and bottom panel for simulated data)
[Figure: Ne vs. experiment number for network methods N = 2, 4 and 8; simulated panels vary the correlation coefficient from 0 to 0.5]
Figure 86 Frequency of Ne with different network methods for WA region (top panel for real data and bottom panel for simulated data)
[Figure: histograms of Ne for network methods N = 2, 4 and 8, for real and simulated data]
Figure 87 Variation of Ne with different network methods and experiment number for SA region (top panel for real data and bottom panel for simulated data)
[Figure: Ne vs. experiment number for network methods N = 2, 4 and 8; simulated panels vary the correlation coefficient from 0 to 0.5]
Figure 88 Selection of predictor variables for the BGLSR model for CV - WA
[Figure: R-sqd GLSR (left axis) with MEV and standard error of MEV (right axis) vs. combination of catchment characteristics 1-16]
Figure 89 Selection of predictor variables for the BGLSR model for CV using AVPO, AVPN, AIC and BIC - WA
[Figure: AVPO, AVPN, AIC and BIC vs. combination of catchment characteristics 1-16]
Figure 90 Selection of predictor variables for the BGLSR model for the mean flood - WA
[Figure: MEV, standard error of MEV and R-sqd GLSR vs. combination of catchment characteristics 1-16]
Figure 91 Selection of predictor variables for the BGLSR model for the mean flood using AVPO, AVPN, AIC and BIC - WA
[Figure: AVPO, AVPN, AIC and BIC vs. combination of catchment characteristics 1-16]
Figure 92 Selection of predictor variables for the BGLSR model for CV - TAS
[Figure: R-sqd GLSR (left axis) with MEV and standard error of MEV (right axis) vs. combination of catchment characteristics 1-16]
Figure 93 Selection of predictor variables for the BGLSR model for CV using AVPO, AVPN, AIC and BIC - TAS
[Figure: AVPO, AVPN, AIC and BIC vs. combination of catchment characteristics 1-16]
Figure 94 Selection of predictor variables for the BGLSR model for the mean flood - TAS
[Figure: MEV, standard error of MEV and R-sqd GLSR vs. combination of catchment characteristics 1-16]
Figure 95 Selection of predictor variables for the BGLSR model for the mean flood using AVPO, AVPN, AIC and BIC - TAS
[Figure: AVPO, AVPN, AIC and BIC vs. combination of catchment characteristics 1-16]
APPENDIX D
D.1 L-MOMENT RATIO DIAGRAMS AND GOODNESS-OF-FIT TEST
Hosking (1990) introduced the L-moment ratio diagram for the purpose of selecting
suitable distributions in frequency analysis. An L-moment ratio diagram compares sample
estimates of L-Skewness (LSK) and L-Kurtosis (LKT) with their population counterparts,
for a range of assumed distributions.
Hosking and Wallis (1991) presented a goodness-of-fit measure based on t4, the regional average of the sample L-kurtosis (LKT), mainly for three-parameter distributions. Since all three-parameter distributions fitted to the data will have the same L-skewness t3 on the LCV vs. LSK diagram, the quality of fit can be judged by the difference between the regional average t4 and the value τ4^DIST for the fitted distribution. The statistic Z^DIST is defined below:

Z^DIST = (t4 − τ4^DIST) / σ4        (D.1)

which is a goodness-of-fit measure, where σ4 is the standard deviation of t4. The value of σ4 can be obtained by simulation after fitting a Kappa distribution to the observations (Hosking, 1988). A fit is declared adequate if Z^DIST is sufficiently close to zero, a reasonable criterion being |Z^DIST| ≤ 1.64.
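To make Eq. (D.1) concrete, a minimal Python sketch is given below (hypothetical helper names, not part of the thesis software). The sample L-moment ratios are computed from the unbiased probability weighted moment estimators of Hosking (1990); τ4^DIST and σ4 are assumed to be supplied externally, from the fitted distribution and from Kappa-distribution simulation respectively.

```python
def sample_l_moment_ratios(x):
    """Sample L-skewness t3 and L-kurtosis t4 from unbiased
    probability weighted moments (Hosking, 1990); needs len(x) >= 4."""
    x = sorted(x)
    n = len(x)
    b = [0.0, 0.0, 0.0, 0.0]
    for j, xj in enumerate(x, start=1):
        b[0] += xj
        b[1] += xj * (j - 1) / (n - 1)
        b[2] += xj * (j - 1) * (j - 2) / ((n - 1) * (n - 2))
        b[3] += xj * (j - 1) * (j - 2) * (j - 3) / ((n - 1) * (n - 2) * (n - 3))
    b = [bi / n for bi in b]
    l2 = 2 * b[1] - b[0]                            # L-scale
    l3 = 6 * b[2] - 6 * b[1] + b[0]                 # third L-moment
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]   # fourth L-moment
    return l3 / l2, l4 / l2


def z_dist(t4_regional, tau4_dist, sigma4):
    """Goodness-of-fit statistic of Eq. (D.1); |Z| <= 1.64 is adequate."""
    return (t4_regional - tau4_dist) / sigma4
```

Here t4_regional would be the (record-length-weighted) average of the site t4 values over the region.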
D.2 ANDERSON-DARLING MONTE CARLO SIMULATION GOODNESS-
OF-FIT TEST
Given a sample xi (i = 1, . . . , n) of annual maximum flood data extracted from a
distribution FR(x), the test is used to check the null hypothesis H0 : FR(x) = F(x, θ), where
F(x, θ) is the hypothetical distribution and θ is an array of parameters estimated from the
sample xi. The Anderson-Darling (AD) goodness-of-fit test measures the departure between
the hypothetical distribution F(x, θ) and the cumulative frequency function Fn(x) defined as:
Fn(x) = 0,     x < x(1)
Fn(x) = i/n,   x(i) ≤ x < x(i+1)
Fn(x) = 1,     x ≥ x(n)        (D.2)

where x(i) is the i-th element of the ordered sample (arranged in increasing order). The test statistic is:

A² = n ∫ [Fn(x) − F(x, θ)]² Ψ(x) dF(x)        (D.3)

where Ψ(x), in the case of the AD test (Laio, 2004), is Ψ(x) = [F(x, θ)(1 − F(x, θ))]⁻¹. In practice, the statistic is calculated as:

A² = −n − (1/n) Σ_{i=1}^{n} {(2i − 1) ln[F(x(i), θ)] + (2n + 1 − 2i) ln[1 − F(x(i), θ)]}        (D.4)
The statistic A², obtained in this way, may be compared with the population of A² values that one obtains if the sample truly belongs to the hypothetical distribution F(x, θ). For the test of normality, this distribution is defined as shown in Laio (2004). For other distributions, e.g. P3 or GEV, the distribution of the test statistic can be derived using Monte Carlo simulation, as done here. The results in Table 7.1 for the AD test are reported as P-values at a significance level of 5%; hence a value of P > 0.95 indicates that the hypothesis that the particular distribution is the parent is rejected.
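The computational form (D.4) is straightforward to implement. The sketch below (illustrative function names, not from the thesis software) evaluates A² for a normal parent as an example; any other candidate distribution can be tested by passing its fitted CDF, with the Monte Carlo null distribution generated as described above.

```python
import math


def normal_cdf(x, mu=0.0, sd=1.0):
    """CDF of the normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))


def anderson_darling(sample, cdf):
    """A^2 statistic of Eq. (D.4); `cdf` is F(., theta) with theta fitted."""
    x = sorted(sample)
    n = len(x)
    s = 0.0
    for i, xi in enumerate(x, start=1):
        f = cdf(xi)
        s += (2 * i - 1) * math.log(f) + (2 * n + 1 - 2 * i) * math.log(1.0 - f)
    return -n - s / n
```

A large A², relative to its simulated null distribution, signals departure from the hypothetical parent.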
D.3 HOMOGENEITY TEST OF HOSKING AND WALLIS
The Hosking and Wallis test assesses the homogeneity of a group of catchments at three
different levels by focussing on three measures of dispersion for different orders of the
sample L-moment ratios (see Hosking (1990) for an explanation of L-moments).
A measure of dispersion for the LCV:

V1 = { Σ_{i=1}^{R} n_i (t2^(i) − t̄2)² / Σ_{i=1}^{R} n_i }^(1/2)        (D.5)

A measure of dispersion for both the LCV and the LSK coefficients in the LCV-LSK space:

V2 = Σ_{i=1}^{R} n_i [(t2^(i) − t̄2)² + (t3^(i) − t̄3)²]^(1/2) / Σ_{i=1}^{R} n_i        (D.6)

A measure of dispersion for both the LSK and the LKT coefficients in the LSK-LKT space:

V3 = Σ_{i=1}^{R} n_i [(t3^(i) − t̄3)² + (t4^(i) − t̄4)²]^(1/2) / Σ_{i=1}^{R} n_i        (D.7)

where t̄2, t̄3 and t̄4 are the group means of LCV, LSK and LKT respectively; t2^(i), t3^(i), t4^(i) and n_i are the values of LCV, LSK, LKT and the sample size for site i; and R is the number of sites in the pooling group.
The underlying concept of this test is to measure the sampling variability of the L-moment ratios and compare it with the variation that would be expected for a homogeneous group. The expected mean value and standard deviation of these dispersion measures for a homogeneous group, µVk and σVk respectively, are assessed through repeated simulations, by generating homogeneous groups of catchments having the same record lengths as those of the observed data, following the methodology proposed by Hosking and Wallis (1993). The heterogeneity measures are then evaluated using the following expression:
Hk = (Vk − µVk) / σVk ,  for k = 1, 2, 3        (D.8)
Hosking and Wallis (1993) suggested that the region or group of sites should be considered ‘acceptably homogeneous’ if H < 1, ‘possibly heterogeneous’ if 1 ≤ H < 2, and ‘definitely heterogeneous’ if H ≥ 2.
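As a sketch of how the first heterogeneity measure is computed in practice (hypothetical helper names; the simulated V values would come from repeatedly generating homogeneous regions from a fitted regional distribution, which is taken as given here):

```python
import statistics


def v1(t2_sites, n_sites):
    """Record-length-weighted dispersion of site LCV values, Eq. (D.5)."""
    total_n = sum(n_sites)
    t_bar = sum(n * t for n, t in zip(n_sites, t2_sites)) / total_n
    ss = sum(n * (t - t_bar) ** 2 for n, t in zip(n_sites, t2_sites))
    return (ss / total_n) ** 0.5


def heterogeneity(v_obs, v_sim):
    """H statistic of Eq. (D.8): the mean and standard deviation of the
    simulated V values play the roles of mu_V and sigma_V."""
    return (v_obs - statistics.mean(v_sim)) / statistics.stdev(v_sim)
```

An observed V1 near the centre of the simulated values gives H close to zero (acceptably homogeneous); H ≥ 2 flags definite heterogeneity.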
D.4 THE BOOTSTRAP ANDERSON-DARLING HOMOGENEITY TEST
The AD test is based on the comparison between the local and regional empirical distribution functions. The empirical distribution function, or sample distribution function, is defined by F(x) = j/n, x(j) ≤ x < x(j+1), where n is the sample size and x(j) are the order statistics, i.e. the observations arranged in ascending order. Denote the empirical distribution function of the i-th sample (local) by F̂i(x), and that of the pooled sample of all N = n1 + … + nk observations (regional) by HN(x).
The k-sample Anderson-Darling test statistic is then defined as:

AD = Σ_{i=1}^{k} n_i ∫ [F̂i(x) − HN(x)]² / {HN(x)[1 − HN(x)]} dHN(x)        (D.9)

where the integral is taken over all x. If the pooled ordered sample is Z1, …, ZN, the computational formula to evaluate AD is:
AD = (1/N) Σ_{i=1}^{k} (1/n_i) Σ_{j=1}^{N−1} (N Mij − j ni)² / [j(N − j)]        (D.10)
where ijM is the number of observations in the i-th sample that are not greater than jZ . The
homogeneity test can be carried out by comparing the obtained AD value to the tabulated
percentage points reported by Scholz and Stephens (1987) for the different significance
levels.
The statistic AD depends on the sample values only through their ranks. This guarantees
that the test statistic remains unchanged when the samples undergo monotonic
transformation, an important stability property not possessed by the Hosking and Wallis
(1993) heterogeneity measure. However, problems arise in applying this test in a common
index value procedure. In fact, the index value procedure corresponds to dividing each site
sample by a different value, thus modifying the ranks in the pooled sample. In particular,
this has an effect of making the local empirical functions much more similar to each other,
providing an impression of homogeneity even when the samples are highly heterogeneous.
The effect is equivalent to that encountered when applying goodness-of-fit tests to
distributions whose parameters are estimated from the same sample used for the test (e.g.
D’Agostino and Stephens, 1986 and Laio, 2004). In both cases, the percentage points for the test should be appropriately recalculated. This may be achieved with a nonparametric
bootstrap approach, which is presented in the following steps:
1. Build up the pooled sample S of the observed non-dimensional data.
2. Sample with replacement from S and generate k simulated local samples of size n1, …, nk.
3. Divide each sample by its index value and calculate AD^(1).
4. Repeat the procedure Nsim times to obtain a sample of values AD^(j), j = 1, …, Nsim, whose empirical distribution function can be used as an approximation of G_H0(AD), the distribution of AD under the null hypothesis of homogeneity.

The acceptance limits for the test, corresponding to any significance level α, are then easily determined as the quantiles of G_H0(AD) corresponding to probability (1 − α). The result is usually reported as a P-value.
D.5 GUMBEL VARIATES CORRESPONDING TO ARI
Table 54 Values of YT corresponding to ARI
ARI YT
2 0.37
5 1.50
10 2.25
20 2.97
50 3.90
100 4.60
200 5.30
500 6.21
1000 6.91
2000 7.60
3000 8.01
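The tabulated values correspond to the Gumbel reduced variate for an annual-maximum series, YT = −ln(−ln(1 − 1/T)) for ARI T in years, which reproduces Table 54 to two decimal places. A minimal check (illustrative function name):

```python
import math


def gumbel_variate(ari):
    """Gumbel reduced variate Y_T = -ln(-ln(1 - 1/T)) for ARI T (years)."""
    return -math.log(-math.log(1.0 - 1.0 / ari))


# Reproduce Table 54 to two decimal places
for t in (2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 3000):
    print(t, round(gumbel_variate(t), 2))
```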