Regional Flood Frequency Analysis in the
Range of Small to Large Floods: Development
and Testing of Bayesian Regression-based
Approaches
Khaled Haddad
A thesis submitted for the degree of Doctor of Philosophy at the
University of Western Sydney, Sydney, Australia
June 2013
PRELIMINARIES
ii
ABSTRACT
Design flood estimation in the range of frequent to medium (2–100 years) and large to
rare (greater than 100 and up to 2000 years) average recurrence intervals (ARI) is
frequently required in the design of many engineering works, such as culverts, bridges,
farm dams and spillways, as well as in land use planning and flood insurance studies.
These sorts of infrastructure works and investigations are of notable economic significance.
Design flood estimation is ideally made by adopting a flood frequency analysis technique;
however, this needs a relatively long period of recorded streamflow data. In many cases,
the recorded streamflow data are quite short or completely absent (i.e. the ungauged
catchment situation). In such cases, regional flood frequency analysis (RFFA) techniques
are usually adopted, which attempt to utilise spatial data to compensate for the lack of
temporal data on the assumption of regional homogeneity.
This thesis focuses on RFFA techniques, in particular how they can be enhanced by
adopting an ensemble of advanced statistical techniques and by minimising the error and
noise often found in flood data. This thesis uses data from 682 catchments across the
Australian continent to (i) develop prediction equations involving readily obtainable
catchment characteristics data for floods in the frequent to medium ARI range (2–100
years); (ii) investigate the validation of the developed prediction equations using the most
commonly used leave-one-out (LOO) validation and compare it with the more recent
Monte Carlo cross validation (MCCV) technique; and (iii) develop a large flood
regionalisation model (LFRM) that corrects for spatial dependence in the annual maximum
flood series (AMFS) data for flood estimation in the large to rare flood range
(100–2000 years ARI).
The first part of this thesis advocates the use of regression-based RFFA methods under the
Bayesian generalised least squares regression (BGLSR) framework. Here, the BGLSR has
been developed and tested with the quantile regression technique (QRT) and the parameter
regression technique (PRT) using 452 catchments from the east coast of Australia (namely
New South Wales (NSW), Victoria, Queensland and Tasmania). In forming the regions,
both the fixed region and region of influence (ROI) approaches have been examined in the
range of frequent to medium ARI floods.
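The BGLSR framework described above builds on the Stedinger–Tasker formulation, in which the total error covariance combines a model-error variance with the sampling-error covariance of the at-site flood statistics. A minimal numerical sketch of the underlying GLS estimator is given below; the catchment data are hypothetical and the full Bayesian treatment of the model-error variance used in the thesis is not reproduced here.

```python
import numpy as np

def gls_estimate(X, y, sampling_cov, model_error_var):
    """Generalised least squares estimate of the regression coefficients
    (Stedinger-Tasker form): the error covariance is the sum of a
    diagonal model-error variance and the sampling-error covariance
    of the at-site flood statistics."""
    n = len(y)
    # Total error covariance: model error + sampling error
    Lam = model_error_var * np.eye(n) + sampling_cov
    Lam_inv = np.linalg.inv(Lam)
    # GLS normal equations: beta = (X' L^-1 X)^-1 X' L^-1 y
    return np.linalg.solve(X.T @ Lam_inv @ X, X.T @ Lam_inv @ y)

# Toy example: log flood quantile vs log catchment area for five
# hypothetical catchments
X = np.column_stack([np.ones(5), np.log([50, 120, 300, 800, 1500])])
y = np.array([2.1, 2.6, 3.0, 3.5, 3.9])        # log10 flood quantiles
S = np.diag([0.02, 0.01, 0.015, 0.03, 0.02])   # sampling error variances
beta = gls_estimate(X, y, sampling_cov=S, model_error_var=0.01)
```

In practice the sampling covariance matrix is built from the record lengths and inter-site correlations of the gauged sites, and the model-error variance is inferred rather than fixed.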
A LOO validation indicated that the ROI approach based on minimising the predictive
uncertainty leads to more efficient and accurate flood quantile estimates in both the QRT
and PRT regional frameworks. The regression diagnostics reveal that the catchment
characteristics variables alone may not capture all the heterogeneity in the regional model,
and that the formation of ROI sub-regions can reduce the heterogeneity to an acceptable level.
Both the BGLSR-based QRT-ROI and PRT-ROI methods reduce regional
heterogeneity, with an increase in the average pseudo coefficient of determination and
decreases in the model error variance, the average variance of prediction and the average
standard error of prediction. Based on the evaluation statistics, overall only modest
differences have been found between the QRT-ROI and PRT-ROI regional
frameworks. The developed RFFA methods based on QRT-ROI and PRT-ROI allow
design flood estimation, along with the associated uncertainty (in the form of confidence
limits), to be made with a relatively high degree of accuracy.
The second part of this thesis looks at the detailed validation of regional hydrological
regression models by investigating the popular LOO validation and the relatively new
MCCV procedure using 96 catchments from the state of NSW. In this regard, both
ordinary least squares regression (OLSR) and GLSR have been tested for the estimation of
flood quantiles using simulated and observed regional flood data. From the simulation and
real-data examples, it has been found that when developing regional hydrologic regression
models, the GLSR-based MCCV procedure is likely to result in a more parsimonious
model than the OLSR-based LOO, OLSR-based MCCV and GLSR-based LOO
procedures.
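The distinction between the two validation schemes can be sketched as follows: LOO withholds one catchment at a time, whereas MCCV repeatedly draws random train/validation splits with a larger validation fraction, which tends to penalise over-fitted models more heavily. The illustration below uses ordinary least squares on synthetic data; the GLSR weighting applied in the thesis is omitted for brevity, and all data here are hypothetical.

```python
import numpy as np

def fit_predict(X_tr, y_tr, X_te):
    """OLS fit via least squares; returns predictions on the test design."""
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ beta

def loo_msep(X, y):
    """Leave-one-out: each catchment is withheld exactly once."""
    errs = []
    for i in range(len(y)):
        mask = np.ones(len(y), dtype=bool)
        mask[i] = False
        errs.append((y[i] - fit_predict(X[mask], y[mask], X[~mask]))[0] ** 2)
    return np.mean(errs)

def mccv_msep(X, y, n_splits=200, test_frac=0.3, seed=1):
    """Monte Carlo cross-validation: repeated random train/test splits
    with a larger validation fraction than LOO."""
    rng = np.random.default_rng(seed)
    n, n_te = len(y), int(test_frac * len(y))
    errs = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        te, tr = idx[:n_te], idx[n_te:]
        errs.append(np.mean((y[te] - fit_predict(X[tr], y[tr], X[te])) ** 2))
    return np.mean(errs)

# Hypothetical regional dataset: log flood quantile from two predictors
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(96), rng.normal(size=(96, 2))])
y = X @ np.array([3.0, 0.7, 0.3]) + rng.normal(scale=0.2, size=96)
print(loo_msep(X, y), mccv_msep(X, y))
```

Comparing the two estimates of the mean squared error of prediction across candidate predictor sets is what drives the model-selection comparison reported above.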
The third part of this thesis proposes a simple LFRM that accounts for spatial dependence
in the AMFS data for estimating large to rare floods. To carry this out, a comprehensive
dataset of 654 stations from across the Australian continent has been used.
The new LFRM is easy to use and offers an alternative to the traditional rainfall-based
methods. The development and application of the simplified LFRM for the Australian
continent consist of three major steps: (i) pooling the top 1 to 5 annual maximum flood
values from the member sites in a region; (ii) developing a new spatial dependence model
to correct for spatially correlated data; and (iii) applying the LFRM to ungauged catchments
by coupling it with the BGLSR – ROI technique to estimate the mean and coefficient of
variation (CV) of AMFS data.
To this end a simple model for the effective number of independent stations (Ne) has been
developed that ignores possible variation with ARI. Meaningful results regarding spatial
dependence have been established by undertaking the analysis on simulated datasets to
counteract sampling and homogeneity issues.
Overall, the experimental results of the analysis show that, in general, spatial dependence
decreases with larger network size and that some Australian states exhibit more spatial
dependence than others. While there are some limitations with this analysis, a reasonable
indication of the behaviour of Ne has been established. The derived generalised spatial
dependence model has then been used with the LFRM to correct for the spatial dependence
by adjusting the plotting position points of the LFRM frequency distribution curve.
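To illustrate how an effective number of independent stations can feed into a plotting-position adjustment, the sketch below uses the standard equicorrelation approximation Ne = N / (1 + (N − 1)ρ̄) together with a Cunnane-type plotting position evaluated with the dependence-adjusted record length. Both forms are assumed here for illustration only; they are not the generalised spatial dependence model fitted in the thesis.

```python
import numpy as np

def effective_stations(n, rho_bar):
    """Equicorrelation approximation to the effective number of
    independent stations in a network of n sites with average
    inter-site correlation rho_bar (illustrative assumption only)."""
    return n / (1.0 + (n - 1.0) * rho_bar)

def plotting_positions(n_pooled, ne_years, alpha=0.4):
    """Cunnane-type plotting positions for the largest n_pooled pooled
    maxima, using the effective (dependence-adjusted) record length
    ne_years in place of the nominal number of station-years."""
    ranks = np.arange(1, n_pooled + 1)
    # Exceedance probability of the i-th largest pooled value
    return (ranks - alpha) / (ne_years + 1.0 - 2.0 * alpha)

ne = effective_stations(n=50, rho_bar=0.2)          # about 4.6 of 50 sites
p = plotting_positions(n_pooled=10, ne_years=ne * 30)
```

The effect of the correction is to stretch the empirical frequency curve: strongly correlated networks carry fewer effective station-years, so the largest pooled floods are assigned less extreme exceedance probabilities than under an independence assumption.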
An independent validation has shown that the developed LFRM is able to estimate design
floods for 100 to 1000 years ARIs with reasonable confidence as compared with at-site
flood frequency analysis results, other regional flood models and the world model. Overall,
the newly developed LFRM, which corrects for spatially correlated data and is coupled
with the BGLSR-ROI approach, offers a powerful yet simple method of regional flood
estimation in the large to rare ARI range.
COPYRIGHT STATEMENT
‘I hereby grant the University of Western Sydney or its agents the right to archive and to
make available my thesis or dissertation in whole or part in the University libraries in all
forms of media, now or hereafter known, subject to the provisions of the Copyright Act
1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in
future works (such as articles or books) all or part of this thesis or dissertation. I have either
used no substantial portions of copyright material in my thesis or I have obtained
permission to use copyright material; where permission has not been granted I have
applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’
Signed Khaled Haddad
STATEMENT OF AUTHENTICATION
‘I hereby declare that the work presented in this thesis is solely my own work and that to
the best of my knowledge the work is original except where otherwise indicated by
references to other authors or works. No part of this thesis has been submitted for any other
degree or diploma.’
Signed Khaled Haddad
ACKNOWLEDGMENTS
Firstly, I would like to acknowledge the contribution of my supervisor, Dr Ataur Rahman,
in providing direction, advice and encouragement over the last three and a half years. I
really appreciate your support.
I also appreciate the advice and friendship of other academics and researchers in the School
of Computing, Engineering and Mathematics at UWS. In particular, thanks to Associate
Professor Surendra Shrestha and Associate Professor Chin Leo.
The advice and friendship of colleagues from other universities and industry are also
gratefully acknowledged: in particular, Mr Erwin Weinmann of Monash University for his
constructive comments, valuable guidance, advice and encouragement throughout this
research, and Professor George Kuczera, Associate Professor James Ball, Mr Mark Babister,
Mr Robert French and Dr William Weeks for their suggestions and input. A
special thanks goes to Dr Nanda Nandakumar for his helpful advice on various aspects of
spatial dependence and large flood estimation.
I would also like to acknowledge the various government departments throughout Australia
that helped by providing the streamflow data for this study. Without their timely support
this research would not have been completed on time.
To my fellow PhD students, thanks for all your help, fun times and the support of knowing
we’re not alone through the ups and downs. Thank you to my parents and family for
teaching me the value of education and hard work, which gave me the confidence to
embark on this mission.
TABLE OF CONTENTS
ABSTRACT......................................................................................................................... II
COPYRIGHT STATEMENT............................................................................................ V
STATEMENT OF AUTHENTICATION........................................................................VI
ACKNOWLEDGMENTS ............................................................................................... VII
TABLE OF CONTENTS ...............................................................................................VIII
LIST OF FIGURES ........................................................................................................ XVI
LIST OF TABLES ........................................................................................................ XXII
COMMON NOTATIONS.............................................................................................XXV
ABBREVIATIONS.....................................................................................................XXVII
CHAPTER 1: INTRODUCTION....................................................................................... 1
1.1 GENERAL ................................................................................................................................1
1.2 BACKGROUND.......................................................................................................................1
1.3 THE NEED FOR THIS RESEARCH.......................................................................................7
1.4 RESEARCH QUESTIONS.......................................................................................................8
1.5 MAJOR TASKS........................................................................................................................9
1.6 CONTRIBUTIONS OF THIS RESEARCH TO THE UNDERSTANDING OF THE RFFA
PROBLEM....................................................................................................................................10
1.7 OUTLINE OF THE THESIS AND CHAPTER INTRODUCTIONS....................................10
CHAPTER 2: REVIEW OF REGIONAL FLOOD FREQUENCY ANALYSIS
TECHNIQUES, MODEL VALIDATION AND LARGE FLOODS ............................ 15
2.1 GENERAL ..............................................................................................................................15
2.2 BASIC ISSUES.......................................................................................................................15
2.2.1 REGIONAL FLOOD FREQUENCY ANALYSIS .............................................................15
2.2.2 REGIONAL HOMOGENEITY.........................................................................................16
2.2.3 INTER – SITE DEPENDENCE .......................................................................................17
2.2.4 DISTRIBUTIONAL CHOICES ........................................................................................18
2.3 METHODS FOR IDENTIFICATION OF HOMOGENEOUS REGIONS............................19
2.4 REGIONAL FLOOD FREQUENCY ANALYSIS METHODS – DIFFERENT
APPROACHES.............................................................................................................................21
2.4.1 INDEX FLOOD METHOD..............................................................................................21
2.4.2 STATION YEAR METHOD .............................................................................................24
2.4.3 BAYESIAN ANALYSIS AND MONTE CARLO METHODS ............................................24
2.4.4 PROBABILISTIC RATIONAL METHOD AS USED IN AUSTRALIA .............................25
2.5 QUANTILE AND PARAMETER REGRESSION TECHNIQUES ......................................27
2.5.1 INTRODUCTION ............................................................................................................27
2.5.2 GENERALISED LEAST SQUARES AND WEIGHTED LEAST SQUARES REGRESSION
..................................................................................................................................................29
2.5.3 PREVIOUS APPLICATION OF GENERALISED LEAST SQUARES AND BAYESIAN
GENERALISED LEAST SQUARES REGRESSION .................................................................30
2.6 FIXED REGIONS AND THE REGION OF INFLUENCE IN REGIONAL FLOOD
FREQUENCY ANANALYS........................................................................................................33
2.6.1 FORMATION OF REGIONS...........................................................................................33
2.6.2 REGION OF INFLUENCE VS FLEXIBLE REGION......................................................33
2.7 MODEL VALIDATION IN HYDROLOGICAL REGRESSION ANALYSIS.....................36
2.7.1 HISTORY OF MODEL VALIDATION ............................................................................37
2.7.2 PREVIOUS APPLICATIONS OF LEAVE-ONE-OUT VALIDATION IN HYDROLOGY38
2.8 REGIONAL FLOOD FREQUENCY FOR LARGE TO RARE FLOODS............................40
2.8.1 BRIEF REVIEW OF LARGE FLOOD ESTIMATION AND PREVIOUS
APPLICATIONS .......................................................................................................................40
2.9 IMPACT OF CLIMATE CHANGE ON FLOOD FREQUENCY ANALYSIS.....................44
2.10 SUMMARY ..........................................................................................................................45
CHAPTER 3: ADOPTED STATISTICAL TECHNIQUES FOR REGIONAL
FLOOD FREQUENCY ANALYSIS AND MODEL VALIDATION........................... 47
3.1 GENERAL ..............................................................................................................................47
3.2 AT-SITE FLOOD FREQUENCY ANALYSIS......................................................................49
3.2.1 BASICS OF AT-SITE FLOOD FREQUENCY ANALYSIS...............................................49
3.2.2 FLIKE SOFTWARE FOR AT-SITE FFA .........................................................................50
3.2.3 LOG PEARSON TYPE 3 (LP3) DISTRIBUTION............................................................50
3.3 THE CLASSICAL GLS REGRESSION PROBLEM ............................................................51
3.3.1 GLSR, THE STEDINGER AND TASKER MODEL .........................................................53
3.4 BAYESIAN METHODOLOGY.............................................................................................55
3.4.1 CLASSICAL BAYESIAN INFERENCE............................................................................56
3.5 BAYESIAN GLS REGRESSION ..........................................................................................56
3.5.1 APPROACH ADOPTED IN THIS STUDY FOR THE QUANTILE AND PARAMETER
REGRESSION TECHNIQUES .................................................................................................56
3.5.2 ADOPTED BAYESIAN REGRESSION APPROACH – PRIOR FOR THE β
COEFFICIENTS.......................................................................................................................59
3.5.3 ANALYTICAL SOLUTION TO BAYESIAN APPROACH FOR THE POSTERIOR OF
THE MODEL ERROR VARIANCE...........................................................................................60
3.5.4 PRIORS FOR THE PARAMETERS AND THE QUANTILES OF THE LP3
DISTRIBUTION........................................................................................................................62
3.6 SELECTING PREDICTOR VARIABLES ............................................................................64
3.6.1 AVERAGE VARIANCE OF PREDICTION .....................................................................64
3.6.2 BAYESIAN AND AKAIKE INFORMATION CRITERIA..................................................65
3.6.3 BAYESIAN PLAUSIBILITY VALUE .................................................................65
3.6.4 COEFFICIENT OF DETERMINATION .........................................................................66
3.6.5 OTHER MODEL SELECTION CRITERIA......................................................................66
3.7 FORMATION OF REGIONS.................................................................................................67
3.8 REGRESSION DIAGNOSTICS.............................................................................................69
3.8.1 STANDARD ERROR OF PREDICTION .........................................................................70
3.8.2 RESIDUAL ANALYSIS ....................................................................................................70
3.8.3 COOK’S DISTANCE .......................................................................................................71
3.9 EVALUATION STATISTICS................................................................................71
3.10 REGIONAL UNCERTAINTY WITH FLOOD QUANTILE ESTIMATION.....................72
3.10.1 THE MULTIVARIATE NORMAL DISTRIBUTION.......................................................73
3.11 VALIDATION OF REGIONAL HYDROLOGICAL REGRESSION MODELS –
METHODOLOGY........................................................................................................................76
3.11.1 THE HYDROLOGICAL REGRESSION PROBLEM .....................................................76
3.11.2 MODEL SELECTION BY MONTE CARLO CROSS VALIDATION .............................78
3.11.3 ESTIMATING MSEP .....................................................................................................80
3.11.4 APPLICATION – USING SIMULATED DATA.............................................................81
3.11.5 OBSERVED REGIONAL FLOOD DATA FROM NSW, AUSTRALIA ..........................83
3.12 SUMMARY ..........................................................................................................................84
CHAPTER 4: STUDY AREA AND PREPARATION OF STREAMFLOW AND
CATCHMENT CHARACTERISTICS DATA ............................................. 85
4.1 GENERAL ..............................................................................................................................85
4.1.1 PUBLICATIONS..............................................................................................................86
4.2 STUDY AREA........................................................................................................................86
4.3 SELECTION OF CANDIDATE CATCHMENTS.................................................................87
4.4 STREAMFLOW DATA PREPARATION.............................................................................89
4.4.1 FILLING MISSING RECORDS IN ANNUAL MAXIMUM FLOOD SERIES..................89
4.4.2 TREND ANALYSIS ..........................................................................................................89
4.4.3 RATING CURVE ERROR AND IDENTIFICATION .......................................................90
4.4.4 SENSITIVITY ANALYSIS AND IMPACT OF RATING CURVE EXTRAPOLATION ON
FLOOD QUANTILE ESTIMATES............................................................................................92
4.4.5 TESTS FOR OUTLIERS ..................................................................................................94
4.5 RESULTS OF STREAMFLOW DATA PREPARATION PROCESS ..................................95
4.5.1 DATA PREPARATION FOR VICTORIA.........................................................................95
4.5.2 DATA PREPARATION FOR NSW AND ACT .................................................................99
4.5.3 SENSITIVITY ANALYSIS - IMPACT OF RATING CURVE ERROR ON FLOOD
QUANTILE ESTIMATES........................................................................................................102
4.6 SUMMARY RESULTS OF STREAMFLOW DATA PREPARATION FOR THE OTHER
STATES ......................................................................................................................................104
4.6.1 TASMANIA ....................................................................................................................104
4.6.2 QUEENSLAND..............................................................................................................105
4.6.3 SOUTH AUSTRALIA .....................................................................................................105
4.6.4 NORTHERN TERRITORY .............................................................................................105
4.6.5 WESTERN AUSTRALIA ................................................................................................105
4.6.6 SUMMARY OF STREAMFLOW DATA AUSTRALIA WIDE ........................................105
4.7 SELECTION AND ABSTRACTION OF CATCHMENT CHARACTERISTICS .............108
4.8 SUMMARY ..........................................................................................................................112
CHAPTER 5: RESULTS – RFFA BASED ON FIXED REGIONS AND REGION OF
INFLUENCE APPROACHES UNDER THE QUANTILE AND PARAMETER
REGRESSION FRAMEWORKS .................................................................................. 114
5.1 GENERAL ............................................................................................................................114
5.1.1 PUBLICATIONS............................................................................................................114
5.2 RESULTS FOR TASMANIA...............................................................................................115
5.2.1 SELECTING PREDICTOR VARIABLES WITH QRT AND PRT ..................................115
5.2.2 PSEUDO ANOVA WITH QRT AND PRT MODELS FOR THE FIXED AND ROI
REGIONS................................................................................................................................119
5.2.3 ASSESSMENT OF MODEL ASSUMPTIONS AND REGRESSION DIAGNOSTICS ......122
5.2.4 POSSIBLE SUBREGIONS IN TASMANIA....................................................................127
5.2.5 EVALUATION STATISTICS..........................................................................................128
5.3 SECTION SUMMARY ........................................................................................................130
5.4 RESULTS FOR NEW SOUTH WALES, VICTORIA AND QUEENSLAND ...................130
5.4.1 SELECTING PREDICTOR VARIABLES WITH QRT AND PRT ..................................131
5.5 REGION OF INFLUENCE VS. FIXED REGIONS FOR PARAMETER AND QUANTILE
REGRESSION TECHNIQUES ..................................................................................................140
5.5.1 REGRESSION DIAGNOSTICS – PSEUDO ANALYSIS OF VARIANCE......................140
5.5.2 REGRESSION DIAGNOSTICS – MODEL ADEQUACY AND OUTLIER ANALYSIS
................................................................................................................................................144
5.5.3 DIAGNOSTIC STATISTICS...........................................................................................148
5.5.4 EVALUATION STATISTICS..........................................................................................152
5.6 SECTION SUMMARY ........................................................................................................156
5.7 UNCERTAINTY ESTIMATION FOR NEW SOUTH WALES, VICTORIA,
QUEENSLAND AND TASMANIA IN A ROI-PRT FRAMEWORK......................................157
5.8 SUMMARY ..........................................................................................................................160
CHAPTER 6: RESULTS - MODEL VALIDATION USING LOO AND MCCV .... 161
6.1 GENERAL ............................................................................................................................161
6.1.1 PUBLICATIONS............................................................................................................161
6.2 RESULTS .............................................................................................................................162
6.2.1 PREDICTORS USED ....................................................................................................162
6.2.2 SIMULATED DATA.......................................................................................................164
6.2.3 APPLICATION WITH OBSERVED REGIONAL FLOOD DATA IN NSW ...................169
6.3 SUMMARY ..........................................................................................................................175
CHAPTER 7: BACKGROUND AND DEVELOPMENT OF THE LARGE FLOOD
REGIONALISATION MODEL AND ISSUES RELATING TO SPATIAL
DEPENDENCE................................................................................................................ 177
7.1 GENERAL ............................................................................................................................177
7.1.1 PUBLICATIONS............................................................................................................177
7.2 LFRM CONCEPT.................................................................................................................178
7.3 INTER-SITE DEPENDENCE IN GENERAL FOR THE LFRM........................................178
7.4 ANNUAL MAXIMUM DATA SET USED IN THE LFRM...............................................182
7.4.1 QUALITY CHECK OF THE LARGEST ANNUAL MAXIMA DATA.............................183
7.5 IDENTIFICATION OF AN APPROPRIATE PROBABILITY DISTRIBUTION AND
TESTING FOR HOMOGENEITY OF ANNUAL MAXIMA FLOOD DATA...........................184
7.5.1 SEARCHING FOR AN APPROPRIATE PROBABILITY DISTRIBUTION...................184
7.5.2 GOODNESS-OF-FIT TEST RESULTS..........................................................................185
7.6 HOMOGENEITY .................................................................................................................190
7.6.1 HOMOGENEITY TEST OF HOSKING AND WALLIS .................................................190
7.6.2 THE BOOTSTRAP ANDERSON-DARLING HOMOGENEITY TEST ..........................191
7.6.3 TESTING FOR HOMOGENEITY – RESULTS..............................................................191
7.7 DEVELOPMENT OF THE LFRM MODEL FOR AUSTRALIAN FLOOD DATA..........192
7.7.1 DEVELOPMENT AND CALIBRATION OF THE LFRM MODEL...............................193
7.8 EFFECTS OF INTER-SITE DEPENDENCE ON THE LFRM MODEL............................201
7.8.1 EFFECTIVE NUMBER OF INDEPENDENT STATIONS ............................................201
7.8.2 REGIONAL MAXIMUM FLOOD AT A NETWORK OF SITES - REGIONAL
MAXIMUM AND TYPICAL CURVES....................................................................................202
7.8.3 FACTORS INFLUENCING THE REGIONAL MAXIMUM ..........................................203
7.8.4 NUMBER OF SITES, N .................................................................................................203
7.8.5 CROSS CORRELATION................................................................................................204
7.8.6 DEFINITION OF A REGION FOR ANALYSIS.............................................................205
7.8.7 METHODS OF SAMPLING REGIONAL MAXIMA......................................................205
7.8.8 ROI AND RANDOM ROI NETWORK METHODS .......................................................205
7.8.9 THE TOTAL RANDOM NETWORK METHOD............................................................206
7.8.10 COMPARING SAMPLING METHODS ......................................................................206
7.9 MEASURES OF Ne – EFFECTIVE NUMBER OF INDEPENDENT STATIONS ..............207
7.9.1 EFFECTIVE NUMBER OF INDEPENDENT STATIONS, Ne ......................................207
7.9.2 A SIMPLE MODEL FOR Ne ..........................................................................................209
7.9.3 FITTING Ne BY THE MEAN .........................................................................................210
7.10 SIMULATED DATASETS ................................................................................................211
7.10.1 SYNTHETIC DATA GENERATION ............................................................................211
7.11 SUMMARY ........................................................................................................................215
CHAPTER 8: APPLICATION OF LFRM IN THE LIGHT OF SPATIAL
DEPENDENCE – RESULTS AND DISCUSSION....................................................... 217
8.1 GENERAL ............................................................................................................................217
8.2 RESULTS FOR Ne ................................................................................................................217
8.3 A CLOSER LOOK AT THE BEHAVIOUR OF Ne ..............................................................219
8.4 GENERALISING THE Ne MODEL......................................................................................223
8.4.1 CONSTANT Ne MODEL – AN EMPIRICAL RELATIONSHIP FOR Ne BASED ON
AVERAGE CORRELATION COEFFICIENT (ρ) ....................................223
8.4.2 FURTHER DISCUSSION..............................................................................................230
8.5 COMPARISON OF THE EFFECTIVE RECORD LENGTH ESTIMATES USING THE
CONSTANT Ne MODEL FOR THE REAL AND SIMULATED DATASETS .........................230
8.6 REVISITING THE LFRM IN THE LIGHT OF SPATIAL DEPENDENCE ......................231
8.7 APPLICATION OF THE LFRM MODEL TO UNGAUGED CATCHMENTS.................240
8.7.1 DERIVATION OF PRIORS FOR THE MEAN FLOOD AND CV.................................240
8.7.2 ESTIMATION OF THE ERROR COVARIANCE MATRIX – ESTIMATION OF THE
SAMPLING ERROR VARIANCE............................................................................................241
8.7.3 ESTIMATION OF THE SAMPLING ERROR – INTER-SITE CORRELATION............242
8.7.4 SOME ISSUES ASSOCIATED WITH REGIONAL ESTIMATION OF CV....................243
8.7.5 SELECTION OF PREDICTOR VARIABLES ................................................................244
8.7.6 BGLSR RESULTS FOR MEAN AND CV ......................................................................244
8.7.7 BGLSR RESULTS FOR MEAN AND CV MODELS USING ROI .................................248
8.8 VALIDATION......................................................................................................................251
8.9 SUMMARY ..........................................................................................................................257
CHAPTER 9: CONCLUSIONS ..................................................................................... 259
9.1 INTRODUCTION.................................................................................................................259
9.2 OVERVIEW OF THE STUDY ............................................................................................260
9.2.1 DATA SELECTION (CHAPTER 4) ...............................................................................260
9.2.2 RFFA IN THE FREQUENT TO MEDIUM ARI RANGE (CHAPTER 5) ......................261
9.2.3 MCCV VS LOO (CHAPTER 6)......................................................................................261
9.2.4 LARGE TO RARE FLOOD ESTIMATION (CHAPTERS 7 and 8)................................261
9.3 CONCLUSIONS...................................................................................................................262
9.3.1 DESIGN FLOOD ESTIMATION IN THE FREQUENT TO MEDIUM ARI RANGE....262
9.3.2 VALIDATION OF REGIONAL HYDROLOGICAL REGRESSION MODELS..............263
9.3.3 LARGE TO RARE FLOOD ESTIMATION....................................................................263
9.4 LIMITATIONS AND SUGGESTIONS FOR FUTURE RESEARCH ................................264
REFERENCES ..................................................................................... 268
APPENDIX A................................................................................................................... 288
A.1 PUBLISHED PAPERS FROM THIS RESEARCH ............................................................288
APPENDIX B ................................................................................................................... 289
B.1 FURTHER RESULTS ASSOCIATED WITH VICTORIA AND QUEENSLAND (FROM
CHAPTER 5) ..............................................................................................................................289
APPENDIX C................................................................................................................... 295
C.1 FURTHER RESULTS ASSOCIATED WITH THE LFRM (FROM CHAPTERS 7 AND 8)
.....................................................................................................................................................295
APPENDIX D................................................................................................................... 306
D.1 L-MOMENT RATIO DIAGRAMS AND GOODNESS-OF-FIT TEST.............................306
D.2 ANDERSON-DARLING MONTE CARLO SIMULATION GOODNESS-OF-FIT TEST
.....................................................................................................................................................307
D.3 HOMOGENEITY TEST OF HOSKING AND WALLIS ...................................................308
D.4 THE BOOTSTRAP ANDERSON-DARLING HOMOGENEITY TEST...........................309
D.5 GUMBEL VARIATES CORRESPONDING TO ARI........................................................311
LIST OF FIGURES
Figure 1 Flash flooding in Emerald Central Queensland (Oncirculation, 2011) ................ 2
Figure 2 Flow chart showing statistical techniques/ methods adopted in this thesis........... 48
Figure 3 Example of ROI techniques applied in this study ................................................. 69
Figure 4 Use of multivariate normal distribution to develop confidence limits by Monte
Carlo simulation........................................................................................................... 75
Figure 5 Plot of the selected study area (i.e. NSW, VIC, QLD and TAS) .......................... 87
Figure 6 Plot of rating ratios (RR) for station 222202......................................................... 92
Figure 7 Rating curve extension error ................................................................................. 94
Figure 8 (a) Time series plot showing significant trends after 1995 and (b) CUSUM test
plot showing significant trends after 1995. Here Vk is the CUSUM test statistic defined in
McGilchrist and Woodyer (1975).................................. 96
Figure 9 Histogram of rating ratios of annual maximum flood data in Victoria (stations
with record lengths > 25 years).................................................................................... 97
Figure 10 Distributions of streamflow record lengths of the selected 131 stations from
Victoria ........................................................................................................................ 98
Figure 11 Distributions of catchment areas of the 131 catchments from Victoria .............. 99
Figure 12 Histogram of rating ratios for 106 stations from NSW ..................................... 100
Figure 13 Distributions of streamflow record lengths of the selected 96 stations from NSW
.................................................................................................................................... 101
Figure 14 Distributions of catchment areas of the 96 catchments from NSW .................. 101
Figure 15 (a) Distribution of annual maximum flood record lengths of 682 stations from all
over Australia (b) Distribution of catchment areas of 682 stations from all over
Australia..................................................................................................................... 106
Figure 16 Geographical distributions of the selected 682 stations from all over Australia107
Figure 17 Selection of predictor variables for the BGLSR model for Q10 (QRT, fixed region
Tasmania), MEV = model error variance, AVPO = average variance of prediction
(old), AVPN = average variance of prediction (new), AIC = Akaike information
criterion, BIC = Bayesian information criterion (note that R²GLSR uses the right-hand
axis) ..... 118
Figure 18 Selection of predictor variables for the BGLSR model for skew...................... 119
Figure 19 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, fixed region, Tasmania) ............................................................................. 123
Figure 20 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, ROI, Tasmania).......................................................................................... 123
Figure 21 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, fixed region, Tasmania).................................................................................... 124
Figure 22 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, ROI, Tasmania) ................................................................................................ 124
Figure 23 Cook’s distance (Di) for locating outlier sites for skew model based on variable
combination 4............................................................................................................. 125
Figure 24 Spatial variations of the grouped minimum model error variances for Tasmania
(a) mean flood model and (b) skew model ................................................................ 128
Figure 25 Selection of predictor variables for the BGLSR model for the skew (note that
R²GLSR uses the right-hand axis)................................................... 133
Figure 26 Selection of predictor variables for the BGLSR model for Q10 model (note that
R²GLSR uses the right-hand axis), (QRT, fixed region NSW), MEV = model error variance,
AVPO = average variance of prediction (old), AVPN = average variance of prediction
(new) AIC = Akaike information criteria, BIC = Bayesian information criteria....... 135
Figure 27 Plots of the standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, fixed region and ROI, NSW) ..................................................................... 144
Figure 28 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, fixed region, ROI, NSW).................................................................................. 146
Figure 29 Plots of the standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, ROI and PRT-ROI with weighted average standard deviation and skew,
NSW) ......................................................................................................................... 147
Figure 30 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, ROI, and PRT ROI with weighted average standard deviation and skew, NSW)
.................................................................................................................................... 148
Figure 31 Spatial variations of the grouped minimum model error variances for (a) mean
flood model and (b) number of sites which produced the lowest predictive variance for
the mean flood model................................................................................................. 152
Figure 32 Boxplots of Qpred/Qobs ratios for NSW for QRT and PRT, with fixed and ROI
regions........................................................................................................................ 155
Figure 33 Design flood quantile estimation and confidence limits curves for ARIs of 2 to
100 years .................................................................................................................... 159
Figure 34 The mean squared error of prediction (MSEP) associated with LOO and MCCV
for OLSR and GLSR simulations .............................................................................. 167
Figure 35 Prediction error plot for Q10 results (models selected by OLSR and GLSR LOO
and models selected by OLSR and GLSR MCCV) ................................................... 172
Figure 36 Prediction error plot for Q100 results (models selected by OLSR and GLSR LOO
and models selected by OLSR and GLSR MCCV) ................................................... 174
Figure 37 Occurrences of the highest floods – data from NSW, QLD, VIC and TAS are
combined (only the highest value from each station’s AMFS data is taken to form the
LFRM data series)...................................................................................................... 180
Figure 38 Cross-correlation between two nearby Victorian stations 221201 and
221207 (considering all concurrent AMFS data over the period of records – only 21
data points are concurrent for the pair of stations) .................................................... 180
Figure 39 Relationship between the cross-correlations among AMFS data and distance
between pairs of stations in Victoria.......................................................................... 182
Figure 40 Geographical distribution of the 28 validation catchments for the LFRM ....... 183
Figure 41 L-moment ratio diagrams of annual maximum flood data for NSW and QLD 186
Figure 42 Visual inspection of distributional fit for GEV, GPA and P3 distributions for WA
and TAS ..................................................................................................................... 189
Figure 43 Scatter of Qmax/mean data in the (CV(Q), Qmax/mean) plane and non-linear
interpolation function................................................................................. 195
Figure 44 Scattering of Ymax data in the (CV(Q), Ymax) plane and linear interpolation
function for the pooling of 1 (1 max) and 5 (5 max) top maxima ............................. 197
Figure 45 Frequency distribution of the standardised Ymax values.................................... 200
Figure 46 Average concurrent record lengths for different network sizes ........................ 204
Figure 47 Example plot of regional maximum and typical growth curves and the effective
number of independent stations on a Gumbel plot for a random network of 2 and 4
gauging sites in Tasmania.......................................................................................... 209
Figure 48 Example plot of generated data with different constant correlation coefficients
for the state of Tasmania............................................................................................ 213
Figure 49 Variation of Ne with different network methods and experiment number for
NSW+QLD+VIC region (top panel for real data and bottom panel for simulated data)
.................................................................................................................................... 221
Figure 50 Frequency of Ne with different network methods for NSW+QLD+VIC region
(top panel for real data and bottom panel for simulated data) ................................... 221
Figure 51 Regression results of the N = 2 network combining the lnNe/lnN ratio values for
all the Australian states/regions and experiments...................................................... 225
Figure 52 Regression results of the N = 4 network combining the lnNe/lnN ratio values for
all the Australian states/regions and experiments...................................................... 226
Figure 53 Regression results of the N = 8 network combining the lnNe/lnN ratio values for
all the Australian states/regions and experiments...................................................... 227
Figure 54 Comparison of directly computed Ne from the AMFS data and Ne by the constant
Ne model..................................................................................................................... 229
Figure 55 Variation with number of sites: effective record lengths estimated using real and
simulated Ne models as a function of average correlation coefficient ....................... 231
Figure 56 Frequency distribution of standardised Ymax values using N and Ne stations..... 234
Figure 57 Various Qmax/mean quantiles derived from the LFRM_Ne model and PM (World)
model.......................................................................................................................... 236
Figure 58 Empirical frequency distributions of Q/mean quantiles derived from the
LFRM_N and LFRM_Ne for different ranges of CV ................................................ 239
Figure 59 Relationship between CV and catchment area .................................................. 243
Figure 60 Selection of predictor variables for the BGLSR model for CV ........................ 246
Figure 61 Selection of predictor variables for the BGLSR model for CV using AVPO,
AVPN, AIC and BIC ................................................................................................. 246
Figure 62 Selection of predictor variables for the BGLSR model for the mean flood...... 247
Figure 63 Selection of predictor variables for the BGLSR model for the mean flood using
AVPO, AVPN, AIC and BIC .................................................................................... 248
Figure 64 Prior and posterior pdf's for the model error variance for CV (right) and the mean
flood (left) models for NSW state.............................................................................. 251
Figure 65 Confidence interval plot of BIASr values with the LFRM_Ne and PM (world)
models for the 28 test catchments.............................................................................. 256
Figure 66 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, fixed region, VIC)...................................................................................... 291
Figure 67 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, ROI, VIC) .................................................................................................. 291
Figure 68 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, fixed region, VIC)............................................................................................. 292
Figure 69 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, ROI, VIC) ......................................................................................................... 292
Figure 70 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, fixed region, QLD) .................................................................................... 293
Figure 71 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT
and PRT, ROI, QLD) ................................................................................................. 293
Figure 72 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, fixed region, QLD) ........................................................................................... 294
Figure 73 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and
PRT, ROI, QLD)........................................................................................................ 294
Figure 74 L-moment ratio diagram of annual maximum flood series data for VIC.......... 295
Figure 75 L-moment ratio diagram of annual maximum flood series data for WA .......... 295
Figure 76 L-moment ratio diagram of annual maximum flood series data for SA............ 296
Figure 77 L-moment ratio diagram of annual maximum flood series data for TAS ......... 296
Figure 78 L-moment ratio diagram of annual maximum flood series data for NT ........... 297
Figure 79 Visual inspection of distributional fit for GEV, GPA and P3 distributions for
NSW........................................................................................................................... 297
Figure 80 Visual inspection of distributional fit for GEV, GPA and P3 distributions for VIC
.................................................................................................................................... 298
Figure 81 Variation of Ne with different network methods and experiment number for TAS
region (top panel for real data and bottom panel for simulated data)........................ 298
Figure 82 Frequency of Ne with different network methods for TAS region (top panel for
real data and bottom panel for simulated data).......................................................... 299
Figure 83 Variation of Ne with different network methods and experiment number for NT
region (top panel for real data and bottom panel for simulated data)........................ 299
Figure 84 Frequency of Ne with different network methods for NT region (top panel for real
data and bottom panel for simulated data)................................................................. 300
Figure 85 Variation of Ne with different network methods and experiment number for WA
region (top panel for real data and bottom panel for simulated data)........................ 300
Figure 86 Frequency of Ne with different network methods for WA region (top panel for
real data and bottom panel for simulated data).......................................................... 301
Figure 87 Variation of Ne with different network methods and experiment number for SA
region (top panel for real data and bottom panel for simulated data)........................ 301
Figure 88 Selection of predictor variables for the BGLSR model for CV - WA .............. 302
Figure 89 Selection of predictor variables for the BGLSR model for CV using AVPO,
AVPN, AIC and BIC - WA ....................................................................................... 302
Figure 90 Selection of predictor variables for the BGLSR model for the mean flood – WA
.................................................................................................................................... 303
Figure 91 Selection of predictor variables for the BGLSR model for the mean flood using
AVPO, AVPN, AIC and BIC - WA .......................................................................... 303
Figure 92 Selection of predictor variables for the BGLSR model for CV – TAS............. 304
Figure 93 Selection of predictor variables for the BGLSR model for CV using AVPO,
AVPN, AIC and BIC - TAS ...................................................................................... 304
Figure 94 Selection of predictor variables for the BGLSR model for the mean flood – TAS
.................................................................................................................................... 305
Figure 95 Selection of predictor variables for the BGLSR model for the mean flood using
AVPO, AVPN, AIC and BIC - TAS.......................................................................... 305
LIST OF TABLES
Table 1 Flood quantile estimates and associated errors using ARR FLIKE with and without
consideration of rating curve error............................................................................. 103
Table 2 Summary of selected stations Australia wide ....................................................... 107
Table 3 Catchment characteristics variables used in the study.......................................... 109
Table 4 Different combinations of predictor variables considered for the QRT models and
the parameters of the LP3 distribution (QRT and PRT fixed region Tasmania) ....... 117
Table 5 Pseudo ANOVA table for Q20 model for Tasmania (QRT, fixed region and ROI)
.................................................................................................................................... 120
Table 6 Pseudo ANOVA table for Q100 model for Tasmania (QRT, fixed region and ROI)
.................................................................................................................................... 121
Table 7 Pseudo ANOVA table for the mean flood model for Tasmania (PRT, fixed region
and ROI)..................................................................................................................... 121
Table 8 Pseudo ANOVA table for the standard deviation model for Tasmania (PRT, fixed
region and ROI) ......................................................................................................... 121
Table 9 Pseudo ANOVA table for the skew model for Tasmania (PRT, fixed region and
ROI) ........................................................................................................................... 122
Table 10 Regression diagnostics for fixed region and ROI for Tasmania......................... 126
Table 11 Model error variances associated with fixed region and ROI for Tasmania (n =
number of sites in the region) .................................................................................... 127
Table 12 Evaluation statistics (RMSEr and REr) from leave-one-out (LOO) validation for
Tasmania .................................................................................................................... 129
Table 13 Summary of counts/percentages based on the rr values for QRT and PRT for
Tasmania (fixed region). “U” = gross underestimation, “D” = desirable range and “O”
= gross overestimation ............................................................................................... 129
Table 14 Summary of counts/percentages based on the rr values for QRT and PRT for
Tasmania (ROI). “U” = gross underestimation, “D” = desirable range and “O” = gross
overestimation............................................................................................................ 129
Table 15 Summary of the final BGLSR results for NSW ................................................. 131
Table 16 Summary of the catchment characteristics and statistical measures used in the
stepwise regression for the parameters of the LP3 distribution for NSW ................. 136
Table 17 Summary of the catchment characteristics and statistical measures used in the
forward stepwise regression for the flood quantiles of the LP3 distribution (ARIs = 2,
10 and 100 years) for NSW ....................................................................................... 137
Table 18 Pseudo ANOVA table for the mean flood model (PRT, fixed region and ROI,
NSW, VIC and QLD states) (Here n = number of sites in the region, k = number of
predictors in the regression equation, EVR = error variance ratio, σδ0² = model error
variance when no predictor variable is used in the regression model, σδ² = model error
variance when predictor variables are used in the regression model and tr[Σ(ŷ)] = sum of
the diagonals of the sampling covariance matrix) ..................................................... 141
Table 19 Pseudo ANOVA table for the skew model (PRT, fixed region and ROI, NSW,
VIC and QLD states) (variables are explained in Table 18 caption)......................... 142
Table 20 Pseudo ANOVA table for Q20 model (QRT, fixed region and ROI for NSW, VIC
and QLD states) (variables are explained in Table 18 caption)................................. 143
Table 21 Regression diagnostics for the fixed region and ROI for NSW, VIC and QLD. 149
Table 22 Model error variances associated with the fixed region and ROI for NSW, VIC
and QLD (n = number of sites needed for the LP3 parameters and flood quantiles) 151
Table 23 Evaluation statistics (RMSEr and REr) from LOO validation for NSW (Results
NSW for PRT using the weighted regional average standard deviation and skew
models, i.e. no predictor variables given in brackets), VIC and QLD ...................... 153
Table 24 Summary of predictor variables (here log10 is used) .......................................... 162
Table 25 Correlation between the log10 predictor variables used in the analysis .............. 163
Table 26 Results from simulated data, OLSR when σ² = 1............................................... 166
Table 27 Results from simulated data, OLSR when σ² = 0.04.......................................... 168
Table 28 Results from simulated data, GLSR when σδ² = 0.903 and ρ̂(ŷi, ŷj) = 0.30 ....... 168
Table 29 Results from simulated data, GLSR when σδ² = 0.063 and ρ̂(ŷi, ŷj) = 0.70 ....... 168
Table 30 OLSR analysis, MSEP values for calibration and validation data set (observed
data from NSW). Here log10 is used .......................................................................... 171
Table 31 GLSR analysis, MSEP values for calibration and validation data set (observed
data from NSW). Here log10 is used .......................................................................... 171
Table 32 OLSR and GLSR analysis for LOO and MCCV for Q10, optimal models shown
along with summary statistics.................................................................................... 171
Table 33 MSEP for ARI = 100-year .................................................................................. 173
Table 34 OLSR and GLSR analysis for LOO and MCCV for Q100, optimal models shown
along with summary statistics.................................................................................... 173
Table 35 Summary of goodness-of-fit tests for determining parent distribution............... 187
Table 36 Summary of MRE associated with the GEV and P3 distributions ..................... 188
Table 37 Summary of heterogeneity measures for the Australia states............................. 191
Table 38 Coefficients of non-linear interpolation from Figure 43..................................... 194
Table 39 Coefficients and R2 values of Ymax polynomial interpolation from Figure 45 ... 199
Table 40 Comparison of the parameters of the parent distribution and the distribution for
the generated data (distribution: F(x) = exp{-[1 - κ(x - ξ)/α]^(1/κ)}) and correlation
coefficient, ρ. ............................................................................................................. 214
Table 41 Experimental values of Ne for different networks and regions using the real data
(average Ne over the experiment reported) ................................................................ 218
Table 42 Experimental values of Ne for different networks and regions using the simulated
data (average Ne over the experiment reported)......................................................... 219
Table 43 Experimental results in which Ne exceeds N at a particular ARI for different
regions using the real data set .................................................................................... 222
Table 44 Properties of the Constant Ne Spatial dependence model ................................... 228
Table 45 ρ for each pair of sites for the different states/regions ....................... 232
Table 46 Total record length (L) and effective record length (Le) for the all Australian
dataset ........................................................................................................................ 232
Table 47 Coefficients and R2 values of Ymax polynomial interpolation from Figure 56 for N
and Ne sites................................................................................................................. 233
Table 48 CV values for study catchments in Australia...................................................... 237
Table 49 Summary of the finally selected BGLSR models for all the Australian states used
in the validation of LFRM ......................................................................................... 244
Table 50 Regression diagnostics for the ROI approach for the various Australian states and
test catchments ........................................................................................................... 249
Table 51 Summary of error statistics obtained from independent testing associated with the
LFRM model.............................................................................................................. 255
Table 52 Summary of the final BGLSR results for VIC ................................................... 289
Table 53 Summary of the final BGLSR results for QLD .................................................. 290
Table 54 Values of YT corresponding to ARI.................................................................... 311
COMMON NOTATIONS
A catchment area
a, b constants
CS coefficient of skewness
CV coefficient of variation
e error term in regression analysis
G regional mean skewness coefficient
H heterogeneity measure
IN identity matrix
k number of parameters in regression model
L total record lengths of all sites in a group
Le record lengths after correcting for spatial dependence
l sample L-moment
LSK L coefficient of skewness
LCV L coefficient of variation
LKT L coefficient of kurtosis
N total number of streamflow records used in flood frequency analysis and also used
to define number of simulations undertaken
Ne number of independent stations after correcting for spatial dependence
n total number of datasets in regression analysis
na average number of stations in a region
Nsim number of simulated regions for homogeneity testing
p probability
QT flood quantile having return period of T years
R2 coefficient of determination used in OLSR
T return period
tc time of concentration
X n×k matrix of basin characteristics
β k×1 vector of regression coefficients
y vector of dependent variables in the regression model
x0 row vector of basin characteristics at site 0
σδ² model error variance
σ̂δ² sample estimate of model error variance
Λ covariance matrix of regression errors
Λ̂ data-based estimate of Λ
σ² residual variance from OLSR
σδ0² prior mean of the model error variance used in BGLSR
μQ mean annual flood
μi population mean at site i
σi population standard deviation at site i
q̄i sample mean of logs of annual maxima at station i
θ parameter of probability distribution
KT frequency factor for return period T in log Pearson type 3 flood frequency analysis
ρ correlation between sites used in large flood model regionalisation
δi residual error term associated with regression of the mean
Σ sampling error covariance matrix
ρij correlation/distance relationship between stations
ρ̂ij estimated correlation/distance relationship between stations
R²GLSR pseudo coefficient of determination used in GLS regression
ABBREVIATIONS
‘AD’ Anderson Darling Statistic for Homogeneity
ARR-FLIKE Australian Rainfall and Runoff flood frequency analysis software
AUSIFD Australian Intensity Frequency Duration software
AEP Annual Exceedance Probability
AIC Akaike Information Criteria
AMFS Annual Maximum Flood Series data
ARR Australian Rainfall and Runoff
ARI Average Recurrence Interval
ASPE Average Squared Prediction Error
AVPO Average Variance of Prediction for old site
AVPN Average Variance of Prediction for new site
BGLSR Bayesian Generalised Least Squares Regression
BIC Bayesian Information Criteria
BOM Bureau of Meteorology
BPV Bayesian Plausibility Value
Area Catchment area (km2)
CD Compact disk
CMCCV Corrected Monte Carlo Cross validation
FFA Flood Frequency Analysis
forest Fraction of basin covered by medium to dense forest
qsa Fraction quaternary sediment area
GEV Generalised Extreme Value distribution
GLSR Generalised Least Squares Regression
GPA Generalised Pareto distribution
IFM Index Flood Method
ID Instantaneous Discharge
IFD Intensity Frequency Duration
IM Monthly Instantaneous Maximum Data
MMD Monthly Maximum Mean Daily Data
LFRM Large Flood Regionalisation Model
LOO Leave-One-Out Validation
LP3 log Pearson type 3 distribution
MCCV Monte Carlo Cross validation
MCMC Markov Chain Monte Carlo
MEV Model Error Variance
MOM Method of Moments Estimator
ML Maximum Likelihood Estimator
MSEP Mean Squared Error of Prediction
MVN Multivariate Normal Distribution
evap Mean annual evapotranspiration (mm)
rain Mean annual rainfall (mm)
NERC Natural Environment Research Council (UK)
NCWE National Committee on Water Engineering
NSW New South Wales
OLSR Ordinary Least Squares Regression
P3 Pearson type 3 distribution
PM Probabilistic Model
PRM Probabilistic Rational Method
PRT Parameter Regression Technique
PD, POT Peaks over threshold
QLD Queensland
QRT Quantile Regression technique
RR Rating Ratio
TID Rainfall intensity of D-hour duration and T-year average recurrence interval
(mm/hr)
RFFA Regional Flood Frequency Analysis
ROI Region of Influence
C Runoff Coefficient used in rational method
Sden Stream Density (km/ km2)
Slope Slope of central 75% of mainstream S1085 (m/km)
SEP Standard error of prediction
USGS United States Geological Survey
VIC Victoria
WLSR Weighted Least Squares Regression
CHAPTER 1: INTRODUCTION
1.1 GENERAL
This thesis focuses on the design flood estimation problem in ungauged catchments in the
range of frequent to rare average recurrence intervals (ARIs) (2 to 2000 years) using
regional flood frequency analysis (RFFA) approaches. The RFFA attempts to transfer flood
characteristics information from gauged to ungauged catchments using the concept of
‘homogeneous regions’. This thesis, in particular, investigates the research question of how
flood quantile estimation in ungauged catchments can be enhanced by adopting an
ensemble of advanced statistical techniques. The RFFA approaches developed in this thesis
attempt to minimise errors in design flood estimation through a stringent data preparation
scheme, use of sophisticated statistical techniques and an in-depth validation of the
techniques applied. This chapter provides the background, need and objectives of this
research and an overview of the thesis.
1.2 BACKGROUND
The flood phenomenon is a part of the natural disturbance regime, and an intrinsic
component of the natural climate system. It can also be one of the most destructive hydro-
meteorological phenomena in terms of its impacts on human well-being and socioeconomic
activities.
The considerable damage caused by the flooding in Queensland, Australia's north-eastern
state, in 2010, 2011 and 2012 underlines the significance of this issue. The death toll was
estimated at 30 to 35, about 2.1 million people were affected, and the damage was estimated
at around $20 billion. Further losses include the disruption to trade, in particular to
agriculture and mining, both important sources of revenue for Queensland and Australia.
Figure 1 shows the damage caused by flash flooding in Emerald, Central Queensland, in
January 2011.
Figure 1 Flash flooding in Emerald Central Queensland (Oncirculation, 2011)
To estimate the frequency and magnitude of floods for design purposes, the availability of
streamflow data is a fundamental requirement. Flood frequency analysis is often used by
practitioners to support the design of river engineering works, flood mitigation measures
and civil protection strategies. It is generally carried out by fitting peak flow observations
to a suitable probability distribution (Baratti et al., 2012). The estimation of probability of
exceedance for frequent to rare floods is essentially an extrapolation exercise based on
limited observed flood data. Thus the larger the database the more accurate the estimate
should be. From a statistical point of view, estimation from a small sample is likely to give
unreasonable or physically unrealistic parameter sets, especially for the probability
distributions with a large number of parameters (i.e. three or more). In practice, however,
recorded flood data may be quite limited. In many cases, these data may be completely
absent (i.e. ungauged catchment cases). In such situations, RFFA is adopted.
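The at-site fitting step described above can be sketched as follows. This illustrative code fits the LP3 distribution to the log10 flows of an annual maximum series by the method of moments and evaluates a quantile via a frequency factor; the Wilson–Hilferty approximation used here is an assumption of this sketch (the thesis itself uses a Bayesian fitting procedure).

```python
import numpy as np
from statistics import NormalDist

def lp3_quantile(ams, ari):
    """Fit LP3 to annual maxima by the method of moments on log10 flows
    and return the ARI-year flood quantile via a frequency factor K_T."""
    x = np.log10(np.asarray(ams, dtype=float))
    n = len(x)
    mean, std = x.mean(), x.std(ddof=1)
    # sample skew with the usual small-sample correction n/((n-1)(n-2))
    skew = (n / ((n - 1) * (n - 2))) * np.sum(((x - mean) / std) ** 3)
    z = NormalDist().inv_cdf(1.0 - 1.0 / ari)  # standard normal quantile
    if abs(skew) < 1e-6:
        kt = z
    else:
        # Wilson-Hilferty approximation to the Pearson III frequency factor
        kt = (2.0 / skew) * ((1.0 + skew * z / 6.0 - skew ** 2 / 36.0) ** 3 - 1.0)
    return 10.0 ** (mean + kt * std)
```

With short records the skew estimate is highly uncertain, which is one motivation for the regional and Bayesian methods discussed in this thesis.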
The RFFA serves two purposes. For sites where streamflow data are not available, the
analysis is based on regional data (Cunnane, 1989). For sites with available data, the joint
use of data measured at a site, called at-site data, and regional data from a number of
stations in the region provides sufficient information to enable a probability distribution to
be used with greater confidence (Dawdy et al., 2012). This type of analysis represents a
substitution of space for time where data from different locations in a region are used to
compensate for short records at a single site (Stedinger et al., 1993).
RFFA consists of three major steps: (a) identification of homogeneous regions; (b)
development of estimation models; and (c) validation of the estimation models. To form
the homogeneous regions, traditional approaches such as geographical and administrative
regions have often been adopted (I. E. Aust., 1987; Acreman and Sinclair, 1986; Acreman,
1987; Tasker et al., 1996 and Eng et al., 2005); however, these regions often lack in
hydrological similarity (Burn, 1990a and 1990b; Hosking and Wallis, 1993; Merz and
Blöschl, 2005 and Chebana and Ouarda, 2008). Regions based on climatic and physical
catchment characteristics have been proposed (Tasker et al. 1996; Bates et al., 1998 and
Rahman et al., 1999a). Moreover, to avoid problems associated with fixed boundaries, the
region of influence (ROI) approach has been adopted (Burn, 1990a and 1990b; Tasker et
al., 1996; Zrinji and Burn, 1996; Merz and Blöschl, 2005; Eng et al., 2007a, b and Gaál et al., 2008). One critical issue here is how to assign an ungauged catchment to the
appropriate region when there is more than one possible region (Bates et al., 1998).
In relation to the estimation model, several approaches have been proposed. They include
the probabilistic rational method (PRM), the index flood method (IFM) and the quantile
regression technique (QRT). In south–east Australia, the PRM is recommended for general
use in Australian Rainfall and Runoff (ARR), mainly due to its simplicity (I. E. Aust.,
1987). The essential component of this method is a dimensionless runoff coefficient, which
ARR assumes to vary smoothly over geographical space. This assumption may not be
satisfied in many cases, because two nearby catchments can exhibit quite different physical
features. Also, values associated with these runoff coefficients are estimated using
conventional moment estimates with flow records of limited length (some sites had only 10
years of record in the analysis with the ARR1987 RFFA methods). This means that these
runoff coefficient values are affected by severe sampling variability, which can then
introduce significant bias and uncertainty into the final design flood estimates. Criticism
has also been linked to the way the runoff coefficients are mapped; this can be attributed to
the assumption of geographical contiguity as a surrogate to hydrological similarity, an
assumption open to wide criticism. It is also worth mentioning the lack of independent
validation with the PRM in ARR1987.
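The PRM described above reduces to the rational formula Q_T = 0.278 · C_T · I(tc, T) · A, where the 0.278 unit-conversion constant applies for area in km², intensity in mm/h and discharge in m³/s. A minimal sketch follows; the placeholder inputs stand in for the mapped ARR1987 runoff coefficients and IFD design intensities.

```python
def prm_flood(area_km2, intensity_mm_hr, runoff_coeff_t):
    """Probabilistic rational method: Q_T = 0.278 * C_T * I(tc, T) * A.
    C_T is the T-year probabilistic runoff coefficient (mapped regionally
    in ARR1987) and I(tc, T) is the design rainfall intensity (mm/h) for
    the catchment's time of concentration tc. Returns discharge in m^3/s."""
    return 0.278 * runoff_coeff_t * intensity_mm_hr * area_km2

# e.g. a 10 km^2 catchment, 50 mm/h design intensity, C_100 = 0.4
q100 = prm_flood(10.0, 50.0, 0.4)
```

The sampling variability in the mapped C_T values, discussed above, feeds directly and linearly into Q_T, which is why errors in the runoff coefficient matter so much.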
The IFM or index frequency approach (being applicable to both flood and rainfall
estimation) (e.g. Fill and Stedinger, 1998; Madsen et al., 2002; Bocchiola, et al., 2003;
DiBaldassarre et al., 2006 and Lim and Voeller, 2009) has been a popular approach for
estimating flood quantiles since 1960 (Dalrymple, 1960). ARR (I. E. Aust., 1987) did not
favour the IFM as a design flood estimation technique for Australia. The IFM had been
criticised on the grounds that the coefficient of variation of the flood series may vary
approximately inversely with catchment area, thus resulting in flatter flood frequency
curves for larger catchments. In the United Kingdom (UK), an index flood method is
currently recommended in the Flood Estimation Handbook (FEH) where the index flood is
taken as the median annual maximum flood. The growth curve for any site is estimated
using a pooling group, which is formed using catchments considered to be hydrologically
similar to the site of interest. The FEH recommends the generalised logistic (GLO)
distribution combined with the method of L-moments for growth curve estimation. The
FEH RFFA approach was later updated by Kjeldsen et al. (2008), as documented in a
series of papers (e.g. Kjeldsen and Jones, 2009a, 2009b and 2010).
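A minimal sketch of the index flood idea follows, with the median annual maximum as the index flood (as in the FEH). The empirical quantile of the pooled, standardised data used here is a simplification standing in for the GLO/L-moment growth-curve fitting that the FEH actually prescribes.

```python
import numpy as np

def ifm_quantile(pool_ams, index_flood, ari):
    """Index flood method sketch: Q_T (site) = index flood (site) x regional
    growth factor. `pool_ams` is a list of annual maximum series from the
    pooling group; the growth factor is taken as the empirical (1 - 1/ARI)
    quantile of the pooled, median-standardised annual maxima."""
    pooled = np.concatenate([np.asarray(a) / np.median(a) for a in pool_ams])
    growth = np.quantile(pooled, 1.0 - 1.0 / ari)
    return index_flood * growth
```

The criticism quoted above amounts to saying that a single regional growth curve is too steep for large catchments if the coefficient of variation decreases with area.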
The United States Geological Survey (USGS) proposed a QRT where a large number of
gauged catchments are selected from a region and flood quantiles are estimated from
recorded streamflow data, which are then regressed against catchment variables that are
most likely to govern the flood generation process (Benson, 1962). It has been noted that
the method can give design flood estimates that do not vary smoothly with ARI; however,
in such situations hydrological judgement can be exercised to adjust the flood frequency
curves so that they increase smoothly with ARI.
As an alternative to the QRT, the parameters of a probability distribution can be regressed
against the explanatory variables (Tasker and Stedinger, 1989; Madsen et al., 2002; Reis et
al., 2005; Overeem et al., 2009). In the case of the LP3 distribution, regression equations
can be developed for the first three moments i.e. the mean, standard deviation and
skewness for logarithms of the annual maximum flood series. This method here is referred
to as the ‘parameter regression technique’ (PRT). There has been no detailed comparison of
the QRT and PRT for ungauged catchments.
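The contrast between the QRT and PRT can be sketched with ordinary least squares on synthetic data (the analyses in this thesis use BGLSR; all data, descriptors and coefficients below are hypothetical, chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 40
log_area = rng.uniform(1.0, 3.0, n)      # log10 catchment area (km^2)
log_rain = rng.uniform(2.7, 3.2, n)      # log10 design rainfall intensity
X = np.column_stack([np.ones(n), log_area, log_rain])

def ols(X, y):
    """Ordinary least squares coefficients for y ~ X b."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# QRT: regress the at-site log Q100 estimates directly on the descriptors.
log_q100 = 0.5 + 0.8 * log_area + 0.3 * log_rain + rng.normal(0, 0.1, n)
b_qrt = ols(X, log_q100)

# PRT: regress each LP3 parameter instead (only the mean of the log flows
# is shown here); any quantile is then reassembled from the predicted
# parameters, which guarantees quantiles that increase smoothly with ARI.
mean_logq = 0.2 + 0.7 * log_area + 0.2 * log_rain + rng.normal(0, 0.05, n)
b_prt_mean = ols(X, mean_logq)

# Prediction at an ungauged site with known descriptors:
x0 = np.array([1.0, 2.0, 3.0])
q100_ungauged = 10.0 ** (x0 @ b_qrt)
```

The smooth variation of PRT quantiles with ARI is one practical attraction of the parameter-based approach, whereas the QRT offers no such guarantee.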
The ordinary least squares regression (OLSR) estimator has traditionally been used by
hydrologists to estimate the regression coefficients (β) in regional hydrological models (for
both the QRT and PRT). But for the OLSR model to be statistically efficient and
robust, the annual maximum flood series in the region must be uncorrelated, all the sites in
the region should have equal record lengths, and all estimates of the ARI-year events should have equal
variance. Since the annual maximum flow data in a region do not generally satisfy these
criteria, the assumption that the model residual errors in OLSR are homoscedastic is
violated.
Stedinger and Tasker (1985, 1986) developed a generalised least squares regression
(GLSR) model for regional hydrologic regression. The important difference between the GLSR
and OLSR models lies in the development and partitioning of the covariance matrix of
the errors. The GLSR model of Stedinger and Tasker (1985) assumes that the total error
results from two sources: model errors and sampling errors (Tasker and Stedinger, 1989;
Pandey and Nguyen, 1999; Griffis and Stedinger, 2007; Gruber and Stedinger, 2008 and
Micevski and Kuczera, 2009). This is due to the fact that record lengths vary significantly
from site to site and that the flood data are cross correlated spatially. The GLSR procedure
can result in notable improvements in the precision with which the coefficients of regional
hydrologic regression models can be estimated.
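A numerical sketch of this error partitioning follows: the total error covariance is modelled as Λ = σδ²I + Σ, with Σ the (known) sampling error covariance of the at-site estimates, and the model error variance σδ² is chosen so that the weighted residual sum of squares equals its degrees of freedom (a method-of-moments condition). The bisection search is an implementation convenience for this sketch, not the algorithm of Stedinger and Tasker.

```python
import numpy as np

def glsr_fit(X, y, Sigma, max_iter=200, tol=1e-10):
    """GLSR sketch: Lambda(s2) = s2 * I + Sigma; beta solves the weighted
    normal equations; the model error variance s2 satisfies the moment
    condition r' Lambda^-1 r = n - k. Returns (beta, s2)."""
    n, k = X.shape

    def solve(s2):
        Li = np.linalg.inv(s2 * np.eye(n) + Sigma)
        b = np.linalg.solve(X.T @ Li @ X, X.T @ Li @ y)
        r = y - X @ b
        return float(r @ Li @ r), b

    wrss0, b0 = solve(0.0)
    if wrss0 <= n - k:          # sampling error alone explains the spread
        return b0, 0.0
    lo, hi = 0.0, 10.0 * float(np.var(y)) + 1.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        wrss, b = solve(mid)
        if wrss > n - k:        # weighted residuals still too large: raise s2
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return b, 0.5 * (lo + hi)
```

Sites with long records (small sampling variance in Σ) thereby receive more weight than short-record sites, which is exactly the behaviour OLSR cannot provide.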
Furthermore, Reis et al. (2003 and 2005) introduced a Bayesian approach to estimating the
coefficients of the GLSR regional regression model (BGLSR) developed by
Stedinger and Tasker (1985) for hydrological analysis. The results presented in Reis et al.
(2005) show that for cases in which the model error variance is small compared to
sampling error of the at–site estimates, which is often the case for regionalisation of a
shape parameter, the Bayesian estimator provides more reasonable and generally less
biased estimates of the model error variance than the method of moments and maximum
likelihood estimators. The Bayesian approach can also provide a realistic description of the
possible values of the model error variance (Reis et al., 2005; Micevski and Kuczera, 2009;
Haddad et al., 2012 and Haddad and Rahman, 2012). The Bayesian approach also provides a
full posterior distribution of the quantity of interest (a flood statistic, e.g. the mean flood
or a flood quantile), whereas classical methods usually give only a point estimate of the
quantity of interest (Congdon, 2001).
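The quasi-analytic idea can be sketched as follows: with a diffuse prior on β, the regression coefficients integrate out analytically, leaving a one-dimensional posterior for the model error variance that can be evaluated on a grid. The flat grid prior assumed below is an illustrative simplification of the priors discussed by Reis et al. (2005).

```python
import numpy as np

def bglsr_mev_posterior(X, y, Sigma, s2_grid):
    """Posterior weights for the model error variance s2 on a grid, after
    analytically integrating out the regression coefficients under a
    diffuse prior: p(s2 | y) is proportional to
    |Lambda|^-1/2 |X' Li X|^-1/2 exp(-0.5 r' Li r),
    with Lambda = s2 I + Sigma, Li its inverse and r the GLS residual."""
    n, k = X.shape
    logp = []
    for s2 in s2_grid:
        Lam = s2 * np.eye(n) + Sigma
        Li = np.linalg.inv(Lam)
        XtLiX = X.T @ Li @ X
        b = np.linalg.solve(XtLiX, X.T @ Li @ y)
        r = y - X @ b
        logp.append(-0.5 * (np.linalg.slogdet(Lam)[1]
                            + np.linalg.slogdet(XtLiX)[1]
                            + float(r @ Li @ r)))
    w = np.exp(np.array(logp) - max(logp))
    return w / w.sum()   # normalised posterior weights over s2_grid
```

The full posterior over the grid, rather than a single point estimate, is what supports the uncertainty statements referred to above; a posterior mean is simply `(s2_grid * w).sum()`.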
Given the above advantages, the BGLSR is applied in this thesis using both the QRT and
PRT estimation techniques. This is carried out in a ROI framework.
Validation is generally used to assess a model’s performance in hydrologic regression
analyses (Sun et al., 2011 and Tsakiris et al., 2011). The validation procedure has some
appealing and important properties, for instance, it assists in the selection of an appropriate
model according to its prediction ability for the gauged sites, while at the same time it
evaluates the prediction ability of the model for possible ungauged catchments. In the most
commonly adopted validation approach, a fixed percentage of the data (e.g. 10% or 20%) is
left out while building the model, and then the developed model is tested on the left-out
data (Stone, 1974; Michaelsen, 1987 and Xu et al., 2005). This type of ‘split sample’
validation approach has limitations, as it can provide an inadequate assessment when the full
data set is not used in a random and unbiased fashion. To make use
of all the available sites in the validation in a more efficient and random manner, two
validation approaches are tested in this thesis, which are the leave-one-out (LOO) and
Monte Carlo Cross validation (MCCV) techniques (Xu and Liang, 2001; Xu et al., 2005
and Sun et al., 2011). Both the LOO and MCCV validation techniques are used with real
and simulated flood datasets in the frameworks of the commonly applied OLSR approach
and the more powerful GLSR approach.
Large to rare flood frequency analysis is a remarkably challenging task. One often needs to
estimate floods with an annual exceedance probability (AEP) much smaller than 1%, while
the streamflow record lengths are usually much shorter, being between 20 and 100 years in
most places in Australia. For example, the average record length in the Australian RFFA
database is around 33 years, as reported in Rahman et al. (2009). Hence, it is generally the
case that the required flood magnitudes in the rare range have hardly been recorded,
meaning significant extrapolations from the available flood data are needed. It is therefore
not surprising that a suitable statistical approach is required to estimate large to rare floods
with a reasonable degree of consistency. This thesis therefore proposes a new large flood
regionalisation model, which also takes into account inter-site dependence (i.e. the number
of independent sites (Ne), which reduces the net information available for regional
analysis).
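The information loss caused by inter-site dependence can be illustrated with the classical effective-sample-size result for an equicorrelated network; the thesis develops a more general correlation–distance model, so the formula below is only a first-order illustration.

```python
def effective_sites(n_sites, mean_rho):
    """Effective number of independent sites Ne for n_sites stations whose
    annual maxima share an average pairwise correlation mean_rho
    (equicorrelation assumption): Ne = n / (1 + (n - 1) * rho)."""
    return n_sites / (1.0 + (n_sites - 1.0) * mean_rho)

# e.g. 30 stations with an average inter-site correlation of 0.2 carry
# roughly the information of only about 4.4 independent stations.
ne = effective_sites(30, 0.2)
```

In a station-year analysis this shrinks the effective pooled record length, and hence the largest ARI that can credibly be assigned to the pooled maxima.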
1.3 THE NEED FOR THIS RESEARCH
Australia is a large continent with many streams, many of which are ungauged or have little
recorded flood data. For example, out of the 12 drainage divisions in Australia, seven do
not have a single stream with 20 or more years of recorded flood data (Vogel et al., 1993).
Therefore, RFFA techniques are quite important for Australia, as they can provide
reasonably accurate design flood estimation in these ungauged or poorly gauged
catchments. The sizing of minor hydraulic structures such as culverts, farm dams and
embankments in small ungauged catchments is a common task faced by practising
engineers. The average amount spent on these projects per year was estimated at
approximately $250 million as at 1985 (Flavell, 1985; Pilgrim, 1986); this is equivalent to
about $750 million per annum in 2012 (based on long term CPI series for Australian capital
cities, ABS, 2012).
Australian Rainfall Runoff, Book 4 (I. E. Aust., 1987) states that almost 50% of Australia’s
annual expenditure on projects requiring design flood estimation is on small to medium
sized ungauged catchments. These small catchments typically have an upper limit of 25
km2, while medium sized catchments have an upper limit of 1000 km2 (I. E. Aust., 1987).
Given the economic significance, the design flood estimates in small to medium ungauged
catchments need to be as accurate as possible, since under- and over-estimation are associated
with higher flood damage costs and increased construction costs, respectively. Both of
these situations are undesirable.
Many RFFA techniques have been proposed and used over the
years. It is well understood amongst researchers in hydrology that some of these
approaches (such as the PRM in ARR 1987) are not based on a hydrologically and statistically
meaningful rationale. Most of these methods are likely to introduce significant error in
flood quantile estimates. As there are flaws in these empirical approaches, further research
is needed to develop more reliable alternatives that can provide more accurate design flood
estimates along with estimation uncertainty.
Currently within Australia there is no one universally accepted RFFA method that can be
applied with confidence; instead, there are many local approaches, which have hardly been
rigorously validated. For example, ARR (I. E. Aust., 1987) has made recommendations
that vary from state to state, these being the PRM, IFM and a variety of other empirical
approaches such as the synthetic unit hydrograph and Main Roads methods for the
State of Queensland. Since ARR came out in 1987, there have been some notable
advancements in at–site and RFFA (see Chapter 2 for more details) in Australia and
internationally (e.g. Stedinger and Tasker, 1985 and 1986, Tasker and Stedinger, 1989;
Kuczera, 1999a; Bates et al., 1998; Rahman, 2005 and Micevski and Kuczera, 2009). There
have been a number of studies that have dealt with different forms of RFFA and ways of
reducing errors in quantile estimates (e.g. Stedinger and Tasker, 1985 and 1986; Tasker et
al., 1996; Burn, 1990a and 1990b; Zrinji and Burn, 1996; Reis et al., 2005 and Griffis and
Stedinger, 2007). These methods essentially deal with ways to increase the sample size (by
pooling hydrological data) and to reduce heterogeneity and sampling errors. In Australia, there is
also the extra benefit of having over 20 years of additional streamflow data (since the
publication of ARR1987) at many gauged sites that can be incorporated into the new RFFA
techniques. This is likely to reduce uncertainty in design flood estimates for the ungauged
catchments.
It can therefore be stated that there is a need for the development and testing of new RFFA
methods using the most updated flood data for Australia. This thesis embarks on these
tasks, which are likely to form the scientific basis for recommending new RFFA methods in
the upcoming revision of the ARR.
1.4 RESEARCH QUESTIONS
This thesis, in particular, examines the following research questions in the context of
RFFA:
1. How to reduce the uncertainty in flood data to be used in RFFA modelling by
employing rigorous data preparation and checking techniques?
2. How to deal with a high level of regional heterogeneity in RFFA (found by other
researchers in Australia)?
3. How to form acceptable regions in Australia where the degree of heterogeneity has
been found to be quite high?
4. Whether regression-based approaches can be adopted in Australia to develop
statistically sound regional flood estimation models?
5. Whether the use of sophisticated statistical techniques such as BGLSR combined
with ROI can help to reduce the uncertainty in design flood estimates and thereby to
form the basis of uncertainty estimation in RFFA?
6. How a more rigorous validation approach (LOO or MCCV) can be applied with
RFFA methods?
7. How a new RFFA method can be developed for design flood estimation in the large
to rare flood ranges that explicitly accounts for spatial dependence in the annual
maximum flood series data and that can also be applied relatively easily in practice?
1.5 MAJOR TASKS
The research questions presented in Section 1.4 are answered/investigated in this thesis by
undertaking the following major tasks:
(i) Prepare a critical literature review to ascertain the current state of knowledge in
RFFA techniques with a focus to identify gaps and limitations in the current
research and thereby to formulate research questions to be investigated in this
thesis.
(ii) Prepare an Australian national flood and catchment database that can be used in
the proposed research which mostly satisfies the principal assumptions in
RFFA.
(iii) Develop the regression based RFFA techniques such as BGLSR-QRT and
BGLSR-PRT for the design flood estimation in the small to frequent flood
range (ARIs of 2 to 100 years). Also, compare the BGLSR-QRT and BGLSR-
PRT methods in the fixed region and ROI frameworks.
(iv) Compare two validation approaches, LOO and MCCV, thereby assessing their
applicability to RFFA.
(v) Develop the LFRM for regional flood estimation in the large to rare flood
ranges (ARIs of 100 to 2000 years) using a comprehensive Australian dataset.
(vi) Develop a generalised spatial dependence model to account for the inter-site
dependence (also known as spatial dependence) of annual maximum flood
series data in the application of the LFRM. Benchmark the developed LFRM
using a split-sample validation and by comparing it with the results from
alternative RFFA methods.
1.6 CONTRIBUTIONS OF THIS RESEARCH TO THE UNDERSTANDING
OF THE RFFA PROBLEM
This thesis attempts to make best use of the available streamflow data by developing
efficient regional data pooling methods and high-end statistical techniques. This focuses on
building a quality-controlled database as well as development of appropriate model
validation techniques. It develops a Bayesian GLS regression procedure with ROI approach
to tackle the excessive regional heterogeneity and to deliver more efficient parameter
estimation techniques of the adopted regional prediction equations. This thesis covers the
frequent and rare flood estimation problems ranging from 2 to 1000 years ARI, which can
possibly be extended to 2000 years ARI.
The thesis has made a notable contribution in the regional flood frequency analysis
research field, as evidenced by the publication of nine refereed journal papers (two in former
ERA-ranked A* journals, two in A and five in B category journals). These published papers are listed in
Appendix A.
1.7 OUTLINE OF THE THESIS AND CHAPTER INTRODUCTIONS
The investigations carried out in this research are presented across 9 chapters, as described
below.
Chapter 1 gives a brief introduction to the overall study, highlighting the background and
need for this research. The research questions to be investigated and major tasks to be
undertaken to answer the research questions are also identified.
Chapter 2 provides a critical literature review on the various aspects of RFFA. At the onset
of this chapter, the basic issues related to assumptions in flood frequency analysis,
distributional choices, regional homogeneity and spatial dependence are discussed.
Different RFFA methods which include the IFM, Station-year approach, Bayesian and
Monte Carlo methods and the PRM are presented. The QRT and PRT are discussed in
more detail with more emphasis on GLSR and BGLSR. The ROI approach is also critically
reviewed in relation to its use in previous applications and its relevance to this study. The
second part of this chapter discusses model validation in hydrological regression analysis.
A brief history of model validation is presented from a wide range of statistical applications
along with previous applications in the hydrological field. Finally, a brief history of large
flood estimation is given with examples of some of the methods currently used in Australia
and internationally. Overall, this chapter gives a summary of the merits and disadvantages
of each approach, thereby laying the foundation for the proposed research.
Chapter 3 describes the statistical techniques adopted in this thesis for the estimation of
design floods in the small to medium (frequent) ARI range and for the validation of
regional hydrological regression models. At the onset of this chapter a flow chart is
provided which illustrates the statistical techniques used in the thesis. Estimation of at-site
flood frequency is outlined using the LP3 distribution in a Bayesian framework. The
classical formulation of the GLSR problem found in the Econometrics field is presented to
provide an overview of the method. The chapter then goes on to provide the formulations
of the GLSR model by Stedinger and Tasker (1985 and 1986) for use in hydrological
regression analysis. The Bayesian methodology is outlined in greater detail for use with the
GLSR approach, hence the classical Bayesian formulation is also summarised.
The quasi-analytic Bayesian approach as outlined by Reis et al. (2005) for the
regionalisation of shape parameters is expanded on by developing a BGLSR model for the
regionalisation of quantiles and parameters of the LP3 distribution (QRT and PRT,
respectively). This methodology includes formulation of the likelihood function, the prior
distributions of the β coefficients and the model error variance of the regression model for
the QRT and PRT. Setting up the error covariance matrices, which are vital for the solution
of the BGLSR equations, are also presented. The steps and formulation involved in
selecting the best predictor variables for use with the BGLSR are outlined. The ROI
framework is then described in the light of its application for regionalising the parameters
and quantiles of the LP3 distribution. All the statistical diagnostics and formulation
regarding the residual analysis are also outlined in sufficient detail, along with the
statistical measures of model performance. Also, a step by step framework for regional
uncertainty analysis is presented for obtaining confidence limits with regional flood
estimates.
In the second part of Chapter 3 the mathematical and statistical techniques related to model
validation for use with regional hydrological regression is outlined. Firstly the hydrological
regression problem is defined. The formulations regarding the LOO and MCCV validation
techniques are derived. Finally, the details regarding the statistical techniques for
generating the simulated data for testing with the LOO and MCCV are discussed in detail.
The assembly of streamflow data is an important step in any RFFA study. Chapter 4
describes various aspects of streamflow data collation such as selection of the study
catchments, filling of gaps in the annual maximum flood data series, testing the data for
any suspected trends (as one of the assumptions of flood frequency analysis is that the data
must exhibit stationarity and be homogeneous), exploring rating curve errors associated with
the annual maximum flood data (flood data often has notable error associated with it, hence
identification of this is important) and checking for outliers (both low and high outliers
may be present in annual maximum flood data, these should be identified and treated
accordingly). This chapter also presents the final set of catchments to be used in this thesis.
Chapter 4 also covers the selection of the climatic and physical catchment characteristics
variables that govern flood generation process and can be used in RFFA models.
Chapter 5 integrates the techniques provided in Chapter 3 into a practical BGLSR regional
hydrologic regression framework, which is able to address the issues relevant to the
estimation of flood quantiles and statistics in an efficient manner. Chapter 5 also presents
the results associated with the RFFA for small to medium range ARIs looking at the
differences between fixed region and ROI frameworks for both the BGLSR-QRT and PRT
methods. The results are illustrated for the states of Tasmania, New South Wales (NSW),
Victoria and Queensland. The advantages of the BGLSR-ROI are outlined in sufficient
detail.
Chapter 6 presents the results of the comparison of two model validation techniques, the
LOO and MCCV in a hydrological regression framework for the state of NSW. Both the
OLSR and GLSR are applied to simulated and real datasets. This chapter also illustrates
through detailed examples the overall advantages and disadvantages of the proposed
methods for model selection and validation in RFFA.
Chapter 7 presents the estimation of floods in the large to rare flood range. The
methodology, detailed investigation and results associated with the LFRM are discussed in
detail. Chapter 7 begins with a brief discussion of the LFRM concept, which is based on the
Station-year approach. The issue of inter-site dependence in general is discussed in the
light of the application of the LFRM. The chapter also discusses the comprehensive
Australian annual maximum dataset used for the analysis. The issues of identification of a
probability distribution and homogeneity in the context of LFRM are investigated and
discussed. The theory and development of the LFRM is outlined assuming spatial
independence initially. This chapter also outlines a methodology for deriving simulated
data which is used for estimating the effective number of independent sites, as it was
recognised that the observed data had limitations relating to sampling variability and
homogeneity issues.
Chapter 8 illustrates how the effect of inter-site dependence is tackled by introducing the
‘effective number of sites (Ne) concept’. The steps and formulation needed for determining
the typical degree of spatial dependence in a network or region is discussed in detail. The
estimation of Ne is then derived assuming a generalised extreme value (GEV) distribution
with a simple model that ignores possible variation with ARI.
The results for Ne are then discussed and compared in detail for both the real and simulated
data sets; these results helped to establish the behaviour of Ne in a network and
region. The procedure for generalising the spatial dependence is provided
along with the comprehensive results from this investigation. The LFRM was then
revisited using the newly developed spatial dependence model and applied to ungauged
catchments by developing prediction equations using BGLSR for the mean flood and
coefficient of variation of annual maximum floods.
A summary, conclusion and recommendations for further research are presented in Chapter
9.
There are four appendices, as follows. Appendix A presents the refereed journal papers
that have been published or that are under review based on the research presented in this
thesis. Appendix B presents additional results associated with Chapter 5. Appendix C
presents additional results associated with Chapters 7 and 8 while Appendix D provides
some extra details on the homogeneity tests used in Chapter 7.
CHAPTER 2: REVIEW OF REGIONAL FLOOD FREQUENCY
ANALYSIS TECHNIQUES, MODEL VALIDATION AND LARGE
FLOODS
2.1 GENERAL
The aim of this chapter is to review previous studies on regional flood frequency analysis
(RFFA) techniques with a particular emphasis on the estimation of flood quantiles in the
range of average recurrence intervals (ARIs) of 2 – 100 years in relation to the quantile and
parameter regression techniques. The concepts of fixed region and region of influence
(ROI) approaches are discussed and past applications are presented. This chapter also
reviews previous studies on the validation of regression models, with particular emphasis
on hydrological regression. Finally, this chapter reviews past studies in the area of large to
rare flood estimation. Both the advantages and limitations of the methods presented are
outlined.
At the beginning, the basic issues on RFFA such as regional homogeneity, inter-site
dependence, and distributional choices are reviewed. A brief discussion is then presented
on identifying homogenous regions based on annual maximum flood series. The review of
RFFA methods as outlined above is then presented. A summary of the findings from this
review is given at the end of the chapter.
2.2 BASIC ISSUES
2.2.1 REGIONAL FLOOD FREQUENCY ANALYSIS
The availability of streamflow data is an important aspect in any flood frequency analysis.
The estimation of the probability of occurrence of floods in the credible limit range (ARIs
of 2 – 100 years) and beyond the credible limit (large to rare floods) is an extrapolation based on
limited recorded flood data. Thus, the larger the recorded data set, the more accurate the
estimates will be. From a statistical view point, estimation from a small sample may give
unreasonable or physically unrealistic parameter estimates, especially for distributions with
a large number of parameters (three or more). Large variations associated with small
sample sizes cause the estimates to be uncertain and biased. In practice, however, data may
be limited or in some cases may not be available for a site. In such situations, RFFA is
most useful.
RFFA is a technique of transferring information from gauged sites to ungauged sites.
RFFA serves two purposes. For sites where data are not available, the analysis is based on
regional data (Cunnane, 1989). For sites with limited data, the joint use of data recorded at
a site, called at-site data, and regional data from a number of stations in a region provides
sufficient information to enable a probability distribution to be used with greater reliability.
This type of analysis represents a substitution of space for time where data from different
locations in a region are used to compensate for short records at a single site (National
Research Council, 1988; Stedinger et al., 1993).
2.2.2 REGIONAL HOMOGENEITY
RFFA is based on the concept of regional homogeneity which assumes that annual
maximum flood populations at several sites in a region are similar in statistical
characteristics and are not dependent on catchment size (Cunnane, 1989). Although this
assumption may not be strictly valid, it is convenient and effective in most applications.
One of the simplest RFFA procedures that has been used for a long time is the index flood
method (IFM). The key assumption in the IFM is that the distribution of floods at different
sites within a region is the same except for a site-specific scale or index flood factor.
Homogeneity in regards to the index flood relies on the concept that the standardised
regional flood peaks have a common probability distribution with identical parameter
values. The identification of homogenous regions is an elementary step in RFFA (Bates et
al., 1998). The application typically involves the allocation of an ungauged catchment to an
appropriate homogenous group and the prediction of flood quantiles using developed
models based on catchment characteristics (Bates et al., 1998). That is, the RFFA based on
homogenous regions can transfer the information from similar gauged catchments to
ungauged catchments to allow for flood prediction.
There have been many techniques developed which attempt to establish homogenous
regions. For example, the probabilistic rational method (PRM) uses geographical contiguity
as an indication of homogeneity; that is, catchments near each other are assumed to have
similar runoff coefficients (I. E. Aust., 1987).
Looking at homogeneity from a theoretical point of view, two catchments’ annual
maximum flood series may be treated as homogenous with respect to flood behaviour if
they both satisfy two criteria: the inputs (such as rainfall) to the hydrological systems are
identical, and the climatic and physical characteristics changing the input to flood peak are
the same. No two catchments can satisfy these criteria perfectly based on the fact that each
catchment has unique physical characteristics and that each catchment has different
climatic inputs. In the search for practical homogeneity, one has to make decisions on the
degree of similarity or dissimilarity that is acceptable to identify a cut-off point where a
region is acceptably homogenous or heterogeneous, in consideration of the practical
applications of the RFFA techniques.
In defining homogenous regions for use in RFFA, a balance has to be made between
including more sites for increased information and maintaining an acceptable level of
homogeneity. In most situations when more sites are added to a region, certainly more
information is gained about the flood regime; however, sites that are hydrologically
dissimilar can increase the heterogeneity in the region.
2.2.3 INTER – SITE DEPENDENCE
Some RFFA methods make use of inter–site dependence (see also section 2.4.1) while
others do not. As reported by Cunnane (1988), inter-site dependence means that streamflow
records across a region tend to show similar behaviour within any given timeframe. This
means that:
1) In some years the annual maximum flows at all sites are due to a single widespread
meteorological event.
2) In relatively dry years, peak flows are generally low over the entire region, in which
case all annual maxima will be low.
To be able to counteract these trends in RFFA, previous studies have indicated that a
concurrent record of sufficient length should be adopted (Stedinger, 1983).
Inter-site dependence can be viewed as disadvantageous, as it reduces the value of
additional information for regional analysis, i.e. inter-site dependence limits the increase of
information from an increase in the number of stations in a region. On the other hand, it is
beneficial to the derivation of flood quantiles for ungauged sites, as it allows transfer of
information from gauged to ungauged sites. The effects of inter-site dependence on large
flood estimation are discussed in more detail in Chapter 7.
2.2.4 DISTRIBUTIONAL CHOICES
Selection of an appropriate probability distribution to be used in flood frequency analysis is
of prime importance in at-site and RFFA. It has also been a topic of interest for a long time
and one that is filled with controversies (Bobée et al., 1993). Selecting a probability
distribution has received widespread attention by many researchers. The recent literature in
this field is wide and varied and has been characterised by a proliferation of mathematical
models that often lack theoretical justification but are applied in a simplistic manner to
estimate flood flows. Benson (1968) and NERC (1975) devote considerable attention to
this problem. Cunnane (1989) summarised the distributions commonly used in hydrology,
mentioning 14 different distributions. Kidson and Richards (2005) present an informative
summary on the assumptions and alternatives for distributional choices. They cover aspects
such as data choice, model choice and alternatives and the inclusion of historical and
paleoflood data (see Stedinger and Cohn, 1986; Jin and Stedinger, 1989; Pilon and
Adamowski, 1993; Salas et al., 1994; Cohn et al., 1997; Kuczera, 1999; Martins and
Stedinger, 2001; O’Connell et al., 2002; and Reis and Stedinger, 2005). These studies
generally show that the use of historical information can be of great value in the reduction
of the uncertainty in flood quantile estimates.
In some countries, a common distribution has been recommended to achieve uniformity
between different design agencies. The U.S.A. Interagency Advisory Committee on Water
Data (IACWD, 1982) and the Institution of Engineers Australia (I. E. Aust., 1987)
recommend the log Pearson type 3 (LP3) distribution for use in the United States and
Australia, respectively. Other distributions that have received considerable attention
include the extreme value types 1, 2, 3 (EV1, 2 or 3), generalised extreme value (GEV)
(NERC, 1975), wakeby (Houghton, 1978), generalised pareto (GPA) (Smith, 1987), two-
component extreme value (Rossi et al., 1984) and the log-logistic distribution (Ahmad et
al., 1988).
The use of a standard distribution has been criticised by Wallis and Wood (1985) and
Potter and Lettenmaier (1990). They argue that a reassessment of the use of the LP3
distribution for practical flood design is overdue. Vogel et al. (1993) studied the suitability
of a number of distributions (including the LP3) for Australia. They found that the GEV
and wakeby distributions provide the best approximation to flood flow data in the regions
of Australia that are dominated by rainfall during the winter months; for the remainder of
the continent, the GPA and wakeby distributions provide better approximations. For the
same data set, the LP3 performed satisfactorily, but not as well as either the GEV or GPA
distribution. The distributions that have attracted the most interest as possible alternatives
to the LP3 are the GEV and wakeby (Bates, 1994). Studies by Rahman et al. (1999b) and
Haddad and Rahman (2008) showed that the GEV distribution fitted by LH moments
provides better results than the LP3 distribution in south-east Australia, in particular for New South Wales (NSW)
and Victoria.
Laio et al. (2009) presented a procedure to identify suitable probability distributions for
hydrological extremes. The objective of this study was to verify the most appropriate
distribution using various goodness-of-fit tests. This study used real (data from the United
Kingdom) and simulated data. It was found that no single distribution gave the best fit in
every case; however, the model selection tests were a step forward in identifying the most suitable probability
distribution. More recent studies by Haddad and Rahman (2011) – (Journal paper can be
found in Appendix A) compared seven probability distributions (EV1, log normal (LN),
normal (NORM), GEV, Pearson type 3 (P3), LP3 and EV2) for the state of Tasmania.
Using the model selection based on the Akaike information criterion (AIC), Bayesian
information criterion (BIC) and the modified Anderson Darling test (AD) as outlined by
Laio et al. (2009), they showed that the LN distribution with the Bayesian parameter fitting
procedure provided more reliable results in terms of bias and standard error than the
competing models for Tasmania.
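The AIC-based selection described above can be sketched as follows. This is a minimal illustration (not the thesis code) using maximum likelihood fitting from scipy; the candidate distributions and the synthetic annual maximum series are purely illustrative.

```python
import numpy as np
from scipy import stats

def aic_rank(annual_max, candidates=("gumbel_r", "lognorm", "genextreme")):
    """Fit each candidate distribution by maximum likelihood and rank by AIC."""
    scores = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(annual_max)                  # ML parameter estimates
        loglik = np.sum(dist.logpdf(annual_max, *params))
        scores[name] = 2 * len(params) - 2 * loglik    # AIC; lower is better
    return sorted(scores.items(), key=lambda kv: kv[1])

# Synthetic annual maximum series (illustrative only).
flows = stats.genextreme.rvs(-0.1, loc=100, scale=30, size=60, random_state=1)
ranking = aic_rank(flows)
best = ranking[0][0]  # name of the best-fitting candidate by AIC
```

BIC differs only in replacing the 2k penalty with k·ln(n), so the same loop applies.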
2.3 METHODS FOR IDENTIFICATION OF HOMOGENEOUS REGIONS
The methods for obtaining homogenous regions are based on geographical contiguity,
flood characteristics alone, or catchment characteristics alone. The theoretical aspects,
limitations and associated problems with identification of homogenous regions based on
flood data (annual maximum series) are discussed below.
In the approach based on flood data, the degree of homogeneity of a proposed group is judged on the basis of a
dimensionless coefficient of the annual maximum flood series, such as the coefficient of
variation (CV), coefficient of skewness (CS) or similar measures. Examples are given by
Dalrymple (1960), Wiltshire (1986a), Acreman and Sinclair (1986), Vogel and Kroll
(1989), Chowdhury et al. (1991), Pilon and Adamowski (1992), Lu and Stedinger (1992),
Hosking and Wallis (1993) and Fill and Stedinger (1995a, b).
Dalrymple (1960) proposed a homogeneity test based on the sampling distribution of the
standardised 10 year annual maximum flow, assuming an EV1 distribution. Wiltshire
(1986a, b) presented a test based on the sampling distribution of CV to judge the degree of
homogeneity in a region. He tested the efficiency of the proposed test on simulated data
and concluded that “it is clear that the test in its present form is unsuitable for use in
assessing regional homogeneity”. Acreman and Sinclair (1986) used a likelihood ratio test
based on the assumption of an underlying GEV distribution.
Hosking and Wallis (1991, 1993) proposed a heterogeneity measure based on the L
moment ratios: the L coefficient of variation (LCV), L coefficient of skewness (LSK) and
L coefficient of kurtosis (LKT). The advantage of this test is that, being based on L
moments, it is not distribution-specific like those mentioned above. This test has received considerable
attention since its inception (e.g. Pearson, 1991; Thomas and Olsen, 1992; Alila et al.,
1992; Guttman, 1993; Zrinji and Burn, 1996; Bates et al.,1998; Rahman et al.,1999b,
Kjeldsen and Rosbjerg, 2002; Madsen et al., 2002; DiBaldassarre et al., 2006; Castellarin et
al., 2007; Chebana and Ouarda, 2008 and Gaume et al., 2010).
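The at-site L moment ratios that feed the Hosking-Wallis measure can be computed directly from ordered data. Below is a minimal sketch (illustrative, not the thesis implementation) using the standard unbiased sample estimators of the probability-weighted moments; the site data are hypothetical.

```python
import numpy as np

def lmoment_ratios(sample):
    """Return (LCV, LSK): the sample L coefficient of variation and L skewness."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    # Unbiased probability-weighted moments b0, b1, b2
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    # First three L moments
    l1, l2, l3 = b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0
    return l2 / l1, l3 / l2

# At-site ratios for one hypothetical station; the spread of LCV across the
# stations of a region, relative to its simulated sampling variability, is what
# drives the Hosking-Wallis heterogeneity statistic.
site = [120.0, 95.0, 210.0, 160.0, 88.0, 300.0, 140.0, 175.0, 110.0, 250.0]
lcv, lsk = lmoment_ratios(site)
```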
Cunnane (1988) mentioned that identification of a homogeneous region is necessarily
based on statistical tests of hypothesis, the associated power of which, with currently
available amounts of hydrological data, is low. Thus it is not possible to divide, with great
assurance, a large number of catchments into homogeneous subgroups using flow records
with limited lengths. Indeed from an Australian perspective homogeneity cannot always be
satisfied (e.g. Haddad, 2008; Haddad and Rahman, 2012; Ishak et al., 2011 and Rahman,
1997). With the existence of large predictive uncertainty, short record lengths and the
heterogeneity that plagues Australian catchments, flood estimation methods that can deal
with heterogeneity and predictive uncertainty in an efficient manner are needed.
2.4 REGIONAL FLOOD FREQUENCY ANALYSIS METHODS –
DIFFERENT APPROACHES
There are a number of RFFA methods based on streamflow data that have been reported.
Some of the most commonly used methods are discussed below.
2.4.1 INDEX FLOOD METHOD
The index flood method (IFM) is a regional frequency approach for transferring flood or
rainfall characteristics information from a group of gauged sites to an ungauged site of
interest (Dalrymple, 1960; Madsen et al., 2002 and DiBaldassarre et al., 2006). The
estimation of a flood quantile by the IFM can be expressed by:
QT = μZT (2.1)

where μ is the scaling factor, called the index flood, and ZT is a dimensionless
growth factor (or growth curve). In many cases the index flood is taken to be the mean of
the annual maximum flood series, which is a site-specific value, while the growth
factor is assumed to be constant for the entire homogenous region under consideration.
In the IFM, the dimensionless regional growth curve is used to estimate ZT. The flood
quantile having an ARI of T year is then obtained from Equation 2.1. In the case of a
gauged site, the at-site mean flood is used in Equation 2.1; for an ungauged site, μ is
estimated using regional information. Equation 2.1 is based on the following variables:
QT is the flood quantile at a site, with an ARI of T years;
ZT is the regional growth factor, which defines the frequency distribution common to all
the sites in a homogenous region; and
μ is known as the index flood, which is typically represented (in gauged catchments) by
the mean of the at–site annual maximum flood series. Being used as a scale parameter, it is
recognised as the term which dictates the difference in quantiles between individual sites
within the homogenous region.
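Equation 2.1 amounts to a simple scaling of a regional growth curve by the at-site index flood. A minimal sketch, with purely illustrative growth-factor values rather than a fitted regional curve:

```python
def index_flood_quantile(index_flood, growth_factor):
    """Q_T = mu * Z_T (Equation 2.1)."""
    return index_flood * growth_factor

# Hypothetical regional growth factors Z_T (dimensionless) for a homogeneous region.
growth_curve = {2: 0.9, 10: 1.6, 50: 2.4, 100: 2.9}
mean_annual_flood = 150.0  # at-site mean (m^3/s): the index flood for a gauged site
q100 = index_flood_quantile(mean_annual_flood, growth_curve[100])
```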
When the IFM is to be applied to the ungauged catchment case, where there is no data
available, the difficulty in estimating μ becomes evident. Such estimation is
typically performed via multiple regression between the mean annual flood, now denoted
by Q̄, and catchment and climatic characteristics (catchment characteristics) within the
region (e.g. Fill and Stedinger, 1998). The general form of this regression equation can be
expressed as:
Q̄ = aB^b C^c D^d … (2.2)
where B, C, D, … are catchment characteristics and a, b, c, d, … are parameters of the
regression equation estimated by either ordinary or generalised least squares regression
(OLSR and GLSR); the GLSR method is discussed in more detail in section 2.5.
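Because taking logarithms linearises Equation 2.2 (log Q̄ = log a + b log B + c log C + …), its coefficients can be fitted by ordinary least squares in log space. A minimal sketch with synthetic (assumed) catchment data constructed to obey the model exactly:

```python
import numpy as np

def fit_log_linear(qbar, predictors):
    """OLS fit of Equation 2.2 in log space; returns (log a, b, c, ...)."""
    X = np.column_stack([np.ones(len(qbar))] + [np.log(p) for p in predictors])
    y = np.log(qbar)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical catchments: area (km^2) and mean annual rainfall (mm).
area = np.array([50.0, 120.0, 300.0, 800.0, 45.0, 500.0])
rain = np.array([900.0, 1100.0, 1300.0, 1000.0, 1500.0, 950.0])
qbar = 0.05 * area**0.7 * rain**0.9  # synthetic mean floods obeying the model
coef = fit_log_linear(qbar, [area, rain])
# With noise-free synthetic data the fit recovers b = 0.7 and c = 0.9 exactly.
```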
The IFM or index frequency approach (being applicable to both flood and rainfall
estimation) (e.g. Madsen et al., 2002 and DiBaldassarre et al., 2006) has been a popular
approach for estimating flood quantiles since 1960 (Dalrymple, 1960). The assumption is
made that the distribution of floods at different sites within a homogeneous region is the
same except for a site-specific scale or index flood factor. Homogeneity with regards to the
index flood relies on the concept that the standardised flood peaks from individual sites in
the region follow a common probability distribution with identical parameter values. Of
all the methods discussed in this thesis, the IFM involves the strongest assumptions
on homogeneity.
Australian Rainfall and Runoff (ARR) (I. E Aust., 1987) did not favour the IFM as a design
flood estimation technique for Australia. The IFM had been criticised on the grounds that
the coefficient of variation of the flood series may vary approximately inversely with
catchment area, thus resulting in flatter flood frequency curves for larger catchments. This
had particularly been noticed in the case of humid catchments that differed greatly in size
(Dawdy, 1961; Benson, 1962; Riggs, 1973; Smith, 1992).
The IFM as further developed in the late 1980s is a vast improvement over past
methodologies; it uses regional average values of LCV and LSK with the at-site mean
to fit a GEV or an alternative distribution (Hosking and Wallis, 1997). Hosking and Wallis
(1993) demonstrate that this approach is efficient when the region is relatively
homogeneous and record lengths are relatively short. Alternatively, a regional GEV shape
parameter can be adopted based upon a regional average (Stedinger and Lu, 1995; Hosking
and Wallis, 1997 and Fill and Stedinger, 1998). This approach is more attractive than the
typical index frequency method when record lengths and regional heterogeneity increase,
and at-site data are sufficient to define the at-site LCV but not long enough to resolve the
shape parameter (LSK). The efficiency of using either the regional value or the at-site
estimator clearly depends on the sample size. An obvious and natural solution is to
combine the at-site and the regional estimators based on the precision of each estimator.
This approach has been proposed before; for instance, Bulletin 17B (IACWD, 1982)
recommends that a regional estimate of the shape parameter of the LP3 distribution be
combined with the at-site estimate, to obtain a more precise estimator (see for example,
Griffis and Stedinger, 2004). Similarly, Fill and Stedinger (1998) have proposed such an
extension to the original IFM.
More recent studies in Australia, (Bates et al., 1998; Rahman et al., 1999a), assigned
ungauged catchments to a particular homogenous group identified (through the use of L
moments; Hosking and Wallis, 1993) on the basis of catchment and climatic
characteristics as opposed to geographical proximity. However the deficiencies in this
approach were evident in that it needed 12 catchment/climatic descriptors to be used.
Therefore its practical use is somewhat limited by its complexity and the time needed to
gather the relevant data. On an international scale, Fill and Stedinger (1998) demonstrated
that the IFM can provide improved quantile estimation when different sources of error are
reduced by explicitly accounting for the varying sampling errors and inter-site correlation
from site to site in a region.
The use of the IFM in Australia, however, is undermined by the great heterogeneity among
Australian catchments, and any results obtained would be subject to substantial error.
Therefore a method is needed where the assumption of homogeneity may be reduced by
capturing the spatial variability from site to site within a region. This provides ground and
motivation to explore the quantile regression technique (QRT) for design flood estimation
in Australian conditions.
2.4.2 STATION YEAR METHOD
In the station-year method, the standardised Q values of all the sites in the region are treated as if they form a single
random sample of size n from a common parent population. The pooled standardised data
are then fitted to a suitable distribution, and ZT values are calculated. Since this method
ignores inter-site dependence, it may lead to greater uncertainty and bias, especially at large
return periods (Cunnane, 1988 and Nandakumar et al., 1997). The issue of inter-site
dependence (see section 2.2.3) or spatial dependence is an issue that has been receiving a
lot of attention in the field of flood and rainfall estimation. The main issues being
researched are ways of (i) estimating spatial dependence based on the theory of max-stable
spatial processes (e.g. Cooley et al., 2006, 2010; Vannitsem and Naveau, 2007 and Vrac et
al., 2007 and Reich et al., 2012) and (ii) incorporating spatial dependence to estimate the
number of independent sites (Ne) in a region (e.g. Buishand, 1984; Hosking and Wallis,
1988; Dales and Reed, 1989; Nandakumar et al., 1997; Stewart et al., 1999; Nandakumar et
al., 2000; Guse et al., 2009 and Svensson and Jones, 2010).
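The pooling step of the station-year approach can be sketched as follows. The data are illustrative, an empirical plotting-position quantile stands in for a fitted distribution, and inter-site dependence is ignored, which is exactly the limitation noted above.

```python
import numpy as np

def pooled_growth_factor(sites, ari):
    """Empirical Z_T from pooled standardised annual maxima (station-year pooling)."""
    # Standardise each site's annual maxima by its at-site mean, then pool.
    pooled = np.concatenate([np.asarray(s, dtype=float) / np.mean(s) for s in sites])
    # Non-exceedance probability for an ARI of T years is 1 - 1/T.
    return np.quantile(np.sort(pooled), 1.0 - 1.0 / ari)

# Three hypothetical sites with five years of record each (15 station-years).
sites = [[80, 120, 95, 200, 150], [30, 55, 42, 60, 90], [300, 410, 280, 500, 350]]
z10 = pooled_growth_factor(sites, ari=10)
```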
2.4.3 BAYESIAN ANALYSIS AND MONTE CARLO METHODS
Bayesian inference is another alternative to classical estimation methods such as the
method of moments and maximum likelihood. In Bayesian inference, the parameters of the
likelihood are treated as random variables whose uncertainty is described by a probability
density function (Reis and Stedinger, 2005). The information in the data can be
represented by the entire likelihood function and also prior knowledge such as a numerical
estimate of the degree of belief or a researcher’s experience in a hypothesis before evidence
has been observed. The method then calculates a numerical estimate of the degree of belief
in the hypothesis after evidence has been observed. In flood frequency analysis, parameter
estimation is made through the posterior distribution, which is calculated using Bayes’
theorem: the probability that a frequency function P has parameters θ, given that
we have observed the realisations d (defined by our data, any historical information, and
limits to be placed on the analysis and threshold exceedances).
Bayes' theorem is given by Equation 2.3:
P(θ|d) = P(d|θ)P(θ) / P(d) (2.3)
where P(θ|d) is the conditional probability of θ given d (it is also called the posterior
probability because it is derived from, or depends upon, the specified value of d) and is the
result we are interested in. P(θ) is the prior probability or marginal probability of θ (‘prior’
in the sense that it does not take into account any information about d). P(d|θ) is the
conditional probability of d given θ, and is defined by choosing a distribution and
depending on the availability of historical data. P(d) is the marginal probability of d, and
acts as a normalising constant. Since complex models cannot be processed in closed form
in a Bayesian analysis, namely because of the extreme difficulty in computing the
normalisation factor P(d), simulation-based techniques such as the Markov chain Monte
Carlo (MCMC) approach, which includes the Metropolis-Hastings algorithm, are used in this analysis. More
details about the Metropolis-Hastings algorithm can be found in Geman and Geman (1984),
Casella and George (1992), Metropolis et al., (1953) and Hastings (1970).
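A minimal sketch of the Metropolis-Hastings idea: because the normalising constant P(d) cancels in the acceptance ratio, only the unnormalised posterior is needed. The likelihood (normal with unit variance), the flat prior, the proposal scale and the data below are illustrative, not those used in this thesis.

```python
import numpy as np

def metropolis(log_post, start, steps=5000, scale=0.5, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalised log posterior."""
    rng = np.random.default_rng(seed)
    chain = [start]
    lp = log_post(start)
    for _ in range(steps):
        prop = chain[-1] + rng.normal(0.0, scale)  # symmetric proposal
        lp_prop = log_post(prop)
        # Accept with probability min(1, P(prop|d)/P(current|d)); P(d) cancels.
        if np.log(rng.uniform()) < lp_prop - lp:
            chain.append(prop)
            lp = lp_prop
        else:
            chain.append(chain[-1])
    return np.array(chain)

data = np.array([4.2, 5.1, 4.8, 5.5, 4.9])              # e.g. log annual maxima
log_post = lambda mu: -0.5 * np.sum((data - mu) ** 2)   # flat prior, sigma = 1
chain = metropolis(log_post, start=0.0)
posterior_mean = chain[2000:].mean()                    # discard burn-in
```

With a flat prior and unit-variance normal likelihood the posterior mean should sit near the sample mean of the data.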
The use of Bayes’ theorem for combining prior and sample flood information was
introduced by Bernier (1967). Pericchi and Rodriguez-Iturbe (1983) discussed some of the
problems associated with Bayesian model choices in hydrology. They also discussed the
use of prior information and tentative alternatives for improvements in Bayesian
hydrological analysis. Ashkanasy (1985) advocated that the use of Bayesian methods
would result in more reliable and credible flood frequency estimates. Bayesian methods in
flood frequency analysis have since then been adopted by many researchers (e.g. Wood and
Rodriguez-Iturbe 1975; Kuczera, 1982a, 1983a, b, 1999; Fortin et al. 1997; Kuczera and
Parent, 1998; Reis and Stedinger, 2005; Reis et al., 2005; Micevski and Kuczera, 2009;
Haddad et al., 2010b, 2012 and Haddad and Rahman, 2012 – the last 3 papers are based on
the research in this thesis and can be seen in Appendix A).
2.4.4 PROBABILISTIC RATIONAL METHOD AS USED IN AUSTRALIA
The rational method was introduced by Mulvaney (1851) and has been widely regarded as
a deterministic representation of the flood generated from an individual storm. However,
the rational method recommended in Australian Rainfall and Runoff (ARR) (I. E. Aust.,
1987; Pilgrim and Cordery, 1993), is based on a probabilistic approach for use in
estimating design floods. This probabilistic rational method (PRM) is represented by:
QT = 0.278 CT Itc,T A (2.4)
where QT is the peak flow rate (m³/s) for an ARI of T years; CT is the runoff coefficient
(dimensionless) for an ARI of T years; Itc,T is the average rainfall intensity (mm/h) for a
design duration equal to the time of concentration tc (hours) and an ARI of T years; and A
is the catchment area (km²).
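A minimal worked example of Equation 2.4; the runoff coefficient, rainfall intensity and area values are illustrative, not ARR map or IFD readings.

```python
def prm_peak_flow(c_t, i_mm_per_h, area_km2):
    """Q_T = 0.278 * C_T * I_tc,T * A (Equation 2.4); returns Q_T in m^3/s."""
    return 0.278 * c_t * i_mm_per_h * area_km2

# Hypothetical 100-year design: C_100 = 0.6, I_tc,100 = 45 mm/h, A = 25 km^2.
q100 = prm_peak_flow(0.6, 45.0, 25.0)  # approximately 187.65 m^3/s
```

The factor 0.278 converts mm/h over km² into m³/s, which is why the equation is dimensionally consistent.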
The method may be regarded as a regional model, with design rainfall intensity Itc,T and
catchment area A as independent variables. The runoff coefficient CT is a factor which
lumps the effects of climatic and physical characteristics, other than catchment area and
rainfall intensity. It is noteworthy that in ARR 1987 the values of CT were estimated using
conventional moment estimates from flow records of limited lengths (e.g. some sites had
only 10 years of records). Since conventional moment estimates are largely affected by
sampling variability and extremes in the data, a higher degree of uncertainty in quantile
estimation is likely to arise due to CT reported in the ARR 1987. The mapping and use of
runoff coefficients are based on the assumption of geographical contiguity, an assumption
that is unlikely to be satisfied. Pegram (2002) and French (2002) also discussed the
strengths and weaknesses of the PRM.
Rahman and Hollerbach (2003) investigated the physical significance of runoff coefficients
and assessed the extent of uncertainty of design flood estimates obtained by the PRM. By
following the method of derivation in ARR, runoff coefficients were estimated for 104
gauged catchments in South east Australia. The mapping of these C10 coefficients onto a
suitable map of the area indicated that C10 coefficients show little spatial coherence. The C
coefficients are mapped according to the position of the gauging station and some
interpolation is then required for areas where there is no data so that the contours can be
developed. The error introduced into the contours is through the interpolation technique;
this is due to the fact that some regions will be exposed to greater spatial changes in
physical topography and other factors which directly affect the C10 coefficients. In a very
similar fashion, Rahman and Hollerbach (2003) also stated that while nearby catchments
show similar meteorological characteristics, they may possess quite dissimilar physical
characteristics, which clearly indicates that the method of simple linear interpolation over a
geographical space on the map of C10 in ARR (I. E Aust., 1987) has little validity.
More recently, Rahman et al. (2009, 2011a, b; 2011b is given in Appendix A) conducted a
study comparing the PRM to the GLSR based QRT using 107 catchments in the state of
NSW, Australia. The comparison was undertaken using a leave-one-out and split-sample
validation approach examining specific features of each RFFA method. The conclusions
that were drawn from this study were that the QRT-GLSR outperformed the PRM based on
a range of evaluation statistics. Importantly, it was found that neither the PRM nor the
QRT-GLSR performed poorly for the smaller catchments used in the study. Overall, the QRT-GLSR
was advantageous over the PRM in that no assumptions are needed regarding runoff
coefficients. The QRT-GLSR also explicitly differentiates between
sampling and model error thus allowing flexibility for further uncertainty analysis, whereas
the PRM lacks scope for further development.
2.5 QUANTILE AND PARAMETER REGRESSION TECHNIQUES
2.5.1 INTRODUCTION
The United States Geological Survey (USGS) proposed a QRT where a large number of
gauged catchments are selected from a region and flood quantiles are estimated from
recorded streamflow data, which are then regressed against catchment variables that are
most likely to govern the flood generation process. Benson (1962) suggested that T-year
flood peak discharges could be estimated directly using catchment characteristics data by
multiple regression analysis.
The QRT can be expressed as follows:
QT = aB^b C^c D^d … (2.5)

where B, C, D, … are catchment characteristic variables, QT is the flood magnitude
with a T year ARI (the flood quantile), and a, b, c, … are regression coefficients.
It has been noted that the method can give design flood estimates that do not vary smoothly
with ARI; however, hydrological judgement can be exercised in such situations and the
flood frequency curves adjusted to increase smoothly with T. There have been
various techniques and many applications of regression models that have been adopted for
hydrological regression. Most of these methods are derived from the methodology set out
by the USGS as described above.
As an alternative to the QRT, the parameters of a probability distribution can be regressed
against the explanatory variables (Tasker and Stedinger, 1989; Madsen et al., 2002). In the
case of the LP3 distribution, regression equations can be developed for the first three
moments i.e. the mean, standard deviation and skewness of the logarithms of annual
maximum flood series. For an ungauged catchment, these equations can then be used to
predict the mean, standard deviation and skewness to fit an LP3 distribution. This method
here is referred to as ‘parameter regression technique’ (PRT). However, there has been
little research on the applicability of the PRT as compared to the QRT in RFFA.
Regionalising the parameters of a probability distribution (which is referred to as PRT in
this study) also offers three significant advantages over the QRT:
1. It ensures flood quantiles increase smoothly with increasing ARI, an outcome that may
not always be achieved with the QRT. The flood quantiles obtained from the PRT may
also be used to determine whether the flood quantiles derived from the QRT provides
similar and consistent results.
2. It is straightforward to combine any at-site flood information with regional estimates
using the approach described by Micevski and Kuczera (2009) to produce more
accurate quantile estimates; and
3. It permits quantiles to be estimated for any ARI within the limits of the developed
RFFA method.
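As a sketch of advantage 1, suppose the PRT regression equations have returned the mean, standard deviation and skewness of the log10 annual maximum flows at an ungauged catchment; the LP3 quantile for any ARI then follows from a Pearson III frequency factor. The numerical values below are illustrative assumptions, and the Wilson-Hilferty approximation stands in for an exact Pearson III inverse CDF:

```python
from statistics import NormalDist

def lp3_quantile(mean_log, std_log, skew_log, T):
    """LP3 quantile for ARI T, using the Wilson-Hilferty approximation
    of the Pearson III frequency factor (log10 space)."""
    p = 1.0 - 1.0 / T                    # annual non-exceedance probability
    z = NormalDist().inv_cdf(p)          # standard normal quantile
    g = skew_log
    if abs(g) < 1e-6:
        k = z
    else:
        k = (2.0 / g) * ((1.0 + g * z / 6.0 - g**2 / 36.0) ** 3 - 1.0)
    return 10 ** (mean_log + k * std_log)

# Hypothetical PRT outputs for an ungauged catchment (log10 space)
m, s, g = 2.1, 0.35, -0.2
quantiles = [lp3_quantile(m, s, g, T) for T in (2, 5, 10, 20, 50, 100)]
print([round(q) for q in quantiles])
```

Because the frequency factor is monotone in the ARI, the quantiles necessarily increase smoothly with T, which is the first advantage listed above.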
Cunnane (1988) also reviewed methods that use regional information in the estimation of
hydrologic statistics. One versatile approach employs regional information to derive a
relationship between streamflow statistics and catchment characteristics using regional
regression analysis such as the QRT or PRT. Such regional regression methods have been
widely used to estimate hydrologic statistics at ungauged sites (Benson and Matalas, 1967;
Matalas and Gilroy, 1968; Thomas and Benson, 1970; Moss and Karlinger, 1974; Jennings
et al., 1993), and to increase the precision of the statistic of interest at sites with short
record lengths by adding regional information (Kuczera, 1982a; Stedinger, 1983; Madsen
and Rosbjerg, 1997; Fill and Stedinger, 1998; Martins and Stedinger, 2000; Reis and
Stedinger, 2003). Regional regression models such as the QRT or PRT aim to explain
spatial variability of the hydrologic statistic by relating it to catchment variables, such as
catchment area, mainstream slope, mean annual rainfall and percentage of forest cover.
The OLSR estimator has traditionally been used by hydrologists to estimate the regression
coefficients in regional hydrological models. However, for the OLSR model to be
statistically efficient and robust, the annual maximum flood series in the region must be
uncorrelated, all sites in the region must have equal record lengths, and all estimates of the
T-year events must have equal variance. Since the annual maximum flow data in a region do not
generally satisfy these criteria, the assumption that the model residual errors in OLSR are
homoscedastic is violated and the OLSR approach can provide very distorted estimates of
the model’s predictive precision (model error) and the precision with which the regression
model parameters are being estimated (Stedinger and Tasker, 1985).
To overcome the above problems in OLSR, Stedinger and Tasker (1985) proposed the
GLSR procedure which can result in remarkable improvements in the precision with which
the parameters of regional hydrologic regression models can be estimated, in particular
when the record length varies widely from site to site. In the GLSR model, the assumptions
of equal variance of the T-year events and zero cross-correlation for concurrent flows are
relaxed.
The GLSR procedure as described by Stedinger and Tasker (1985) and Tasker and
Stedinger (1989) requires an estimate of the covariance matrix of residual errors, Λ̂(Y), for
the hydrologic statistic of interest.
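Given an estimate Λ of this covariance matrix, the GLSR estimator is ordinary generalised least squares, β̂ = (XᵀΛ⁻¹X)⁻¹XᵀΛ⁻¹y. A minimal sketch with synthetic data (a diagonal Λ is assumed for brevity, which is effectively the WLSR special case; a full Λ with off-diagonal terms for cross-correlated concurrent flows is handled by exactly the same formula):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_pred = 30, 2

# Design matrix: intercept plus two (log) catchment characteristics
X = np.column_stack([np.ones(n_sites), rng.normal(size=(n_sites, n_pred))])
beta_true = np.array([1.0, 0.6, -0.3])

# Covariance of residual errors: a common model error variance plus
# site-specific sampling error variances (shorter records -> larger variance)
model_err_var = 0.05
sampling_var = rng.uniform(0.01, 0.3, n_sites)
Lam = np.diag(model_err_var + sampling_var)

y = X @ beta_true + rng.multivariate_normal(np.zeros(n_sites), Lam)

# GLS estimator: beta = (X' Lam^-1 X)^-1 X' Lam^-1 y
Lam_inv = np.linalg.inv(Lam)
beta_gls = np.linalg.solve(X.T @ Lam_inv @ X, X.T @ Lam_inv @ y)

# OLS for comparison (ignores the unequal variances)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_gls, beta_ols)
```

The GLS step down-weights sites with short records (large sampling variance), which is the source of the precision gains Stedinger and Tasker (1985) report.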
2.5.2 GENERALISED LEAST SQUARES AND WEIGHTED LEAST SQUARES
REGRESSION
As discussed above, the coefficients of regional regression models have generally been
estimated using the OLSR procedure. However, regionalisation using hydrological data
violates the assumption that the residual errors associated with the individual observations
are homoscedastic and independently distributed (Stedinger and Tasker, 1985). In the case
of hydrological data, variations in streamflow record length and cross–correlation among
concurrent flows result in estimates of the T-year events which vary in precision. Matalas
and Benson (1961), Matalas and Gilroy (1968), Hardison (1971), Moss and Karlinger
(1974) and Tasker and Moss (1979) have examined the statistical properties of the OLSR
procedures.
As shown in the studies cited above, OLSR estimates of the standard error of prediction
and the estimated parameters are generally biased under many situations. Weighted and
GLSR techniques were developed to deal with situations like those encountered in
hydrology where a regression model's residuals are heteroscedastic and perhaps cross-correlated
(Draper and Smith, 1981; Johnston, 1972). Tasker (1980) used a weighted least
squares regression (WLSR) procedure to account for unequal record lengths. Marin (1983)
and Kuczera (1982a, b, 1983a) developed an empirical Bayesian methodology, which can
deal with these issues as well.
An obstacle to the use of WLSR and GLSR procedures with hydrological data is the need
to provide an estimate of the covariance matrix of residual errors; this covariance matrix is
a function of the precision with which the true model can predict values of the streamflow
statistics of concern as well as the sampling error in the available estimates of that statistic.
The discussions and examples in the works by Tasker (1980) and Kuczera (1983b)
illustrate the difficulties associated with the estimation of this matrix.
Stedinger and Tasker (1985), in a Monte Carlo simulation with synthetically generated
flow sequences, presented a comparison of the performance of the OLSR procedure with
that of the GLSR one. In situations where the available streamflow records at gauged sites
are of widely varying lengths and concurrent flows at different sites are cross-correlated,
the GLSR procedure provided more accurate parameter estimates, better estimates of the
accuracy with which the regression model coefficients are estimated, and almost unbiased
estimates of the variance of the underlying regression model residuals. A simpler WLSR
procedure neglects the cross-correlations among concurrent flows; the WLSR algorithm
has been shown to do as well as the GLSR procedure when the cross-correlations among
concurrent flows are relatively modest.
2.5.3 PREVIOUS APPLICATION OF GENERALISED LEAST SQUARES AND
BAYESIAN GENERALISED LEAST SQUARES REGRESSION
The GLSR procedure introduced by Stedinger and Tasker (1985, 1986) has been
extensively used nationally and internationally to estimate the coefficients of regional
regression models of hydrologic statistics (WMO, 1994; Robson and Reed, 1999). Tasker
et al. (1986, 1996), Tasker and Stedinger (1987), Rosbjerg and Madsen (1994), Pandey and
Nguyen (1999), Kjeldsen and Rosbjerg (2002), Feaster and Tasker (2002), Law and Tasker
(2003), Griffis and Stedinger (2007), Rosbjerg (2007), Kjeldsen and Jones (2009) and
Haddad et al. (2011a) have all applied GLSR for the regionalisation of flood quantiles. Madsen and
Rosbjerg (1997) employed the GLSR procedure to obtain regional estimates of the
parameters (i.e. index and LCV) of a GPA distribution employed as a prior distribution in
an empirical Bayesian procedure for flood frequency analysis in New Zealand. Tasker and
Driver (1988) developed regression equations using GLSR to predict mean loads for many
chemical constituents at ungauged sites. GLSR has also been used as the basis of
hydrologic network design (Moss and Tasker, 1991).
Griffis and Stedinger (2007) examined the GLSR method in more detail. Previous studies
by the US Geological Survey using the LP3 distribution had neglected the impact of
uncertainty in the weighted skew on quantile estimation. The needed relationship was
developed in their paper and its use was illustrated in a regional flood study with 162
sites from South Carolina; the results were both accurate and hydrologically reasonable.
Their paper also introduced new statistical diagnostic metrics, such as a condition number to
check for multicollinearity, a new pseudo R2 appropriate for use with GLSR, and two error
variance ratios. Micevski and Kuczera (2009) presented a general Bayesian approach for
inferring the GLSR regional regression model and for pooling with any available site
information to obtain more accurate flood quantiles for a particular site in NSW, Australia.
Tasker (1989), Vogel and Kroll (1990), Ludwig and Tasker (1993), Kroll and Stedinger
(1999) and Hewa et al. (2003) have used GLSR for regionalisation of low-flow statistics.
Madsen et al. (1995), Madsen et al. (2002 and 2009) employed the GLSR procedure in the
regional analysis of extreme rainfall in Denmark, while Overeem et al. (2009) used a
GLSR procedure to establish the correlation structure and infer uncertainty between
parameters of the GEV distribution for extreme rainfalls in the Netherlands. Haddad et al.
(2011a) presented a GLSR procedure that regionalises the parameters of the GEV
distribution for design rainfall estimation in Australia.
Further examples are given below that address regional models of the log-space skewness
coefficient, standard deviation and mean. The current methodology for flood frequency
analysis in Australia and the United States consists of fitting a LP3 distribution to the
gauged data by estimating the mean, standard deviation, and skew of the logarithms of the
annual maximum flows. The problem is that the at-site skewness (shape parameter)
estimator is highly variable with typical record lengths often found in Australian data
(average record length of 33 years, Rahman et al., 2011). In order to improve the precision
of the estimator and to reduce uncertainty, Bulletin 17B recommends combining the at-site
estimator with a regional estimate of the skew coefficient (IACWD, 1982; McCuen, 1979;
Tung and Mays, 1981a, b; and McCuen and Hromadka, 1988). Tasker and Stedinger
(1986) applied WLSR to derive a generalized skewness estimator for the Illinois River
basin. They were unable to use GLSR because they did not know how to describe the
correlations among skewness estimators. Martins and Stedinger (2002a) developed
simple equations for the cross-correlation among estimators of skewness (and of the shape
parameter κ of the GEV and GPA distributions) as a function of the cross-correlation of the flood flows
themselves. Martins and Stedinger (2002a) employed those equations to implement a
GLSR model for regional skew estimation.
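The Bulletin 17B weighting referred to above combines the at-site and regional skew estimates in inverse proportion to their mean square errors (IACWD, 1982). A small sketch (the numerical values are illustrative assumptions):

```python
def weighted_skew(g_at, mse_at, g_reg, mse_reg):
    """Bulletin 17B style MSE-weighted combination of the at-site and
    regional log-space skew estimates (IACWD, 1982)."""
    return (mse_reg * g_at + mse_at * g_reg) / (mse_at + mse_reg)

# Hypothetical values: a short record gives an uncertain at-site skew,
# so the regional estimate receives the larger weight
g_w = weighted_skew(g_at=0.25, mse_at=0.30, g_reg=-0.10, mse_reg=0.15)
print(round(g_w, 3))
```

The weighted estimate always lies between the two inputs, pulled toward whichever has the smaller mean square error.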
Reis et al. (2003 and 2005) introduced a Bayesian approach to parameter estimation for the
GLSR regional regression model developed by Stedinger and Tasker (1985) for
hydrological analysis. The results presented in Reis et al. (2005) show that for cases in
which the model error variance is small compared to sampling error of the at–site
estimates, which is often the case for regionalisation of a shape parameter, the Bayesian
estimator provides a more reasonable estimate of the model error variance than the method
of moments (MOM) and maximum likelihood (ML) estimators. This paper also presented
regression statistics for WLSR and GLSR models including pseudo analysis of variance, a
pseudo R2, error variance ratio (EVR) and variance inflation ratio (VIR), leverage and
influence. The regression procedure was illustrated with two examples of regionalisation.
Results obtained from OLSR, WLSR and GLSR procedures using the Bayesian and MOM
model error variance estimators were compared. Gruber et al. (2007) and Gruber and
Stedinger (2008) further developed the Bayesian GLSR (BGLSR) framework first presented
by Reis et al. (2005). This operational regression methodology is used in the estimation of
regional shape parameters as well as flood quantiles. The focus of their study was also to
implement the BGLSR framework in conjunction with the diagnostic statistics
presented by Tasker and Stedinger (1989), Reis et al. (2005), Reis (2005) and Griffis and
Stedinger (2007). The new diagnostic statistics for use with BGLSR provided a
comprehensive examination of the developed regression models. More recently, there have
also been further developments in the BGLSR area for log-space skew estimation for
the non-desert regions of California (Parrett et al., 2010). Lamontagne et al. (2011) and
Veilleux et al. (2011) also used BGLSR for the estimation of log-space skews for annual
maximum rainfall flood volumes in the Central Valley and surrounding areas of
California.
2.6 FIXED REGIONS AND THE REGION OF INFLUENCE IN REGIONAL
FLOOD FREQUENCY ANALYSIS
2.6.1 FORMATION OF REGIONS
In regional flood frequency analysis, regions have often been defined based on
state/political boundaries. In ARR 1987, regional flood estimation methods were developed
for various Australian states based on fixed regions. The problem with this type of fixed
regions is that at state/regional boundaries, two different methods can provide quite
different flood estimates. To avoid this problem, regions have also been identified in
catchment characteristics data space using cluster analysis (Acreman and Sinclair, 1986),
Andrews curves (Nathan and McMahon, 1990) and various other multivariate statistical
techniques. One limitation with this type of region is that a correct method of assigning an
ungauged catchment to a ‘homogeneous’ region needs to be formulated, which is often
problematic. If the ungauged catchment is assigned to the wrong region/group, the resulting
flood estimation is associated with a high degree of error.
2.6.2 REGION OF INFLUENCE VS FLEXIBLE REGION
Since hydrological characteristics do not change abruptly across state boundaries, it is
desirable to avoid fixed boundaries. Regionalisation without fixed regions was performed
by Acreman and Wiltshire (1987) and Acreman (1987); based on their work, the region
of influence (ROI) approach was introduced by Burn (1990a, 1990b), in which each site of
interest (i.e. the catchment where flood quantiles are to be estimated) has its own region.
way the defined regions may overlap and gauged sites can be part of more than one ROI for
different sites of interest. The great advantage of the ROI approach is that it is not bounded
by geographic regions often based on political boundaries such as state lines, and it thus
avoids discontinuities at the boundaries of regions.
The ROI for the site of interest is formed out of stations in close proximity, with proximity
measured using a weighted Euclidean distance in an M-dimensional attribute space. The
distance metric is defined by:
Di,j = [ Σ_{m=1}^{M} Wm (Xm,i − Xm,j)^2 ]^{1/2}          (2.6)
with Di,j as the weighted Euclidean distance between site i and j, M is the number of
attributes included in the distance measure, and the X terms denote standardised values for
attribute m at site i and site j, and Wm is a weight applied to attribute m reflecting the
relative importance of the attribute. Standardisation of attributes removes units and avoids
introduction of bias due to scaling differences of the attributes. In a range of studies (Burn,
1990a; Zrinji and Burn, 1996; Tasker et al., 1996; Merz and Blöschl, 2005; Eng et al., 2005
and Eng et al., 2007a) the attributes were standardised by the standard deviation over the
entire dataset of attribute m. Attributes can arise from two sources, either based on physical
features, such as catchment area, stream length, channel slope, stream density, or soil type,
or statistical measures of climate and flow data, such as the coefficient of variation.
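Equation (2.6) is straightforward to implement: standardise each attribute by its standard deviation over the whole dataset, then rank sites by their weighted distance from the site of interest. A minimal sketch with synthetic attributes (the attribute scales, weights and ROI size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sites, M = 20, 3

# Synthetic attributes, e.g. log(area), mean annual rainfall, channel slope
attrs = rng.normal(size=(n_sites, M)) * np.array([1.5, 300.0, 0.02])

# Standardise by the standard deviation of each attribute over the dataset
X = attrs / attrs.std(axis=0)

W = np.array([1.0, 1.0, 0.5])  # relative importance weights (assumed)

def roi_distances(i):
    """Weighted Euclidean distance (Eq. 2.6) from site i to every site."""
    diff = X - X[i]
    return np.sqrt((W * diff**2).sum(axis=1))

d = roi_distances(0)
region = np.argsort(d)[:6]   # the 6 nearest sites form the ROI for site 0
print(region)
```

With a fixed ROI size n this reproduces the Tasker et al. (1996) style selection; the site of interest itself always appears at distance zero.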
Since the inception of the ROI procedure, it has been found that the ROI can result in improved
flood quantile estimates in terms of root mean square error and that it offers the
flexibility of variable regions (Zrinji and Burn, 1996). Zrinji and Burn (1996) went on to
refine the initial ROI approach into a hierarchical ROI approach, which was found to
perform better for the estimation of higher-order moments (i.e. skewness), the case where
more sites are needed to form a region; the hierarchical approach also improved flood
estimates in the extreme range. Tasker et al.
(1996) compared five different methods for developing regional regression models to
estimate flood quantiles at ungauged sites in Arkansas, United States. The methods looked
at traditional flood estimation regression approaches, multivariate techniques of cluster and
discriminant analysis and a ROI approach based on geographical and catchment attribute
space where the n gauging sites with the smallest distance made up the ROI for site i. The
study concluded that the ROI approach (based on catchment attributes space) outperformed
the other methods based on the lowest root mean square error.
Eng et al. (2005) used different ROI approaches for estimating the 50-year ARI flood
quantile at ungauged sites in a case study for the Gulf Atlantic Rolling Plains of the
southeastern United States. OLSR was used to regress flood statistics against catchment
characteristics for each ungauged site based on data from ROI containing the n closest
gauging sites in both geographical (GROI) and catchment attributes space (CROI). Model
performance was based on the prediction errors from independent testing. From this
testing, it was shown that, of the two ROI approaches using the n closest gauging sites,
the one based on geographical distance was better than the one using a distance measure
in catchment attributes space; that is, GROI produced lower errors than CROI.
Merz and Blöschl (2005) examined the predictive performance of several flood
regionalisation methods. They performed the assessment using a jackknife comparison of
at-site estimated regionalised flood quantiles for 575 Austrian catchments. The ROI
methods that used only catchment attributes performed relatively poorly compared with the
methods that used geographical proximity. The ROI used in this study was then combined with
multiple regression. Merz and Blöschl (2005) were able to demonstrate that when spatial
dependency was incorporated, the ROI showed smaller random errors.
Eng et al. (2007a) proposed a hybrid ROI (HROI) which combined the GROI and CROI in
a GLSR framework. They applied this method to 1,091 catchments in the southeastern part
of the United States to estimate the 50-year ARI flood quantile. Their study was able to
show that the HROI yielded smaller root mean square estimation errors while also
producing fewer extreme errors often found in either GROI or CROI. From this study it
was concluded that for the 50-year ARI flood quantile, the similarity with respect to
catchment attributes was important, however it was incomplete and that the consideration
of the geographical proximity of the sites provided a useful surrogate for characteristics
that were not included in the analysis. Eng et al. (2007b) went on to also present an
enhanced GLSR and ROI framework that is based on a leverage-guided ROI. This
procedure used two newly defined ROI leverage and influence metrics. They applied their
method to 996 catchments in the southeastern part of the United States. This new leverage-
guided ROI regression provided improvements in terms of lower root mean square errors
while also eliminating all the influential observations.
Gaál et al. (2008) also presented a number of different regional approaches to regional
frequency analysis utilising L-moments and the GEV distribution with the main focus on
the ROI approach for modelling heavy rainfall amounts in Slovakia. This study used
various pooling schemes using different alternatives of site similarity (pooling groups
defined according to climatological characteristics and geographical proximity of sites,
respectively) and pooled weighting factors. The performance of the ROI methods relative
to at-site and other conventional regional methods was assessed through Monte Carlo
simulation studies for annual maximum rainfall series of 1- and 5-day durations. The
results showed that all the frequency models based on the ROI produced growth curves that
were superior to at-site and conventional regional estimates for most of the sites studied.

The National Committee on Water Engineering intends to test the applicability of the
Bayesian GLSR method for Australian catchments which may form the basis of the
revision of the regional flood frequency methods in ARR (Project 5 Regional Flood
Methods). While both the ROI and GLSR have been applied before in a QRT framework
(see Eng et al., 2007a,b), there has been no comprehensive comparison between ROI and
fixed regions in a BGLSR framework. Moreover, there has been no solid comparison
between the estimation of quantiles and the parameters of probability distributions in a ROI
framework.
This thesis, as stated above, uses the Bayesian approach to the analysis of a GLSR model for
hydrologic statistics (Reis et al., 2005). This relatively new approach is expanded on to
allow computation of the posterior distributions of the parameters and quantiles of the LP3
distribution and the model error variance using a quasi-analytic approach. The Bayesian
approach (Reis et al., 2005) provides both a measure of the precision of the model error
variance that the traditional GLSR lacks and a more reasonable description of the possible
values of the model error variance in cases where the model error variance is smaller
compared to the sampling errors.
The ROI method used in this thesis improves on current ROI approaches (e.g. Tasker et
al., 1996) in that it seeks to minimise the regression model's predictive error variance
rather than selecting or assuming a fixed number of sites that minimise a distance metric.
More details regarding the application of this method are provided in Chapter 3.
2.7 MODEL VALIDATION IN HYDROLOGICAL REGRESSION
ANALYSIS
In multiple linear regression analysis, one must resolve which set of predictor variables is
best suited for inclusion in the regression equation without overfitting the model, and
which of the many candidate models is the most parsimonious for making the most reliable
predictions in the ungauged catchment case; the addition of unnecessary predictor
variables often leads to weaker models (e.g. producing greater uncertainty).
Validation is generally used to assess a model’s performance in hydrologic regression
analysis. In the validation approach, a fixed percentage of the data (e.g. 10%, 20%) is left
out while building the model, and then the developed model is tested on the left out data,
which is not used in the model building (i.e. validation data set). The validation procedure
has some appealing and important properties, e.g. it assists to select an appropriate model
according to its prediction ability, while at the same time evaluating the prediction ability
of the model for ungauged catchments.
2.7.1 HISTORY OF MODEL VALIDATION
During the last twenty years, different validation methods have been widely used in
fields of science such as chemometrics (Faber and Kowalski, 1997; Song Xu et al., 2005)
and econometrics (Racine, 2000); examples include the selection of a model in both
univariate and multivariate calibrations using real and simulated data sets. Song Xu and
Zeng Liang (2001) and Song Xu et al. (2005) provided a
detailed study of leave-one-out (LOO) vs. Monte Carlo cross validation (MCCV) in
multivariate calibration and quantitative structure-activity relationship research. The history
of validation methods was summarised by Stone (1974) and Michaelsen (1987). Mosteller
and Tukey (1977) also presented a good introduction to validation methods. Efron (1983)
and Bunke and Droge (1984) described the statistical behaviour of different validation
methods. In classical statistical literature, validation is most often referred to as LOO. In
LOO, one data point is left out while building a regression model (or other form of model)
and then the model is tested on the previously left out data point. The procedure is repeated
until all the data points are independently tested. Efron (1986) showed that LOO is not very
efficient in estimating prediction error. Marter and Martern (2001) pointed out that LOO
often results in overfitting and underestimates the true prediction error of the model.
An asymptotically consistent method selects the best prediction model with
probability one as the sample size tends to infinity (n → ∞). With this definition, LOO has a
smaller chance of selecting the right model; that is, the probability becomes much smaller
than one (see Shao, 1993). In hydrologic regression, often a large number of predictor
variables (e.g. catchment area, mean annual rainfall, design rainfall intensity, fraction
forest, soil indices, elevation and slope) are available; here LOO is likely to include
unnecessary predictor variables in the model (Shao, 1993). In such situations, the selected
model tends to perform well in calibration but quite poorly during prediction.
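For a regression model, the LOO procedure described above amounts to refitting the model n times, each time withholding one site and predicting it. A minimal sketch for an OLS model on synthetic data (not the thesis dataset):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.4]) + rng.normal(0.0, 0.3, n)

def loo_sse(X, y):
    """Leave-one-out: refit OLS without site i, predict site i, and
    accumulate the squared prediction errors over all n sites."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append(y[i] - X[i] @ beta)
    return np.sum(np.square(errs))

print(loo_sse(X, y) / n)  # LOO estimate of the mean square prediction error
```

Every site is used for both fitting and testing, which is why LOO is attractive for the short records typical of hydrology, despite the model-selection weaknesses noted above.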
MCCV, a form of model validation, was first introduced by Picard and Cook (1984). Shao
(1993) proved that this method is asymptotically consistent and has a greater chance than
LOO of selecting the best model with more accurate prediction ability. MCCV leaves
out a substantial portion of the sample at a time during model building and validation and repeats
the procedure many times. When compared with ordinary methods for selecting the
best predictor variables (i.e. stepwise regression employing statistics such as Mallows'
Cp or p-value hypothesis tests), MCCV may be more desirable as it evaluates the different
models according to their predictive ability using many different combinations of
validation data sets. Interestingly, MCCV has not been tested in hydrologic regression
analysis where one often deals with a very limited and scarce observed data set.
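MCCV can be sketched as repeated random splits: a substantial fraction of the sites (for example 30%) is withheld at random, the model is fitted on the remainder and tested on the withheld set, and candidate models are compared on the average validation error over many repeats. All settings below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 40
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
# Only the first two predictors matter; the last two are noise variables
y = X_full @ np.array([1.0, 0.5, -0.4, 0.0, 0.0]) + rng.normal(0.0, 0.3, n)

def mccv_mse(X, y, p=0.3, repeats=200):
    """Monte Carlo cross validation: average validation MSE over many
    random splits, each leaving out a fraction p of the sites."""
    n = len(y)
    n_val = int(round(p * n))
    errs = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        val, fit = idx[:n_val], idx[n_val:]
        beta, *_ = np.linalg.lstsq(X[fit], y[fit], rcond=None)
        errs.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return np.mean(errs)

mse_small = mccv_mse(X_full[:, :3], y)  # parsimonious model
mse_big = mccv_mse(X_full, y)           # model with two noise predictors
print(mse_small, mse_big)
```

Because a large part of the sample is withheld each time, models carrying unnecessary predictors tend to be penalised more heavily than under LOO, which reflects the consistency property proved by Shao (1993).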
2.7.2 PREVIOUS APPLICATIONS OF LEAVE-ONE-OUT VALIDATION IN
HYDROLOGY
LOO has often been adopted in hydrology mainly because of the limited sample sizes of
hydrological data: it makes the best use of the available data. There is an abundance of
literature on LOO in hydrological applications; a few examples are presented below. LOO
has found popularity in the estimation of rainfall statistics, low-flow indices and quantiles, and
flood quantile estimation. For example, Brath et al. (2003) investigated the statistical
properties of the rainfall extremes in northern central Italy; the reliability of the estimates at
ungauged sites was assessed using LOO validation. Di Baldassarre et al. (2006) used
Monte Carlo experiments and LOO validation in the estimation of uncertainty for design
rainfalls for ungauged sites in northern central Italy as well. Sun et al. (2011) used a LOO
validation for model evaluation in predicting monthly rainfall in the Daqing Mountains in
northern China.
The regionalisation of low flows has gained popularity recently and is of great importance
in hydrological studies; it is also a critical issue for the PUB initiative (i.e. Prediction in
Ungauged Basins of the International Association of Hydrological Sciences – IAHS; e.g.,
Sivapalan et al., 2003). Castiglioni et al. (2009) presented the estimation of low flow
indices in ungauged catchments in Italy by applying deterministic and geostatistical
techniques for interpolating low flow indices in physiographical space. A LOO validation
procedure was adopted to quantify the accuracy of each technique when it is applied to
ungauged catchments. Through the LOO validation the conclusion was drawn that the
geostatistical techniques outperformed the deterministic ones. In Austria, Laaha and
Blöschl (2007) presented a national low flow estimation procedure for the whole country
for both gauged and ungauged catchments. In each step of the estimation procedure, many
alternative methods were tested by LOO validation. This led to the identification of the best
performing method for low flow estimation in Austria. Canonical correlation analysis
(CCA) was used in the estimation of low-flows in Greece as shown by Tsakiris et al.
(2011). Tsakiris et al. (2011) also used LOO validation to conclude whether CCA could
reliably assist in catchment classification into sub regions and also whether partitioning the
region into two sub regions offered improvements of low flow quantile estimates through
multiple linear regression.
Many studies in the literature can also be found regarding the use of LOO for validation of
RFFA models (for example see Sankarasubramanian and Lall, 2003; Merz and Blöschl,
2005; Juraj and Ouarda, 2007; Chowdhury and Sharma, 2009; Kjeldsen, 2010; Iacobellis et
al., 2011). Merz and Blöschl (2005) examined the predictive performance of several flood
regionalisation methods for 575 Austrian catchments using a LOO comparison. Regional
flood-rainfall duration-frequency modelling at small ungauged sites was undertaken in
south-western Ontario, Canada by Juraj and Ouarda (2007). Model performance was
evaluated by a LOO procedure using evaluation statistics such as average bias and relative
root-mean square error. Chowdhury and Sharma (2009) applied a similar validation
technique which essentially resembled the LOO to predict and forecast arid river flows in
Australia. In the United Kingdom (UK), Kjeldsen (2010) used LOO in modelling the
impacts of urbanisation on flood frequency relationships. The LOO showed that the
developed adjustment factors were generally better at predicting the effects of urbanisation
on the flood frequency curve than the adjustment factors currently used in the United
Kingdom.
It can be seen from the literature review that little attention has been paid to date to the
application of MCCV in RFFA and hydrological applications, or to the examination of the
possible benefits that could be gained from this procedure. Hence, this thesis looks at three
main issues as part of its broad objective: (1) demonstrating the application of the MCCV
method in hydrological regression analysis using both OLSR and GLSR; (2) comparing
MCCV with the most commonly applied LOO validation for selecting the most
parsimonious regression model to be applied to ungauged catchments; and (3)
demonstrating the best use of the limited datasets often encountered in hydrology, which
can hinder the detailed validation of hydrological regression models.
2.8 REGIONAL FLOOD FREQUENCY FOR LARGE TO RARE FLOODS
2.8.1 BRIEF REVIEW OF LARGE FLOOD ESTIMATION AND PREVIOUS
APPLICATIONS
Estimation of floods with large to rare and even extreme return periods is of great
importance for hydrological design and risk assessment for large infrastructure. The term
'large' floods refers to floods with ARIs of 50 to 100 years (Nathan and Weinmann, 2001). Floods in the
range from 100 years ARI to the ‘credible limit of extrapolation’ (ARI in the order of 2000
years) are referred to as ‘rare’ floods, while floods from the credible limit of extrapolation
to the PMF are termed ‘extreme’ floods. Due to knowledge and data limitation and the
uncertainty involved in extrapolating beyond available data, the errors in final estimates
can be quite high. The average record length of flood data for Australian small to
medium sized catchments is about 33 years (Rahman et al., 2009). To make better use of
this information and to transfer it to ungauged catchments, regional estimation methods
are again used, as described in Section 2.4. Some studies, both in the
past and present, on an international scale have looked at the advantages and disadvantages
of different regional models for large, rare and extreme floods (Ferrari et al., 1993;
Kundzewicz et al., 1993; Katz et al., 2002; Castellarin, 2007; Castellarin et al., 2007; Vogel
et al., 2007; Van Gelder et al., 2007; Moisello, 2007; Majone et al., 2007; El Aldouni, 2008;
Laio et al., 2009; Castellarin, 2009; Calenda et al., 2009 and Gaume et al., 2010).
In Australia, the issue of large to extreme flood estimation in the past has been addressed
by some researchers (e.g. Pilgrim, 1986; Rowbottom et al., 1986; Pilgrim and Rowbottom,
1987; Stedinger et al., 1993; Nathan and Weinmann, 2001 and Haddad et al., 2010). Book
VI of Australian Rainfall and Runoff (ARR) was upgraded in 1999 with guidance for
estimation of large to probable maximum floods (PMF). The procedures outlined in
ARR1999 include flood frequency analysis and various rainfall-based methods. For flood
frequency estimates in the range of ‘rare’ floods, the use of regional information plus
paleohydrological information was suggested, and for rainfall-based methods, an annual
exceedance probability (AEP) neutral approach was recommended (Nathan and
Weinmann, 2001).
The statistics of extremes have played an important role in engineering practice for water
resources design and management. There have been recent developments in statistical
theory of extreme values that can be applied to improve the rigour of these flood estimates
and to make the estimates more physically meaningful. The development of more rigorous
statistical methodology for regional analysis of large to rare floods as well as the extensions
in Bayesian methods can help to improve and quantify uncertainty in the estimation
procedure. Although the fundamental probabilistic theory of extreme values has been well
developed for a long time (Leadbetter et al., 1983; Coles, 2001; Cooley et al., 2006; Cooley
et al., 2010), the statistical modelling of large to rare floods remains a subject of active
research. Probability weighted moments (PWM) or L-moments are more popular than the
maximum likelihood (ML) approach in the hydrology of large and more extreme events
(Katz et al., 2002). L-moments offer computational simplicity and perform very well in
small samples (Hosking, 1990; Hosking et al., 1985).
Regional analysis is another way of making use of the available information, and it
originated with the estimation of large to rare floods in mind (Dalrymple, 1960; Hosking
et al., 1985; Jothityangkoon and Sivapalan, 2003; Castellarin et al., 2005; Castellarin et al.,
2007; Douglas and Vogel, 2006; Vogel et al., 2007; Gaume et al., 2010). The basic idea is
that if a region is relatively homogeneous then the estimation of large to rare flood
quantiles at a given site may be improved by also using the larger observations at other
sites (i.e. a trade-off between space and time). Castellarin et al.
(2005) introduced an estimator of the exceedance probability associated with a regional
envelope curve (REC) for extreme flood estimation, which accounts for the impact of inter-
site correlation of annual floods. Douglas and Vogel (2006) provided a probabilistic
interpretation of the behaviour of floods of record in the United States (U.S.) for use with
the REC. Castellarin (2007) and Castellarin et al. (2007) applied the probabilistic and
multivariate probabilistic RECs to real data in Italy for extreme flood estimation. They
documented that the multivariate extension outperforms the ordinary REC and provides
flood quantile estimation at ungauged sites that is nearly as reliable as index flood
quantiles. Vogel et al.
(2007) went on to enhance the method presented by Castellarin et al. (2005) and
Castellarin et al. (2007) by introducing a general expression for the exceedance probability
of an envelope curve. A case study was implemented using historical flood series from 226
sites located across the U.S. The results overall indicated that the approach introduced by
Vogel et al. (2007) offers significant promise for the estimation of large to extreme floods
with envelope curves for heterogeneous regions. More recently, Gaume et al. (2010)
proposed an approach based on standard index regionalisation methods for extreme floods
in Slovakia and the south of France. They created larger data samples by using historical,
paleoflood or extreme flood data from ungauged catchments to reduce the uncertainties in
high return period quantile estimates for a region.
It is well known that regionalisation models based on the “index flood” method assume
some form of homogeneity in their application. In particular, it is assumed that the
probability distribution of the standardised variable, obtained by normalising the annual
maximum flows by the population mean, is the same at all catchment sites inside the
homogeneous region. In fact, the sample values of the relevant moments (the coefficient of
variation (CV) and the coefficient of skewness, if the analysis is limited to second and
third order moments) can vary over a very wide range. Hence, given the high variability
associated with these parameters of a probability distribution, the error in the derived
quantile estimates could be very large.
Recently, a new probabilistic model (PM) has been introduced (Majone and Tomirotti,
2004; Majone et al., 2007; Haddad et al., 2011b) specifically for this sort of analysis.
Majone and Tomirotti (2004) originally calibrated the PM for Italian rivers, and later
extended the method using 7300 historical series of annual maximum flows observed at
gauging stations in different geographical areas around the world. Majone et al. (2007)
applied the PM to flood data from 8,500 gauging stations across the world and found that
the method can provide quite reasonable design flood estimates for ARIs in the range of
4,000 to 9,000 years. In a study conducted by Haddad et al. (2011b) on a data set
containing 227 gauging sites from Australia (Victoria and NSW), it was found that the PM,
when coupled with GLSR, performs very well when applied to ungauged catchments and
can estimate floods of 200 to 400 years ARI with reasonable accuracy.
The PM (Majone and Tomirotti, 2004; Majone et al., 2007; Haddad et al., 2011b) is based
on the assumption that the standardised maximum values (Qmax) of the annual maximum
flood series from a large number of individual sites in a region can be pooled (after
accounting for the across-site variations in the mean and CV of annual maximum floods).
The concept is similar to the Cooperative Research Centre for Catchment Hydrology
Focussed Rainfall Growth Estimation (CRC-FORGE) method (Nandakumar et al., 1997),
where extreme design rainfall estimates are based on rainfall data pooled from a large
region of up to several hundred gauges (the concept of an expanding region). The
particular advantage of the PM is that it does not assume a constant CV across sites, as the
index flood approach does; this feature, in particular, allows the PM to pool data more
efficiently over a very large region.
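The pooling idea can be sketched in a few lines of code. The standardisation shown here, in which each site's record maximum is scaled by that site's own mean and CV of annual maxima, is one plausible choice used purely for illustration; it is not the exact LFRM formulation, which is developed later in the thesis.

```python
import numpy as np

def pooled_standardised_maxima(site_series):
    """Pool the record maxima of many sites into one regional sample.
    Each site's maximum is standardised by that site's own mean and CV
    of annual maximum floods, so sites of very different scale and
    variability can be combined. A sketch only, not the thesis' final
    LFRM standardisation."""
    pooled = []
    for q in site_series:
        q = np.asarray(q, dtype=float)
        mean = q.mean()
        cv = q.std(ddof=1) / mean
        # standardised record maximum: (Qmax/mean - 1) / CV
        pooled.append((q.max() / mean - 1.0) / cv)
    return np.array(pooled)
```

Because the standardisation is scale-free, multiplying a site's flows by a constant leaves its contribution unchanged, which is what permits pooling across catchments of very different size.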
The PM, termed the “large flood regionalisation model” (LFRM) for large to rare floods as
described by Majone et al. (2007) and Haddad et al. (2011b), is chosen for further study as
another objective of this thesis. This method, as discussed above, is an empirical approach
that makes use of pooled data from various sites while taking into account the differences
in the mean and the varying CV from site to site. This form of standardisation allows the
pooling of more data from many stations. Compared to standard methods, the application
of the LFRM can overcome many of the difficulties, limitations and assumptions of large
to rare flood frequency analysis. The main focus of the LFRM in this thesis is expanded to
couple it with a spatial dependence model (such as the CRC-FORGE constant spatial
dependence model, e.g. Buishand, 1984; Hosking and Wallis, 1988; Dales and Reed, 1989;
Nandakumar et al., 1997; Stewart et al., 1999; Nandakumar et al., 2000; Castellarin et al.,
2007; Vannitsem and Naveau, 2007; Guse et al., 2009) that reflects the reduction in the net
information available for regional analysis using spatially dependent data (see also
sections 2.2.3 and 2.4.2). The LFRM is also to be extended to the ungauged catchment
case by coupling it with the BGLSR method and the ROI to estimate the mean and CV of
annual floods at sites where there is little or no data. The advantages of the BGLSR and
ROI methods have already been discussed in sections 2.5 and 2.6.
2.9 IMPACT OF CLIMATE CHANGE ON FLOOD FREQUENCY
ANALYSIS
In the literature, it has been noted that the frequency and magnitude of extreme flood
events are likely to rise in the near future due to climate change (IPCC, 2007; BOM, 2012).
This may have implications for typical flood frequency analysis, which assumes that the
‘stationarity assumption’ is valid (Khaliq et al., 2006). Researchers in non-stationary flood
frequency analysis in different parts of the world have questioned the validity of the
traditional assumption of stationarity in flood risk estimation (e.g. Franks and Kuczera,
2002; Cunderlik and Burn, 2003; Prudhomme et al., 2003; Micevski et al., 2006; Leclerc
and Ouarda, 2007; Pui et al., 2011 and Pall et al., 2011).
There have been a number of studies on the identification of trends in flood data. For
example, Olsen et al. (1999) reported positive trends in flood risk over time for gauged
sites within the Mississippi, Missouri, and Illinois River basins. Douglas et al. (2000)
found no evidence of trends in flood flows, but they did find evidence of upward trends in
low flows at a larger scale in the Midwest and at a smaller scale in Ohio, the north central
and the upper Midwest regions of the USA. Negative trends in total streamflow were most
common for the analysed Pennsylvanian streamflow time series from 1971 to 2001 due to
climate variability (Zhu and Day, 2005). Novotny and Stefan (2007) investigated the
streamflow records from 36 gauging stations in five major river basins of Minnesota,
USA, for trends and correlations using the Mann-Kendall (MK) test and the moving
averages method. They found that trends differ significantly from one river basin to
another, and became more prominent for shorter time windows. Pasquini and Depetris
(2007) presented an overview of flood discharge trends of South American rivers draining
the southern Atlantic seaboard. Juckem et al. (2008) found a decrease in annual flood peaks
for stream gauging stations in the Driftless Area of Wisconsin. Ishak et al. (2013) found
that about 15% of Australian stream gauging stations showed a trend, mainly downward in
eastern, south-eastern and south-western parts of Australia and upward in northern
Australia. It should be noted that these stations (about 15% of the total) were excluded
from this thesis, as it focuses on stationary regional flood frequency analysis. However, the
outcome of the study by Ishak et al. (2013), part of the PhD thesis of Elias Ishak (another
UWS PhD student), will be used to develop an adjustment factor to correct for non-
stationarity in regional flood frequency analysis.
2.10 SUMMARY
The estimation of flood behaviour (both within and beyond the credible limit of
extrapolation) at ungauged catchments is a common problem in hydrology. Regional flood
frequency analysis (RFFA) is commonly used to “transfer” flood characteristics
information from gauged catchments to ungauged ones. In this chapter, the literature
review has covered a range of currently applied RFFA techniques, with particular
emphasis on the quantile and parameter regression techniques (QRT and PRT).
The index flood method (IFM) has been discussed, which assumes that the probability
distribution of floods at sites within a homogeneous region is identical except for a site-
specific scaling factor. Recent studies have shown positive results for the L-moments-based
IFM in South-east Australia. However, due to the large heterogeneity of Australian
catchments, a method that does not strictly require homogeneous regions is more suitable
for Australia.
The probabilistic rational method (PRM) is currently recommended in South-east Australia
for design flood estimation in small to medium sized ungauged catchments. Though
considered a regional method and easy to apply, it has been criticised by researchers
because of the assumption of geographical contiguity in the mapping and application of the
runoff coefficients.
The QRT and PRT are multiple regression techniques which relate flood quantiles or the
parameters of a probability distribution (i.e. location, scale and shape, which are related to
the mean, standard deviation and skewness of the flood data) to catchment characteristics,
assuming linearity. The advantage of both the QRT and PRT is that no assumptions are
made about runoff coefficients, geographical contiguity or strictly homogeneous regions, as
with the PRM and IFM, respectively.
The preferred methodology for the QRT and PRT is to use generalised least squares
regression (GLSR), and in particular the Bayesian GLSR (BGLSR) approach, as further
improvements can be made with this method, such as accounting for sampling variability
and cross-correlated data and, more importantly, distinguishing between model error and
sampling error in the regional model. Furthermore, the Bayesian approach provides both a
measure of the precision of the model error variance, which the traditional GLSR lacks,
and a more reasonable description of the possible values of the model error variance in
cases where the model error variance is small compared to the sampling errors.
The concept of fixed regions and the region of influence (ROI) approach in RFFA has also
been discussed, and the advantages and disadvantages of each have been outlined. The past
studies presented have all shown improvements of the ROI over a fixed region approach.
Keeping this in mind, along with the high heterogeneity of Australian catchments, it makes
sense to combine and compare the QRT and PRT methods under a BGLSR framework
with the ROI approach.
Model validation is a very important part of RFFA especially in the area of hydrological
regression. The concept of model validation using split-sample, leave-one-out validation
(LOO) and Monte Carlo cross validation (MCCV) has been discussed. Past studies of the
application of LOO in hydrology have been presented, while studies relating to the use of
MCCV in other fields of science have also been discussed. Given the lack of application of
MCCV in hydrological regression, this thesis will compare both LOO and MCCV for
RFFA model validation in the state of New South Wales in Australia.
Finally, the estimation of large to rare and even extreme floods is of great importance for
hydrological design and risk assessment for large infrastructure. The statistical modelling
of large to rare floods remains a subject of active research. A brief history of large flood
estimation has been given in this chapter, along with recent studies and applications in this
field. This thesis will present a new large flood regionalisation model (LFRM) that pools
data more efficiently over a very large region. The LFRM will be combined with a newly
developed spatial dependence model that reflects the reduction in the net information
available for regional analysis using spatially dependent data.
CHAPTER 3: ADOPTED STATISTICAL TECHNIQUES FOR
REGIONAL FLOOD FREQUENCY ANALYSIS AND MODEL
VALIDATION
3.1 GENERAL
This chapter provides an overall description of the statistical techniques adopted in this
study for (i) regional flood frequency analysis (RFFA) in the range of 2 – 100 years
average recurrence intervals (ARIs) and (ii) validation of regional hydrological regression
models using leave-one-out (LOO) and Monte Carlo cross validation (MCCV) techniques.
At the outset, a flow chart (Figure 2) is provided which summarises the statistical
procedures and methodologies adopted in this thesis.
First, the log Pearson type 3 (LP3) distribution is described, which is fitted to the
observed annual maximum flood series data using a Bayesian parameter estimation
procedure. A discussion is then presented on the quantile and parameter regression
techniques (QRT and PRT); while the basic theory has been introduced in Chapter 2,
further emphasis is given here to the generalised least squares regression (GLSR), in
particular the Bayesian GLSR (BGLSR). The region of influence (ROI) approach is then
discussed in the light of its application with the BGLSR. Here the application of the ROI is
based on the minimisation of the predictive variance, which is applied with both the QRT
and PRT regression techniques. The methodology outlined here is intended to highlight the
assumptions involved and to give an overview of how to deal with the various uncertainties
associated with the data to obtain reliable flood estimates.
The next part of this chapter discusses the mathematical formulations used in the model
validation. The theory behind LOO and MCCV is presented (as outlined in Song Xu et al.,
2005), with an emphasis on the hydrologic regression analysis using ordinary least squares
regression (OLSR) and GLSR-based QRT.
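The two validation schemes can be sketched for an ordinary least squares model as follows. This is a minimal NumPy illustration of the ideas only; the function names, split fraction and number of splits are illustrative, not those used in the thesis.

```python
import numpy as np

def loo_errors(X, y):
    """Leave-one-out (LOO) validation: withhold each site in turn,
    refit the regression on the remaining sites, and predict the
    withheld observation."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs[i] = y[i] - X[i] @ beta
    return errs

def mccv_errors(X, y, n_splits=200, test_frac=0.3, seed=1):
    """Monte Carlo cross validation (MCCV): repeatedly draw a random
    test subset, fit on the remainder, and collect the test errors."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(test_frac * n))
    errs = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(y[test] - X[test] @ beta)
    return np.concatenate(errs)
```

Unlike LOO, which yields exactly n prediction errors, MCCV averages over many random splits, which is the property that makes it attractive for model selection.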
Figure 2 Flow chart showing statistical techniques/methods adopted in this thesis.
[The flow chart sets out the major steps in the research: data collation (streamflow data –
filling missing data, trend analysis, rating curve error analysis, outlier testing – and
catchment and climatic characteristics data); at-site flood frequency analysis (LP3
distribution); regional flood frequency analysis for ARIs of 2 – 100 years (comparison of
Bayesian GLSR using QRT and PRT in a fixed region framework and in a region of
influence framework, with comparison and validation of methods using data from NSW,
VIC, QLD and TAS); application of LOO and MCCV for the validation of regional
regression models (case study for NSW using OLSR and GLSR); large flood
regionalisation model (LFRM) development using data from all of Australia (collation of
streamflow data, homogeneity testing, finding an appropriate distribution, at-site flood
frequency analysis with the GEV distribution, LFRM and spatial dependence model
development, and ungauged catchment application); and conclusions and
recommendations.]
3.2 AT-SITE FLOOD FREQUENCY ANALYSIS
3.2.1 BASICS OF AT-SITE FLOOD FREQUENCY ANALYSIS
At-site flood frequency analysis is an elementary step in any RFFA study. The primary
objective of flood frequency analysis is to relate the magnitude of extreme events to their
frequency of occurrence through the use of probability distributions (Chow et al., 1988).
Data observed over an extended period of time in a river system are analysed using
frequency analysis techniques. The data for flood frequency analysis are assumed to be
independent and identically distributed. The flood data are considered to be stochastic and
space and time independent. Furthermore, it is assumed that the flood data have not been
affected by natural or manmade changes in the hydrological regime or by climate change
(the stationarity assumption).
In flood frequency analysis, a unique relationship between a flood magnitude Q and the
corresponding ARI T is sought. The task is to extract information from a flow record to
estimate the relationship between Q and T. Three different models may be
considered for this purpose (Cunnane, 1989). These models are (1) annual maximum series,
(2) partial duration series or peaks over threshold series, and (3) time series model. For this
study, annual maximum series flood data is adopted.
Australian Rainfall and Runoff (ARR) (I. E. Aust, 1987) recommends the LP3 distribution
fitted with the method of moments (MOM) for use in at-site flood frequency analysis.
However, research has shown that a reassessment of the LP3 distribution/MOM estimation
approach is overdue (Wallis and Wood, 1985; Vogel et al., 1993). The recommendations
currently being prepared by the National Committee on Water Engineering include a
change from the current MOM to Bayesian fitting procedures to estimate the parameters of
the probability distributions used in at-site flood frequency analysis (Kuczera and Franks,
2005). Hence, the Bayesian method is adopted in this study to estimate the at-site flood quantiles.
The LP3 Bayesian procedure has shown satisfactory results in the study area as
demonstrated by Haddad and Rahman (2008) and Rahman et al. (2011).
3.2.2 FLIKE SOFTWARE FOR AT-SITE FFA
The at-site flood quantiles are estimated by FLIKE, which is a computer program
developed by Professor George Kuczera of the University of Newcastle. The FLIKE
program facilitates Bayesian analysis and the method of L-moments for parameter
estimation. The following section briefly describes the LP3 probability distribution.
Kuczera (1999b) presents how FLIKE obtains initial parameter values when searching for
the most probable values.
3.2.3 LOG PEARSON TYPE 3 (LP3) DISTRIBUTION
The LP3 probability model has the following probability distribution function (pdf):
f(log_e x | α, β, k) = (|k| / Γ(α)) [k(log_e x − β)]^(α−1) exp[−k(log_e x − β)]   (3.1)

for β ≤ log_e x when k > 0, or log_e x ≤ β when k < 0, and α > 0,

with Γ( ) being the gamma function.
The LP3 model has been widely accepted in practice because it consistently fits flood data
as well as, if not better than, other probability models. When the skew of log_e x is zero, the
model simplifies to the log normal. The model, however, is not well-behaved from an
inference perspective. Direct inference of the shape parameter α, the scale parameter k and
the location parameter β causes numerical problems. For example, when the skew of log_e x
is close to zero, the shape parameter tends to infinity. Experience indicates that it is
preferable to fit the first three moments (μ, σ and γ) of log_e x rather than α, β and k
(Kuczera, 1999b).
This parameterisation, based on the mean (μ), standard deviation (σ) and skewness (γ) of
log_e x, is often used to calculate the T-year event quantile:
log_e Q_T = μ + K_T(γ) σ   (3.2)

where K_T(γ) is the frequency factor, which is the T-year quantile of the Pearson type 3 (P3)
distribution with mean zero, standard deviation of one and skewness γ. The frequency
factor K_T can be approximated with sufficient accuracy by the Wilson-Hilferty
transformation (Kirby, 1972 and Rao and Hamed, 2000) for |γ| < 2:

K_T = (2/γ) { [1 + (γ z)/6 − γ²/36]³ − 1 }   (3.3)
where z is the T-year quantile of the standard normal distribution.
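Equations (3.2) and (3.3) can be sketched in code as follows. This is a minimal illustration that assumes AEP = 1/ARI for the annual maximum series; the function names are illustrative.

```python
import math
from scipy.stats import norm

def frequency_factor(ari_years, skew):
    """K_T(gamma) via the Wilson-Hilferty transformation
    (Equation 3.3), adequate for |skew| < 2."""
    z = norm.ppf(1.0 - 1.0 / ari_years)   # T-year standard normal quantile
    if abs(skew) < 1e-10:                 # zero skew: P3 reduces to the normal
        return z
    g6 = skew / 6.0
    return (2.0 / skew) * ((1.0 + g6 * z - g6 ** 2) ** 3 - 1.0)

def lp3_quantile(mu, sigma, gamma, ari_years):
    """T-year LP3 flood quantile (Equation 3.2):
    log_e(Q_T) = mu + K_T(gamma) * sigma."""
    return math.exp(mu + frequency_factor(ari_years, gamma) * sigma)
```

With zero skewness the frequency factor reduces to the standard normal quantile z, so the quantile collapses to the log normal case, mirroring the remark above about the LP3 simplifying to the log normal.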
Problems may arise, however, when the skew of log_e x is negative, in which case the
upper bound on flows can cause difficulties. FLIKE avoids this problem by starting the
search for the most probable parameters using log normal MOM parameters fitted to the
flood data. This strategy is quite robust because when the skew of log_e x is zero, the flow
bounds are pushed all the way to infinity. As a result, the search starts in a region of
parameter space well removed from the constraints imposed by the flow bounds.
Furthermore, a serious problem arises when the absolute value of the skew of log_e x
exceeds 2, that is, when α ≤ 1. When α > 1, the LP3 has a gamma-shaped density; however,
when α ≤ 1, the density changes to a J-shaped function. Indeed, when α = 1, the pdf
degenerates to that of an exponential distribution with scale parameter 1/k and location
parameter β. For α ≤ 1, the J-shaped density appears to be over-parameterised with three
parameters. The posterior density surface reveals extremely elongated contours which are
suggestive of an over-parameterised model. In such circumstances, it is pointless to use the
LP3 distribution, and it is suggested that either the generalised extreme value (GEV) or the
generalised Pareto (GPA) distribution be used as a substitute (Kuczera, 1999b). In this
study, no prior information was used in fitting the LP3 distribution. The parameters of the
LP3 distribution (i.e. the mean, standard deviation and skewness) were also extracted from
the FLIKE software for use in the RFFA.
3.3 THE CLASSICAL GLS REGRESSION PROBLEM
This section focuses on the basic generalised least squares regression (GLSR) model and
discusses the classical assumptions for this procedure. Subsequent sections recast the
analysis of the GLSR model in a Bayesian framework following Reis (2005) and Reis et al.
(2005). Streamflow data, be it annual maximum or partial duration series data sets, can be
used to derive an empirical relationship between catchment/climatic characteristics
variables and the hydrologic statistic of interest. For instance, catchment area and design
rainfall intensity may be used to estimate hydrologic characteristics at a site, such as the
mean annual flow, the 10-year or 100-year peak flow, or the shape parameter of a
theoretical probability distribution, such as the log-space skewness coefficient (γ) used to fit
an LP3 distribution, or the shape parameter (κ) of a GEV or GPA distribution.
The GLSR model assumes that the quantity of interest yi at a given site i can be described
by a linear function of catchment/climatic characteristics (or a transformation thereof) with
an additive error. In matrix notation, the model is represented by:

y = Xβ + ε   (3.4)

where X is an (n × k) matrix of catchment characteristics augmented by a column of ones, β
is the vector of regression parameters that must be estimated, and ε is an (n × 1) vector
of random errors for each of the n sites used in the regression, assumed to be normally
distributed with zero mean and a covariance matrix of the form:

E(εεᵀ) = σ²Ω   (3.5)
wherein σ² is the error variance and Ω is a positive definite symmetric matrix
(Johnston, 1972; Rencher, 2000; Koop, 2005). Different choices of the matrix Ω allow one
to make different assumptions regarding the nature of the model errors. If Ω is equal to the
identity matrix I, the problem is homoscedastic, and the GLSR model reduces to OLSR.
Uncorrelated errors with different variances at different sites can be described using a
matrix Ω with different variances on the diagonal and zeros off the diagonal. In this case,
the GLSR model in Equation (3.4) reduces to the weighted least squares regression
(WLSR) model. In the more general case, Ω is defined in such a way that it reflects both
heteroscedasticity and correlation among the residuals.
According to the Gauss-Markov-Aitken theorem, when Ω is known, the minimum
variance unbiased estimator for β does not depend on σ² and is given by (Rao and
Toutenburg, 1999 and Koop, 2005):

β̂_GLS = (XᵀΩ⁻¹X)⁻¹ XᵀΩ⁻¹y   (3.6)
The equation above defines the GLSR estimator, denoted by β̂_GLS. Note that the
subscript GLS is sometimes omitted for brevity. The unbiased estimate of σ² is given by:

σ̂²_GLS = (y − Xβ̂_GLS)ᵀ Ω⁻¹ (y − Xβ̂_GLS) / (n − (k + 1))   (3.7)

with sampling covariance matrix:

var(β̂_GLS) = σ²(XᵀΩ⁻¹X)⁻¹   (3.8)

The GLSR estimator is also the best linear unbiased estimator (BLUE) in the class of
linear estimators. Since the matrix Ω is unknown in practice, an estimator, usually denoted
by Ω̂, needs to be used.
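The classical GLSR estimator of Equations (3.6) – (3.8) can be sketched with NumPy as follows. A minimal illustration, assuming X carries the column of ones so that the degrees of freedom are n minus the number of columns.

```python
import numpy as np

def gls_fit(X, y, Omega):
    """Classical GLS estimator (Equations 3.6 - 3.8):
    beta = (X' Omega^-1 X)^-1 X' Omega^-1 y.
    With Omega = I this reduces to OLSR; a diagonal Omega gives WLSR."""
    Oi = np.linalg.inv(Omega)
    XtOiX = X.T @ Oi @ X
    beta = np.linalg.solve(XtOiX, X.T @ Oi @ y)   # Equation (3.6)
    resid = y - X @ beta
    n, p = X.shape                                # p = k + 1 with intercept
    sigma2 = (resid @ Oi @ resid) / (n - p)       # Equation (3.7)
    cov_beta = sigma2 * np.linalg.inv(XtOiX)      # Equation (3.8)
    return beta, sigma2, cov_beta
```

Passing the identity matrix for Omega recovers the OLSR fit, which is a quick way to check an implementation against standard least squares output.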
3.3.1 GLSR, THE STEDINGER AND TASKER MODEL
Stedinger and Tasker (1985, 1986) developed a GLSR model for regional hydrologic
regression. The important difference from OLSR and the classical GLSR model of the
form given by Equation (3.4) lies in the development and partition of the covariance matrix
of the errors. The GLSR model of Stedinger and Tasker (1985) assumes that the total error
results from two sources: model errors δ_i, which are assumed to be independently
distributed with zero mean, E(δ_i) = 0, and common variance:

Cov(δ_i, δ_j) = δ²  for i = j;  Cov(δ_i, δ_j) = 0  for i ≠ j   (3.9)

and sampling errors η that arise due to the fact that the actual values of y_i are unknown and
only estimates of the quantities of interest are available.
Therefore, Equation (3.4) becomes (following Reis et al., 2005):

ŷ = y + η = Xβ + δ + η = Xβ + ε   (3.10)

where η is the sampling error in the sample estimators. Thus, the regression model errors ε_i
are a combination of: (i) the time-sampling error in the sample estimators ŷ_i of y_i and (ii)
the underlying model error δ_i (lack of fit). The total error ε has zero mean and covariance
matrix:

Λ = E[εεᵀ] = δ²I + Σ(ŷ)   (3.11)

where Σ(ŷ) is the covariance matrix of the sampling errors in the sample estimators (such
as the flood quantiles or the parameters of the LP3 distribution – see Equation (3.2)). Time-
sampling errors in the estimators of the y_i are usually correlated among sites because flows
at nearby sites are driven by similar hydrological mechanisms (e.g. meteorology).
Reasonably accurate estimation of the sampling covariance matrix in the GLSR is very
important and is vital to the solution of the GLSR equations. More details about the
construction of Σ(ŷ) for flood quantiles and statistics are given in section 3.5 and can be
read in Stedinger and Tasker (1985 and 1986).
In this regional framework, δ² can be viewed as a heterogeneity measure. Madsen et al.
(1997 and 2002) showed that the regional average GLSR estimator is a general extension
of the record-length-weighted average commonly applied in the index flood method;
however, the record-length-weighted average estimator neglects inter-site correlation and
regional heterogeneity (Stedinger et al., 1993 and Stedinger and Lu, 1995).
The GLSR estimator of β is given by:

β̂_GLS = [XᵀΛ(δ̂²)⁻¹X]⁻¹ XᵀΛ(δ̂²)⁻¹ŷ   (3.12)

The sampling covariance matrix thus becomes:

var(β̂_GLS) = [XᵀΛ(δ̂²)⁻¹X]⁻¹   (3.13)

The model error variance δ² arises from an imperfect model and is a measure of the
precision of the true regression model. Unfortunately, the model error variance is not
known and needs to be estimated. Stedinger and Tasker (1986) proposed a MOM estimator
where δ̂² can be obtained by iteratively solving Equation (3.12) along with the generalised
residual mean square error (MSE) equation given by Equation (3.14):
(ŷ − Xβ̂_GLS)ᵀ [δ̂²I + Σ(ŷ)]⁻¹ (ŷ − Xβ̂_GLS) = n − (k + 1)   (3.14)

In some situations, the sampling covariance matrix explains all the variability observed in
the data, which means that the left-hand side of Equation (3.14) will be less than n − (k + 1)
even if δ̂² is zero. In these circumstances, the MOM estimator of the model error variance
is generally taken to be zero (Stedinger and Tasker, 1985; 1986). Alternative methods for
estimating the model error variance by maximum likelihood (ML) estimation can be seen
in Kuczera (1983a) and Rencher (2000).
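A sketch of the MOM iteration follows: the model error variance is increased until the generalised residual mean square of Equation (3.14) falls to n − (k + 1), re-estimating β from Equation (3.12) at each trial value. The bisection search and the argument names are illustrative choices for this sketch, not the exact scheme of Stedinger and Tasker (1986); Sigma must be positive definite.

```python
import numpy as np

def mom_model_error_variance(X, y_hat, Sigma, tol=1e-8):
    """MOM estimate of the model error variance delta^2: find
    delta^2 >= 0 such that the generalised residual mean square
    (Equation 3.14) equals n - (k + 1), refitting beta via
    Equation (3.12) at each trial value."""
    n, p = X.shape                       # p = k + 1 with intercept column

    def gmse(d2):
        Li = np.linalg.inv(d2 * np.eye(n) + Sigma)
        beta = np.linalg.solve(X.T @ Li @ X, X.T @ Li @ y_hat)
        r = y_hat - X @ beta
        return r @ Li @ r

    target = n - p
    if gmse(0.0) <= target:              # sampling error explains everything
        return 0.0
    lo, hi = 0.0, 1.0
    while gmse(hi) > target:             # gmse decreases as d2 grows
        hi *= 2.0
    while hi - lo > tol:                 # bisection on the bracketed root
        mid = 0.5 * (lo + hi)
        if gmse(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The early return implements the convention noted above: when the sampling covariance matrix alone explains the observed variability, the MOM estimate is taken as zero.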
Based on Monte Carlo simulations, Stedinger and Tasker (1986) showed that the MOM
model error variance procedure provides faster and more robust results, since no
assumptions are made about the distribution of the residuals, and is less biased when the
true model error variance is moderate to large (usually the case for flood quantile and mean
flood estimation). Stedinger and Tasker (1986) also showed, from their simulation study
using various cross-correlations among concurrent flows (0, 0.3 and 0.6), that for small
δ̂² the ML estimates were much more accurate. In fact, the ML estimator of δ² always had a
smaller MSE than the MOM estimator. If the regional regression analysis exhibits a small
model error variance δ̂², which is the case when sampling errors dominate the regional
analysis (e.g. with the regionalisation of the shape parameters of probability distributions,
i.e. skewness estimators), the ML procedure should be preferred to the MOM estimator.
Bayesian analysis, which is based on the likelihood function, is also a good candidate in
these situations, and would address the bias concern because, on average over the prior,
Bayesian estimators are unbiased (Stedinger, 1983).
3.4 BAYESIAN METHODOLOGY
Reis et al. (2005) developed a Bayesian approach to estimate the regional model
parameters and showed that the Bayesian approach can provide a realistic description of the
possible values of the model error variance, especially in the case where the sampling error
tends to dominate the model error in the regional analysis (Madsen et al., 2002; Reis et al.,
2005 and Haddad et al., 2012). This thesis extends the work of Reis et al. (2005) and
applies the BGLSR to estimate the parameters and flood quantiles of the LP3 distribution
(see Equation (3.2)). The BGLSR is chosen as the desired framework because the current
GLSR model analysis methodology based on Tasker and Stedinger (1989) and Griffis and
Stedinger (2007) does not provide an estimate of the uncertainty in the estimated model
error variance of the flood quantiles and the first two moments of the LP3 distribution.
3.4.1 CLASSICAL BAYESIAN INFERENCE
In a Bayesian framework, the parameters of the model are considered to be random
variables whose pdf is to be estimated. The Bayesian approach combines the data with any
prior information (if available) about the parameters being estimated (see also section
2.4.3). This information is usually established from other data sets, previous studies or
specific knowledge about the behaviour of the system being analysed. Parameter estimation
is made through the posterior distribution, which is developed using Bayes’ rule (Zellner,
1991):

p(θ | I) = p(I | θ) ξ(θ) / ∫ p(I | θ) ξ(θ) dθ   (3.15)

Here, p(θ | I) is the posterior distribution of the parameter vector θ, p(I | θ) is the
likelihood function for the data I, and ξ(θ) is the prior distribution of θ. The denominator is
a normalising constant that ensures that the area under the posterior pdf equals one. Reis et
al. (2005) developed a Bayesian approach to estimate the regional model coefficient of the
log-space skewness and showed that the Bayesian approach can provide a realistic
description of the possible values of the model error variance. A particular advantage of the
Bayesian approach is that it provides a full posterior distribution of the parameters,
whereas classical methods usually give only a point estimate.
3.5 BAYESIAN GLS REGRESSION
3.5.1 APPROACH ADOPTED IN THIS STUDY FOR THE QUANTILE AND
PARAMETER REGRESSION TECHNIQUES
To regionalise the flood quantiles ($Q_T$), the sampling covariance matrix ($\Lambda$) of the LP3 distribution is required. Tasker and Stedinger (1989) and Griffis and Stedinger (2007, p. 84, Eq. 4) provide the approximate estimator of the components of the matrix $\Lambda$ of the LP3 distribution, which is given by:

$$\Lambda_{Q_T}(i,i) = \frac{\sigma_i^2}{n_i}\left[1 + \gamma_i K_i + \frac{K_i^2}{2}\left(1 + 0.75\,\gamma_i^2\right)\right] \quad \text{for } i = j$$

$$\Lambda_{Q_T}(i,j) = \frac{m_{ij}\,\rho_{ij}\,\sigma_i \sigma_j}{n_i n_j}\left[1 + \frac{\gamma_i K_i}{2} + \frac{\gamma_j K_j}{2} + \frac{K_i K_j}{2}\left(\rho_{ij} + 0.75\,\gamma_i \gamma_j\right)\right] \quad \text{for } i \neq j \qquad (3.16)$$
where $K$ is the standard LP3 frequency factor, $\gamma_i$ and $\gamma_j$ are the skews, $m_{ij}$ is the concurrent record length between sites $i$ and $j$, $\rho_{ij}$ is the lag-zero cross-correlation of flood peaks between sites $i$ and $j$, and $\sigma_i$ and $\sigma_j$ are the population standard deviations at sites $i$ and $j$, respectively. The skew and standard deviation in the matrix $\Lambda$ (Equation 3.16) are themselves subject to estimation uncertainty. In this study, to avoid correlation between the residuals and the fitted quantiles, the following methods are adopted:
(i) the inter site correlation between the concurrent annual maximum flood series
(ρij) is estimated as a function of the distance between sites i and j;
(ii) the standard deviations (of the logarithms of annual maximum flood series) σi
and σj are estimated using a separate OLSR and GLSR using the explanatory
variables used in the study (given in Chapter 4); and
(iii) the regional skew (of the logarithms of annual maximum flood series) is used in
place of the population skew as suggested by Tasker and Stedinger (1989).
The analysis above used the regional estimates of the standard deviation and skew obtained from the BGLSR. Detailed information on the covariance matrices associated with the standard deviation and skew can be found in Reis et al. (2005) and Griffis and Stedinger (2007); an overview is provided here.
It is necessary to carry out GLSR on the sample standard deviation and skew because both of these parameters have an associated estimation error; the approximation in Equation 3.16 should therefore be updated to reflect all the uncertainty associated with the sampling error in the quantile estimates. The needed estimator of the sampling covariance matrix for the standard deviation and skew is given in Equations 3.17 to 3.20. For the standard deviation:

$$\Lambda_{\sigma}(i,i) = 0.5\,\frac{\sigma_i^2}{n_i}\left(1 + 0.75\,\gamma_i^2\right) \quad \text{for } i = j$$

$$\Lambda_{\sigma}(i,j) = 0.5\,\frac{\rho_{ij}^2\, m_{ij}\,\sigma_i \sigma_j}{n_i n_j}\left(1 + 0.75\,\gamma_i \gamma_j\right) \quad \text{for } i \neq j \qquad (3.17)$$
The off-diagonal elements of the sampling covariance matrix for the skew coefficient
include the term Cov[gi, gj] which is the covariance between the two at-site skew
estimators gi and gj. This term is obtained from:
$$\mathrm{Cov}[g_i, g_j] = \rho_{g_i g_j}\sqrt{\mathrm{Var}[g_i]\,\mathrm{Var}[g_j]} \qquad (3.18)$$

where the cross-correlation $\rho_{g_i g_j}$ is estimated using the approximation developed by Martins and Stedinger (2002a):

$$\hat{\rho}_{g_i g_j} = \mathrm{Sign}\bigl(\hat{\rho}_{ij}\bigr)\, cf_{ij}\, \bigl|\hat{\rho}_{ij}\bigr|^{\kappa} \qquad (3.19)$$

wherein $cf_{ij} = m_{ij}/\sqrt{(m_{ij}+n_i)(m_{ij}+n_j)}$, $m_{ij}$ is the common record period and $n_i$, $n_j$ are the extra observation periods for stations $i$ and $j$, respectively. Values of $\kappa$ are tabulated by Martins and Stedinger (2002a). In addition, $\mathrm{Var}[g_i]$ and $\mathrm{Var}[g_j]$ are evaluated
using the following approximation derived by Griffis and Stedinger (2007):

$$\mathrm{Var}[g_i] = \left(\frac{6}{n_i} + a(n_i)\right)\left[1 + \left(\frac{9}{6} + b(n_i)\right)\gamma_i^2 + \left(\frac{15}{48} + c(n_i)\right)\gamma_i^4\right] \qquad (3.20)$$

wherein $a(n_i)$, $b(n_i)$ and $c(n_i)$ are corrections for small samples:

$$a(n_i) = -\frac{17.75}{n_i^{2}} + \frac{50.06}{n_i^{3}}$$

$$b(n_i) = \frac{3.92}{n_i^{0.3}} - \frac{31.41}{n_i^{0.6}} + \frac{34.86}{n_i^{0.9}}$$

$$c(n_i) = -\frac{7.31}{n_i^{0.59}} + \frac{45.90}{n_i^{1.18}} - \frac{86.50}{n_i^{1.77}} \qquad (3.21)$$
The regional skew $G_i$ (the mean of the at-site skews of Equation 3.22) is used in Equation 3.20 in place of the population skew $\gamma_i$ to avoid correlation between the residuals and the at-site estimates of the skew, wherein the at-site skew is

$$g_i = \frac{n_i}{(n_i-1)(n_i-2)}\sum_{t=1}^{n_i}\frac{(x_t - \bar{x})^3}{s^3} \qquad (3.22)$$
where xt is the logarithm of the annual maximum flows in the year t, and s is the sample
standard deviation of xt. Because the true values of skews at each site are unknown, the
regional mean of the skews is used in Equation 3.20.
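The small-sample machinery of Equations 3.19 to 3.21 is easy to sketch numerically. The following is an illustrative sketch only (numpy assumed; function names are hypothetical, and $\kappa$ must be supplied from the Martins and Stedinger (2002a) tables):

```python
import numpy as np

def skew_sampling_var(g, n):
    """Sampling variance of the log-space skew (Equations 3.20-3.21);
    g is the regional skew used in place of the population value and
    n the record length."""
    a = -17.75 / n**2 + 50.06 / n**3
    b = 3.92 / n**0.3 - 31.41 / n**0.6 + 34.86 / n**0.9
    c = -7.31 / n**0.59 + 45.90 / n**1.18 - 86.50 / n**1.77
    return (6.0 / n + a) * (1.0 + (9.0 / 6.0 + b) * g**2
                            + (15.0 / 48.0 + c) * g**4)

def skew_cross_corr(rho_ij, m_ij, n_i, n_j, kappa):
    """Cross-correlation of sample skews (Equation 3.19); kappa is the
    exponent tabulated by Martins and Stedinger (2002a)."""
    cf = m_ij / np.sqrt((m_ij + n_i) * (m_ij + n_j))
    return np.sign(rho_ij) * cf * abs(rho_ij) ** kappa
```

For a 50-year record and near-zero skew, the sampling variance is close to $6/n$, consistent with the leading term of Equation 3.20.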
For the parameter regression technique (PRT), GLSR is also adopted (Tasker and Stedinger, 1989; Griffis and Stedinger, 2007) using a Bayesian framework (Reis et al., 2005) to develop regression equations for the parameters of the LP3 distribution, i.e. the mean ($\mu$), standard deviation ($\sigma$) and skew ($\gamma$) of the logarithms of the annual maximum flood series. The regional values of standard deviation and skew were found based on Equations 3.18 to 3.22. The sampling covariance matrix for the mean flood ($\Lambda_{\mu}$) was obtained following Stedinger and Tasker (1986), which is given below:

$$\Lambda_{\mu}(i,i) = \frac{\sigma_i^2}{n_i} \quad \text{for } i = j$$

$$\Lambda_{\mu}(i,j) = \frac{\rho_{ij}\, m_{ij}\,\sigma_i \sigma_j}{n_i n_j} \quad \text{for } i \neq j \qquad (3.23)$$
3.5.2 ADOPTED BAYESIAN REGRESSION APPROACH – PRIOR FOR THE β
COEFFICIENTS
As discussed in section 3.4.1, in order to apply the Bayesian analysis to the regional regression problem in this study, one needs to define prior distributions for the $\beta$ coefficients and for the model error variance.
With the Bayesian approach, it is assumed here that there is no prior information on any of
the β coefficients; thus, a multivariate normal distribution with mean zero and a large
variance (e.g. greater than 100) is used as a prior for the regression coefficients as
suggested by Reis et al. (2005). This prior is considered to be almost non-informative,
which produces a pdf that is generally flat in the region of interest.
A multivariate normal distribution prior is given by:

$$\xi(\boldsymbol{\beta}) = (2\pi)^{-(k+1)/2}\,|P|^{1/2}\exp\!\left[-0.5\,(\boldsymbol{\beta}-p)^{T}P\,(\boldsymbol{\beta}-p)\right] \qquad (3.24)$$

wherein $\boldsymbol{\beta}$ has dimension $k+1$, and Equation 3.24 has mean vector $p$ and precision matrix $P$. Zellner (1971) notes that the precision can be represented by the reciprocal of the variance. Zellner (1971) and Congdon (2001) also suggest that a two-parameter gamma distribution can be used to represent the prior information.
The likelihood function for the data as suggested by Reis et al. (2005) is considered to be a
multivariate normal distribution, so that:
$$L(\boldsymbol{\beta}, \sigma_{\delta}^2 \mid \mathbf{y}) = (2\pi)^{-n/2}\,|\Lambda|^{-1/2}\exp\!\left[-0.5\,(\mathbf{y}-X\boldsymbol{\beta})^{T}\Lambda^{-1}(\mathbf{y}-X\boldsymbol{\beta})\right] \qquad (3.25)$$

where the covariance matrix $\Lambda$ is defined in Equation 3.11, $n$ is the number of sites in the region, $\mathbf{y}$ is the vector with the sample values of the hydrologic statistic of interest (i.e. mean flood, flood quantile etc.), and $X$ is the matrix of explanatory variables (catchment area, design rainfall intensity etc.).
3.5.3 ANALYTICAL SOLUTION TO BAYESIAN APPROACH FOR THE
POSTERIOR OF THE MODEL ERROR VARIANCE
To compute the normalising constant in Equation 3.15 it is often useful to use Markov
Chain Monte Carlo (MCMC) algorithms such as the Metropolis-Hastings or Gibbs sampler
algorithms (e.g. Kuczera and Parent, 1998; Micevski and Kuczera, 2009 and Reis et al.,
2003). These algorithms are usually adopted for computationally intense problems, which
really depend on the dimension or complexity of the model being analysed. Given that the
dimension of this problem is relatively straight forward, it can be solved more easily using
the quasi-analytical approximation of the marginal posterior of the model error variance as
discussed by Kitanidis (1986) and Reis et al. (2003 and 2005). Below a brief overview of
equations and steps involved are outlined.
In simpler cases, it is possible to integrate the joint posterior of $\sigma_{\delta}^2$ and $\boldsymbol{\beta}$ over the possible values of $\boldsymbol{\beta}$ to obtain numerically the marginal posterior of $\sigma_{\delta}^2$, except for the normalising constant, and hence:

$$f(\sigma_{\delta}^2 \mid I) \propto \int L(\sigma_{\delta}^2, \boldsymbol{\beta} \mid I)\,\xi(\sigma_{\delta}^2, \boldsymbol{\beta})\, d\boldsymbol{\beta} \qquad (3.26)$$

where
$L(\sigma_{\delta}^2, \boldsymbol{\beta} \mid I)$ is the likelihood function and $\xi(\sigma_{\delta}^2, \boldsymbol{\beta})$ is the joint prior for $\sigma_{\delta}^2$ and $\boldsymbol{\beta}$. The likelihood function is approximated by a multivariate normal distribution with the covariance matrix ($\Lambda$) given by Equation 3.11:

$$L(\sigma_{\delta}^2, \boldsymbol{\beta} \mid I) = (2\pi)^{-n/2}\,|\Lambda|^{-1/2}\exp\!\left[-0.5\,(\mathbf{y}-X\boldsymbol{\beta})^{T}\Lambda^{-1}(\mathbf{y}-X\boldsymbol{\beta})\right] \qquad (3.27)$$

When the exponential prior for $\sigma_{\delta}^2$ is used with a truly non-informative uniform prior for $\boldsymbol{\beta}$, the joint prior becomes:

$$\xi(\sigma_{\delta}^2, \boldsymbol{\beta}) \propto \lambda\, e^{-\lambda \sigma_{\delta}^2} \qquad (3.28)$$

The posterior distribution for $\sigma_{\delta}^2$ alone can then be found by evaluating:

$$f(\sigma_{\delta}^2 \mid I) \propto e^{-\lambda \sigma_{\delta}^2} \int |\Lambda|^{-1/2} \exp\!\left[-0.5\,(\mathbf{y}-X\boldsymbol{\beta})^{T}\Lambda^{-1}(\mathbf{y}-X\boldsymbol{\beta})\right] d\boldsymbol{\beta} \qquad (3.29a)$$

The expression above can be rewritten in terms of the GLS estimator $\hat{\boldsymbol{\beta}}$ as:

$$f(\sigma_{\delta}^2 \mid I) \propto e^{-\lambda \sigma_{\delta}^2} \int |\Lambda|^{-1/2} \exp\!\left\{-0.5\left[(\mathbf{y}-X\hat{\boldsymbol{\beta}})^{T}\Lambda^{-1}(\mathbf{y}-X\hat{\boldsymbol{\beta}}) + (\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})^{T} X^{T}\Lambda^{-1}X\,(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})\right]\right\} d\boldsymbol{\beta} \qquad (3.29b)$$

Now, the first term inside the brackets is not a function of $\boldsymbol{\beta}$ and hence can be taken outside the integral. The integration then becomes:

$$\int \exp\!\left[-0.5\,(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})^{T} X^{T}\Lambda^{-1}X\,(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})\right] d\boldsymbol{\beta} = (2\pi)^{(k+1)/2}\,\bigl|X^{T}\Lambda^{-1}X\bigr|^{-1/2} \qquad (3.30)$$

Therefore, the posterior distribution of $\sigma_{\delta}^2$ is proportional to:

$$f(\sigma_{\delta}^2 \mid I) \propto e^{-\lambda \sigma_{\delta}^2}\,|\Lambda|^{-1/2}\,\bigl|X^{T}\Lambda^{-1}X\bigr|^{-1/2} \exp\!\left[-0.5\,(\mathbf{y}-X\hat{\boldsymbol{\beta}})^{T}\Lambda^{-1}(\mathbf{y}-X\hat{\boldsymbol{\beta}})\right] \qquad (3.31)$$
Equation 3.31 can then be used to calculate numerically the posterior pdf of the model error variance ($\sigma_{\delta}^2$), and its mean and variance, without the need for more sophisticated methods based on Monte Carlo simulation. The pdf of the model error variance may also be used to calculate the posterior distribution of the $\boldsymbol{\beta}$ coefficients using:

$$f(\boldsymbol{\beta} \mid I) = \int f(\boldsymbol{\beta} \mid \sigma_{\delta}^2, I)\, f(\sigma_{\delta}^2 \mid I)\, d\sigma_{\delta}^2 \qquad (3.32)$$

where $f(\boldsymbol{\beta} \mid \sigma_{\delta}^2, I)$ is a multivariate normal distribution. This result turns out to be a simple extension of the GLSR procedure developed in Stedinger and Tasker (1985). With efficient numerical integration procedures, the integral in Equation 3.32, as well as the mean and variance of $\sigma_{\delta}^2$, are easily computed.
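The numerical evaluation described above can be sketched on a uniform grid of candidate model error variances. This is a minimal illustrative sketch (numpy assumed, function name hypothetical), not the implementation used in the study:

```python
import numpy as np

def model_error_posterior(y, X, Sigma, lam, grid):
    """Evaluate the unnormalised posterior of the model error variance
    (Equation 3.31) on a uniform grid, then normalise numerically.
    Sigma is the sampling covariance matrix and lam the exponential-prior
    parameter."""
    logf = np.empty_like(grid)
    for k, s2 in enumerate(grid):
        Lam = s2 * np.eye(len(y)) + Sigma          # covariance, Equation 3.11
        Lam_inv = np.linalg.inv(Lam)
        XtLX = X.T @ Lam_inv @ X
        beta_hat = np.linalg.solve(XtLX, X.T @ Lam_inv @ y)  # GLS estimate
        r = y - X @ beta_hat
        # log of Equation 3.31, kept in log space for numerical stability
        logf[k] = (-lam * s2
                   - 0.5 * np.linalg.slogdet(Lam)[1]
                   - 0.5 * np.linalg.slogdet(XtLX)[1]
                   - 0.5 * r @ Lam_inv @ r)
    dens = np.exp(logf - logf.max())
    dx = grid[1] - grid[0]
    dens /= dens.sum() * dx                        # normalise to unit area
    post_mean = (grid * dens).sum() * dx           # posterior mean of s2
    return dens, post_mean
```

The posterior mean returned here is the quantity used later as the model error variance term in the predictive error variance.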
3.5.4 PRIORS FOR THE PARAMETERS AND THE QUANTILES OF THE LP3
DISTRIBUTION
It is well known that no model is perfect; any model that approximates a phenomenon will have error associated with it, hence the model error variance ($\sigma_{\delta}^2$) should be strictly positive: a model error variance of zero is highly unlikely in practice. It is also suspected, based on previous studies, that the model error variance for the regional skew model will be modest; this is especially the case when sampling error dominates the model error (or when the true model error variance is small compared to the sampling errors) (Reis, 2005; Reis et al., 2005). For the mean flood, standard deviation and flood quantiles, the model error variance tends to dominate the regional analysis. In this case a zero or negative value for the model error variance is highly unlikely and a strict informative prior may not be required. However, it is known that the model error variance in this case may suffer bias if it is estimated by a MOM estimator (Equation 3.14) (Stedinger and Tasker, 1986). This
may introduce further uncertainty into the regional model; hence a Bayesian estimator may
be attractive also in this case. A Bayesian estimator of the model error variance (Equation
3.31) as discussed above may be used to safeguard against uncertain model error variances,
as adopted in this study. Further details can be found in Reis et al. (2005) and Micevski and
Kuczera (2009). In summary, the Bayesian estimator offers a better way of dealing with the
model error variance and quantifying associated uncertainty about it.
The inverse-gamma distribution has been used in the past as it is a conjugate prior for normal regression problems. However, for the GLSR model described by Stedinger and Tasker (1986) its use may not be attractive: the inverse-gamma is a heavy right-hand-tailed distribution, and as such it can assign reasonably large probabilities to big variances when compared to other distributions, such as those with exponential tails. To avoid these problems, an exponential distribution is used for the prior. The exponential distribution, because of its thin right-hand tail, is considered more consistent with what are believed to be the likely values of the model error variances for regional regression models. It also has a non-zero pdf at zero, which allows the data, represented by the likelihood function, to provide information about the error variance near zero. The exponential pdf is:
$$\xi(\sigma_{\delta}^2) = \lambda\, e^{-\lambda \sigma_{\delta}^2}, \quad \sigma_{\delta}^2 \geq 0 \qquad (3.33)$$
Reis et al. (2005) provide a detailed discussion of the choice of a prior for the model error variance when regionalising the skew. For the regionalisation of skew, we employed $\lambda = 6$ (i.e. a prior mean of the model error variance of 1/6) following Reis et al. (2005), hence:

$$\xi(\sigma_{\delta}^2) = 6\, e^{-6 \sigma_{\delta}^2}, \quad \sigma_{\delta}^2 \geq 0 \qquad (3.34)$$
To derive the prior distribution for the standard deviation, mean flood and flood quantiles of the LP3 distribution, we used an informative one-parameter exponential distribution where the residual error variance estimate taken from the OLSR is used as the prior mean of the model error variance. For example, if the residual error variance ($\sigma_{OLS}^2$) from the OLSR is 0.12, we take the inverse of this value, i.e. $1/0.12 = 8.33$. Hence, the prior distribution of $\sigma_{\delta}^2$ is an exponential distribution with mean equal to $1/8.33$, therefore $\lambda = 8.33$:

$$\xi(\sigma_{\delta}^2) = 8.33\, e^{-8.33 \sigma_{\delta}^2}, \quad \sigma_{\delta}^2 \geq 0 \qquad (3.35)$$

It should be made clear that the parameter $\lambda$ can have varying influences on the estimated coefficients of the regional regression model and on the estimated model error variances; as such, we choose $\lambda$ values that are likely to be close to the real values (i.e. taking the OLSR results for $\lambda$).
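The prior construction in Equation 3.35 can be checked numerically. A small sketch (numpy assumed; the value 0.12 is the worked example above):

```python
import numpy as np

# Equation 3.35: the OLSR residual error variance (0.12 in the example
# above) is taken as the prior mean of the model error variance, so the
# exponential-prior parameter is its reciprocal, lambda = 1/0.12 = 8.33.
sigma2_ols = 0.12
lam = 1.0 / sigma2_ols

def prior(s2):
    """Exponential prior pdf for the model error variance (Equation 3.33)."""
    return lam * np.exp(-lam * s2)

# numerical check: the prior mean recovered by integration is ~0.12
s2 = np.linspace(0.0, 3.0, 120001)
prior_mean = np.sum(s2 * prior(s2)) * (s2[1] - s2[0])
```

The recovered prior mean equals the OLSR residual error variance, confirming the intended calibration of $\lambda$.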
3.6 SELECTING PREDICTOR VARIABLES
This section describes the approach adopted for selecting the predictor variables that should
be included in the prediction equations (regression models). The approach for selecting
predictor variables used in this study provides improvements over current methods used to
justify model selection in the BGLSR framework. Provided below is a discussion on the
BGLSR statistics that guided the model selection.
We use a procedure similar to forward stepwise regression utilising all the sites for each
state (separate regression for each state) and initially adopting just a constant term in the
regression equation. The model error variance and its standard error are noted. We then add
predictor variables starting with area followed by different combinations of other variables.
In all, 16 different combinations of predictor variables were used for the mean, standard
deviation and skew models, while 25 combinations were trialled for the flood quantile
models. Further information regarding the preparation and extraction of the catchment
characteristics can be found in Chapter 4 of this thesis.
3.6.1 AVERAGE VARIANCE OF PREDICTION
In RFFA, the objective is to make prediction at both gauged and ungauged sites; hence a
statistic appropriate for evaluation of model selection is the variance of prediction, which in
many cases depends on the explanatory variables at both gauged and ungauged sites.
Hence, Tasker and Stedinger (1989) suggested the use of the average variance of prediction (AVP). Using a GLSR model, one can predict a hydrological statistic on average over a new region; this becomes the average variance of prediction for a new site, AVP_new, which is made up of the average model error and the average sampling error (Tasker and Stedinger, 1986). For a BGLSR analysis, according to Gruber et al. (2007):

$$AVP_{new} = E\bigl[\sigma_{\delta}^2 \mid \mathbf{y}\bigr] + \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i\, \mathrm{Var}\bigl[\hat{\boldsymbol{\beta}} \mid \mathbf{y}\bigr]\, \mathbf{x}_i^{T} \qquad (3.36)$$
Also, if the prediction is for a site that was used in the estimation of the regional regression model, the measure of prediction AVP_old requires an additional term:

$$AVP_{old} = E\bigl[\sigma_{\delta}^2 \mid \mathbf{y}\bigr] + \frac{1}{n}\sum_{i=1}^{n}\left[\mathbf{x}_i\, \mathrm{Var}\bigl[\hat{\boldsymbol{\beta}} \mid \mathbf{y}\bigr]\, \mathbf{x}_i^{T} - 2\,\mathbf{x}_i\bigl(X^{T}\Lambda^{-1}X\bigr)^{-1}X^{T}\Lambda^{-1}\Lambda\,\mathbf{e}_i\right] \qquad (3.37)$$
where ei is a unit column vector with 1 at the ith row and 0 otherwise.
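Equation 3.36 splits the prediction variance into a model error term and a parameter uncertainty term; a minimal sketch of that computation follows (numpy assumed; function names are illustrative, and the coefficient covariance is approximated here by the GLS form rather than the full posterior):

```python
import numpy as np

def beta_covariance(X, Sigma, s2):
    """Covariance of the GLS coefficient estimator, (X' Lam^-1 X)^-1,
    evaluated at a given model error variance s2 (an approximation to the
    posterior Var[beta|y])."""
    Lam = s2 * np.eye(X.shape[0]) + Sigma
    return np.linalg.inv(X.T @ np.linalg.inv(Lam) @ X)

def avp_new(X, s2_mean, var_beta):
    """Average variance of prediction for a new site (Equation 3.36):
    posterior mean model error variance plus the average of
    x_i Var(beta) x_i' over the n sites."""
    n = X.shape[0]
    param_term = np.mean([X[i] @ var_beta @ X[i] for i in range(n)])
    return s2_mean + param_term
```

Because the parameter uncertainty term is non-negative, AVP_new is always at least the posterior mean of the model error variance.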
3.6.2 BAYESIAN AND AKAIKE INFORMATION CRITERIA
In this study both the Akaike and Bayesian information criteria are used as statistics for
model selection. The Akaike information criterion (AIC) developed by Akaike (1974) is
given by Equation 3.38. It is calculated based on the definition given by Greene (2003), where SSTO is the total sum of squared deviations about the mean corrected for the sampling error, $n$ is the sample size for the regression, $k$ is the number of predictor variables in the fitted regression model, and $R_{GLS}^2$ is the pseudo coefficient of determination used in BGLSR (explained in section 3.6.4). The first term on the right-hand side of Equation 3.38 essentially measures the true lack of fit, while the second term measures model complexity, which is related to the number of predictor variables. AIC is given by:

$$AIC = \frac{SSTO}{n}\left(1 - R_{GLS}^2\right)\exp\!\left[\frac{2(k+1)}{n}\right] \qquad (3.38)$$

In practice, after the computation of the AIC for all of the competing models, one selects the model with the minimum AIC value, $AIC_{min}$.
The Bayesian information criterion (BIC) (Schwarz, 1978) is very similar to AIC, but is developed in a Bayesian framework and is calculated based on the definition given by Greene (2003):

$$BIC = \frac{SSTO}{n}\left(1 - R_{GLS}^2\right) n^{(k+1)/n} \qquad (3.39)$$

The BIC penalises models with higher values of $k$ more heavily than does AIC. Since SSTO and $R_{GLS}^2$ depend on the sample size, the competing models can be compared using AIC and BIC only if fitted using the same sample, as done in this study. As with the AIC, one selects the model with the minimum BIC value, $BIC_{min}$.
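The two criteria differ only in how they penalise extra predictors; a minimal sketch (numpy assumed, with illustrative SSTO and $R_{GLS}^2$ values) makes the comparison concrete:

```python
import numpy as np

def aic_bic(ssto, r2_gls, n, k):
    """AIC and BIC in the form of Equations 3.38-3.39 (after Greene, 2003);
    ssto is the corrected total sum of squares, r2_gls the pseudo R-squared,
    n the number of sites and k the number of predictors."""
    lack_of_fit = (ssto / n) * (1.0 - r2_gls)
    aic = lack_of_fit * np.exp(2.0 * (k + 1) / n)   # Equation 3.38
    bic = lack_of_fit * n ** ((k + 1) / n)          # Equation 3.39
    return aic, bic
```

For two candidate models with the same fit, the one with more predictors scores worse on both criteria, and (for $n > e^2$) the BIC penalty grows faster with $k$ than the AIC penalty, as stated above.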
3.6.3 BAYESIAN PLAUSIBILITY VALUE
The significance of the regression coefficient values ($\beta$) obtained was evaluated using the Bayesian plausibility value (BPV) developed by Reis et al. (2005) and Gruber et al. (2007); the mathematical derivations can be found in those references. The BPV allows one to perform the equivalent of a classical hypothesis P-value test within a Bayesian framework. The advantage of the BPV is that it uses the posterior distribution of each parameter, which also reflects the prior. The BPV in this study was carried out at the 5% significance level.
3.6.4 COEFFICIENT OF DETERMINATION
The traditional coefficient of determination (R2) measures the degree to which a model
explains the variability in the dependent variable. It uses the partitioning of the sum of
squared deviations and associated degrees of freedom to describe the variance of the signal
versus the model error. Traditionally for OLSR, the Total-Sum-of-Squared deviations
about the mean (SST) is divided into two separate terms, the Sum-of-Squared Errors
explained by the regression model (SSR) and the residual Sum-of-Squared Errors (SSE),
where SST = SSR + SSE.
Reis et al. (2005) proposed a pseudo coefficient of determination ($R_{GLS}^2$) appropriate for use with the GLSR. For the traditional $R^2$, both the SSE and SST include sampling and model error variances, and therefore this statistic can grossly misrepresent the true power of the GLSR model to explain the actual variation in the $y_i$. Hence, for the GLSR a more appropriate pseudo coefficient of determination is defined by:

$$R_{GLS}^2 = 1 - \frac{\hat{\sigma}_{\delta}^2(k)}{\hat{\sigma}_{\delta}^2(0)} = \frac{\hat{\sigma}_{\delta}^2(0) - \hat{\sigma}_{\delta}^2(k)}{\hat{\sigma}_{\delta}^2(0)} \qquad (3.40)$$

where $\hat{\sigma}_{\delta}^2(k)$ and $\hat{\sigma}_{\delta}^2(0)$ are the model error variances when $k$ and no explanatory variables are used, respectively. Here, $R_{GLS}^2$ measures the improvement of a GLSR model with $k$ explanatory variables against the estimated error variance for a model without any explanatory variables. If $\hat{\sigma}_{\delta}^2(k) = 0$, then $R_{GLS}^2 = 1$ as it should be, even though the model is not perfect, because the variance of the at-site estimator is still not zero (the sampling error variance remains greater than zero).
3.6.5 OTHER MODEL SELECTION CRITERIA
A predictor variable having an estimated coefficient (other than the constant) that was less
than two posterior standard deviations away from zero was rejected (this shows the relative
importance of the predictor) (Hackelbusch et al., 2009). In all the cases the simplest model
was preferred.
3.7 FORMATION OF REGIONS
The fixed region BGLSR analysis as above identifies the catchment characteristics that best
account for heterogeneity by minimising the model error variance. However, it is assumed
that some spatial structure may remain in the model error residuals. With this in mind, the model error variance within possible sub-regions of the fixed region should therefore be less than the fixed-region model error variance. This is investigated further in this
study (see Chapter 5). It is in this framework that the ROI approach was applied to the
parameters (i.e. mean, standard deviation and skew) and flood quantiles of the LP3
distribution to further reduce the heterogeneity unaccounted for by the fixed region BGLSR
model.
The ROI approach in this study uses the distance between sites as the distance metric (i.e.
geographic proximity). We apply the ROI within the state boundaries (see Figure 3) in the
following way. For the ROI within the state boundaries, for the first iteration, the 15
nearest stations to the site of interest are selected and a regional BGLSR is performed and
the predictive variance (Equations 3.14 and 3.31) is noted. The initial number of stations for the first iteration was chosen because smaller ROIs caused numerical instability in fitting the BGLSR. The second
iteration proceeds with the next five closest stations being added to the ROI and repeating
the regression. This procedure terminates when all the sites in the region have been
included in the ROI. The ROI for the site of interest is then selected as the one which yields
the lowest predictive variance.
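The iteration just described can be sketched as follows. This is an illustrative sketch only (numpy assumed): `fit_bglsr` is a hypothetical stand-in for the full BGLSR fit, which here only needs to return the predictive error variance for a candidate set of sites.

```python
import numpy as np

def roi_select(distances, fit_bglsr, n_start=15, step=5):
    """Region-of-influence selection: start with the n_start nearest sites,
    enlarge the region by `step` sites at a time, and keep the region that
    yields the lowest predictive error variance."""
    order = np.argsort(distances)              # sites ranked by proximity
    sizes = list(range(n_start, len(order) + 1, step))
    if sizes and sizes[-1] != len(order):
        sizes.append(len(order))               # finish with all sites included
    best_var, best_sites = np.inf, None
    for size in sizes:
        sites = order[:size]
        pev = fit_bglsr(sites)                 # predictive error variance
        if pev < best_var:
            best_var, best_sites = pev, sites
    return best_sites, best_var
```

In the study the same loop is run with the site of interest left out, so that each site is treated as ungauged.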
The ROI approach presented here is fundamentally different to that of Tasker et al. (1996) in the following respects:

(i) it seeks to minimise the regression model’s predictive error variance, rather than selecting or assuming a fixed number of sites that minimise a distance metric in catchment characteristic space;

(ii) the ROI criterion of Tasker et al. (1996) cannot guarantee minimum predictive variance; and

(iii) the selection of sites that are minimally different in catchment characteristic space may result in greater uncertainty in the estimated regression coefficients.
It should be noted that the predictive error variance has two terms associated with it:
(i) the model error variance; and
(ii) the predictive variance arising from uncertainty in the estimated regression
coefficients.
The first term is the posterior expected value of the model error variance estimated using
the approach of Reis et al. (2005), see section 3.5.3 and Equation 3.31 – this is always non-
zero and guards against situations where the most likely value of the model error variance
is zero. The second term effectively guards against the ROI favouring fewer sites to minimise the model error variance; indeed, as the number of sites is reduced, any reduction in the model error variance is likely to be offset by an increase in uncertainty in the estimated regression coefficients ($\boldsymbol{\beta}$). Figure 3 illustrates the ROI approach as adopted in this study.
Figure 3 Example of ROI techniques applied in this study
3.8 REGRESSION DIAGNOSTICS
The assessment of the regional regression model is made by using a number of statistical
diagnostics such as a pseudo–coefficient of determination (as discussed already in section
3.6.4) and the standard error of prediction. An analysis of variance for the BGLSR model is
also undertaken to examine which portion of the total error (sampling or model) dominates
the regional analysis for both the fixed region and ROI methods. This study also uses
Cook’s distance, the standardised residuals and Z-score analysis in a GLSR framework to identify outlier sites; the absence of outliers in the regression diagnostics indicates the overall adequacy of the regional model. These statistics are described below.
3.8.1 STANDARD ERROR OF PREDICTION
If the standardised residuals have a nearly normal distribution (to be determined in the residual analysis, see below), the standard error of prediction in percent (SEP) (Tasker et al., 1986) for the true flood quantile or parameter estimator is given by:

$$SEP(\%) = 100\left[\exp\bigl(AVP_{new}\bigr) - 1\right]^{0.5} \qquad (3.41)$$
3.8.2 RESIDUAL ANALYSIS
Important to this study is the assessment of the adequacy of the regional regression model
in its application to ungauged catchments. The measure of the raw residual (ri), which is the
difference between the sample (at-site estimate) and regional estimates of the LP3
parameter or flood quantile can be assessed initially for major deviations. However,
interpreting the raw residual may be misleading as the raw residual has three sources of
uncertainty: model error, sampling error and uncertainty due to regression coefficients
being unknown.
In this study, the standardised residual rsi is used, which is the raw residual divided by its
standard deviation defined as the square root of the sum of the predictive variance of the
LP3 parameter or flood quantile and its sampling variance given by the appropriate
diagonal element of the sampling covariance matrix. This yields the definition:

$$r_{si} = \frac{r_i}{\left[\lambda_i + \mathbf{x}_i\bigl(X^{T}\Lambda^{-1}X\bigr)^{-1}\mathbf{x}_i^{T}\right]^{0.5}} \qquad (3.42)$$

where $\lambda_i$ is the $i$th diagonal element of $\Lambda$.
To assess the adequacy of the estimated LP3 parameters and flood quantiles from the QRT
and PRT, standardised residuals, referred to as Z-scores are used. For site i and a given
ARI, the Z-score is:

$$Z_{ARI,i} = \frac{\log_e\!\left(Q_{ARI,i}\right) - \log_e\!\left(\hat{Q}_{ARI,i}\right)}{\sqrt{\sigma_{ARI,i}^2 + \hat{\sigma}_{ARI,i}^2}} \qquad (3.43)$$
Here the numerator is the difference between the at-site flood quantile and the regional flood quantile (estimated from the developed prediction equation), and the denominator is the square root of the sum of the variances of the at-site ($\sigma_{ARI,i}^2$) and regional ($\hat{\sigma}_{ARI,i}^2$) flood quantiles in natural logarithm space.
It is reasonable to assume that the errors in the two estimators are independent, because $Q_{ARI,i}$ is an unbiased estimator of the true quantile based upon the at-site data, whereas the error in $\hat{Q}_{ARI,i}$ is mostly due to the failure of the best regional model to estimate accurately the true at-site flood quantile. The use of log space makes the difference approximately normally distributed and hence enables the use of standard statistical tests.
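Equation 3.43 reduces to a one-line computation; a minimal sketch (numpy assumed, function name illustrative):

```python
import numpy as np

def z_score(q_at_site, q_regional, var_at_site, var_regional):
    """Z-score of Equation 3.43: difference of the natural-log quantiles
    scaled by the combined standard error of the two (independent)
    estimates."""
    return ((np.log(q_at_site) - np.log(q_regional))
            / np.sqrt(var_at_site + var_regional))
```

For example, an at-site quantile 20% above the regional estimate with log-space variances 0.04 and 0.05 gives a Z-score of about 0.61, well within the range expected under a standard normal distribution.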
3.8.3 COOK’S DISTANCE
Tasker and Stedinger (1989) extended measures such as Cook’s distance ($D_i$) from the OLSR to the GLSR case. Tasker and Stedinger (1989) and Reis et al. (2005) suggested that the influence of a site is large when $D_i$ is greater than $4/n$, where $n$ is the number of sites in the region. Further details on the mathematical derivation of Cook’s distance can be found in the noted references.
3.9 EVALUATION STATISTICS
A LOO cross validation procedure is applied to assess the performance of the different
RFFA methods. The site that is left out in building the model is in effect being treated as an
ungauged site. Since all the sites in the database are being treated as ungauged for ROI this
automatically satisfies the LOO validation approach. The following performance statistics
are calculated from the fixed and ROI analysis: absolute (abs) median relative error (REr)
in % over n sites, the relative root mean square error (RMSEr) in % and the average ratio
(rr) of the predicted flood quantile to observed flood quantile as described below.
$$\mathrm{RE_r} = \mathrm{Median}\left[\,\mathrm{abs}\!\left(\frac{Q_{pred_i} - Q_{obs_i}}{Q_{obs_i}}\right)\times 100\right] \qquad (3.44)$$

$$\mathrm{RMSE_r} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{Q_{pred_i} - Q_{obs_i}}{Q_{obs_i}}\right)^2}\times 100 \qquad (3.45)$$

$$r_r = \frac{1}{n}\sum_{i=1}^{n}\frac{Q_{pred_i}}{Q_{obs_i}} \qquad (3.46)$$
where $Q_{obs_i}$ is the observed flood quantile at site $i$ obtained from at-site flood frequency analysis estimated using FLIKE (Kuczera, 1999a), $Q_{pred_i}$ is the predicted flood quantile at site $i$ from the regional prediction equation (QRT and PRT), and $n$ is the number of sites
in the region. The REr (%) and RMSEr (%) provide an indication of the overall accuracy of
the regional model. The model with minimum REr is always preferred. For RMSEr the
smallest value between the two competing models with the same number of parameters is
generally preferred. It should be noted here that both the Qpred and Qobs values have
uncertainties associated with them, and in particular, the Qobs values are subject to errors
due to the annual maximum flood record length, rating curve extrapolation errors, selection
of probability distribution and associated parameter estimation procedures. The above error
statistics thus give some guidance about the relative accuracy of the method and should not
be taken as the true uncertainty associated with the method.
The average value of the Qpred/Qobs (rr) gives an indication of the degree of bias (i.e.
systematic over- or under estimation), where a value of 1 indicates good average agreement
between the Qpred and Qobs as both of these values are essentially random variables. An rr
value in the range of 0.5 to 2 may be regarded as ‘desirable (D)’, a value smaller than 0.5
may be regarded as ‘gross underestimation (U)’, and a value greater than 2 may be
regarded as ‘gross overestimation (O)’. It should be mentioned that these are arbitrary limits, set as a relatively wide band in recognition of the significant uncertainty in the estimates from RFFA methods in Australia; they therefore provide only a reasonable guide to the relative accuracy of the methods as far as their practical application is concerned.
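The three statistics and the D/U/O classification above can be sketched together (numpy assumed; function name illustrative):

```python
import numpy as np

def evaluation_stats(q_pred, q_obs):
    """RE_r and RMSE_r (both in %) and the mean ratio r_r of Equations
    3.44-3.46, plus the arbitrary D/U/O classification used in this study."""
    rel = (q_pred - q_obs) / q_obs               # relative errors
    re_r = 100.0 * np.median(np.abs(rel))        # Equation 3.44
    rmse_r = 100.0 * np.sqrt(np.mean(rel ** 2))  # Equation 3.45
    r_r = np.mean(q_pred / q_obs)                # Equation 3.46
    label = 'D' if 0.5 <= r_r <= 2.0 else ('U' if r_r < 0.5 else 'O')
    return re_r, rmse_r, r_r, label
```

For example, predictions that over- and under-estimate by 10% at two of three sites give RE_r = 10%, r_r = 1.0 and the classification ‘D’.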
3.10 REGIONAL UNCERTAINTY WITH FLOOD QUANTILE
ESTIMATION
For the ARIs considered in this study (2 – 100 years), this section considers only the uncertainty in the regional flood quantile estimation based on the BGLSR-PRT and ROI, where the ROI without state borders is used. In the annual
maximum series models, the mean flood ($\mu$), standard deviation of floods ($\sigma$) and skewness ($\gamma$) are considered as regional variables (i.e. the regional at-site estimates of the LP3 parameters). The regional T-year event estimate for the PRT is given by:

$$Q_{T,\mathrm{Re}} = \exp\!\left(\mu_{\mathrm{Re}} + K_T\,\sigma_{\mathrm{Re}}\right) \qquad (3.47)$$

where the subscript ‘Re’ refers to the site where the regional estimation is made and $K_T$ is the LP3 frequency factor for ARI $T$, which depends on the skew $\gamma_{\mathrm{Re}}$.
The uncertainty associated with the regional T-year event estimate can be found by
combining the BGLSR method with multivariate normal distribution (MVN). The
advantage of using the BGLSR is that it provides an estimate of the annual maximum series
hydrologic statistics and their associated posterior variances. The posterior variance reflects
the uncertainty related to the residual regional heterogeneity (model error variance) as well
as sampling variability corrected for inter-site correlation while also reflecting the prior
used. Thus the model error variance term is found by Equation 3.31 for the regional
estimate of the $\mu$, $\sigma$ and $\gamma$ parameters. These regional values, along with the MVN, can be used to quantify the uncertainty in the flood quantile estimates by deriving the 90% confidence limits.
3.10.1 THE MULTIVARIATE NORMAL DISTRIBUTION
The MVN distribution model extends the univariate normal distribution model to fit vector
observations. An $np$-dimensional vector of random variables

$$\mathbf{Y} = \left(Y_1, Y_2, \ldots, Y_{np}\right) \qquad (3.48)$$

is said to have a multivariate normal distribution if its density function $f(\mathbf{Y})$ is of the form

$$f(\mathbf{Y}) = f(Y_1, \ldots, Y_{np}) = (2\pi)^{-np/2}\left|\Sigma\right|^{-1/2}\exp\!\left[-\tfrac{1}{2}\left(\mathbf{Y}-\boldsymbol{\mu}_y\right)^{T}\Sigma^{-1}\left(\mathbf{Y}-\boldsymbol{\mu}_y\right)\right] \qquad (3.49)$$

where $\boldsymbol{\mu}_y = (\mu_{y_1}, \ldots, \mu_{y_{np}})$ is the vector of means (in this case the regional at-site estimates of the hydrological statistics of interest) and $\Sigma$ is the variance-covariance matrix of the MVN distribution. This can also be written using the notation of Equation 3.50. The variances for use with the MVN distribution are taken from the BGLSR analysis (i.e. the posterior variances for each parameter estimate, see Figure 4).

$$\mathbf{Y} \sim N_{np}\left(\boldsymbol{\mu}_y, \Sigma\right) \qquad (3.50)$$
For the univariate case, when np = 1, (i.e. parameter of the LP3 distribution) the one-
dimensional vector Y =Y1 has the normal distribution with mean y and variance 2ˆ .
For the bivariate case, when np = 2, (i.e. and parameters of the LP3 distribution), Y =
(Y1, Y2) has the bivariate normal distribution with two-dimensional vector of means, y =
(y1, y2) and covariance matrix with the correlation (ρ) between the two random variables is
given by:
2
,
,
2
ˆˆˆ
ˆˆˆ
stdevstdevmeanstdevmean
stdevmeanstdevmeanmean
(3.51)
For the trivariate case, when np = 3 (i.e. the mean, standard deviation and skew parameters of the LP3 distribution),
\mathbf{Y} = (Y_1, Y_2, Y_3) has the trivariate normal distribution with three-dimensional vector of
means, \mu_y = (y_1, y_2, y_3), and covariance matrix (with correlations \rho between the three
random variables) given by:
\Sigma = \begin{bmatrix}
\hat{\sigma}^2_{mean} & \rho_{mean,stdev}\,\hat{\sigma}_{mean}\hat{\sigma}_{stdev} & \rho_{mean,skew}\,\hat{\sigma}_{mean}\hat{\sigma}_{skew} \\
\rho_{mean,stdev}\,\hat{\sigma}_{mean}\hat{\sigma}_{stdev} & \hat{\sigma}^2_{stdev} & \rho_{stdev,skew}\,\hat{\sigma}_{stdev}\hat{\sigma}_{skew} \\
\rho_{mean,skew}\,\hat{\sigma}_{mean}\hat{\sigma}_{skew} & \rho_{stdev,skew}\,\hat{\sigma}_{stdev}\hat{\sigma}_{skew} & \hat{\sigma}^2_{skew}
\end{bmatrix} \qquad (3.52)
By using Equations (3.50 and 3.52), 10,000 values are generated for each of the mean,
standard deviation and skew of the LP3 distribution (see Equation 3.47).
The T-year flood quantile is then estimated (see Equation 3.47) such that there will be 10,000
values of Q_T^{Re}. The Q_T^{Re} values are then ranked in ascending order of magnitude and the 5th, 50th and
95th percentile values are extracted. Figure 4 provides a good summary of the important
steps involved in deriving the confidence limits.
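The simulation steps above can be sketched in a few lines of code. All numerical values below (parameter estimates, posterior standard errors and correlations) are illustrative placeholders, and the Pearson III frequency factor of Equation 3.47 is approximated here with the Wilson–Hilferty transformation rather than the exact fitting procedure used in the thesis:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical regional LP3 parameter estimates (mean, std dev, skew of the
# log10 flows), posterior standard errors and parameter correlations.
mu = np.array([2.0, 0.35, -0.1])
se = np.array([0.05, 0.03, 0.15])
rho = np.array([[1.0, 0.3, 0.1],
                [0.3, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
cov = np.outer(se, se) * rho            # covariance matrix as in Equation 3.52

rng = np.random.default_rng(1)
sims = rng.multivariate_normal(mu, cov, size=10000)   # 10,000 parameter sets

def frequency_factor(z, g):
    """Wilson-Hilferty approximation of the Pearson III frequency factor."""
    if abs(g) < 1e-8:
        return z
    return (2.0 / g) * ((1.0 + g * z / 6.0 - g * g / 36.0) ** 3 - 1.0)

z = norm.ppf(1.0 - 1.0 / 100.0)         # standard normal deviate, T = 100 years
kt = np.array([frequency_factor(z, g) for g in sims[:, 2]])
log_q = sims[:, 0] + kt * np.abs(sims[:, 1])   # log10 quantile for each set
q = 10.0 ** log_q

# 90% confidence limits (5th and 95th percentiles) and the median
q5, q50, q95 = np.percentile(q, [5, 50, 95])
```

The 5th and 95th percentile values of the 10,000 simulated quantiles give the lower and upper 90% confidence limits.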
Figure 4 Use of multivariate normal distribution to develop confidence limits by Monte Carlo
simulation
[Flowchart: mean ~ N(y_1, \sigma_{mean}), standard deviation ~ N(y_2, \sigma_{stdev}) and skew ~ N(y_3, \sigma_{skew}), with parameter correlations \rho_{mean,stdev}, \rho_{mean,skew} and \rho_{stdev,skew}; simulate 10,000 sets of mean, standard deviation and skew from the multivariate normal distribution; obtain 10,000 values of Q_T^{Re} from Equation 3.47; order the 10,000 Q_T^{Re} values in ascending order and extract the 5th and 95th percentile values.]
3.11 VALIDATION OF REGIONAL HYDROLOGICAL REGRESSION
MODELS – METHODOLOGY
3.11.1 THE HYDROLOGICAL REGRESSION PROBLEM
Suppose we have a dataset of n sites in a region with k potential catchment characteristics
(independent variables) x_{i1}, x_{i2}, …, x_{ik} and a response variable y_i (i = 1, 2, …, n),
which can be a flood statistic (e.g. the mean flood) or a flood quantile. The relationship between the
response and independent variables is often assumed to be linear. A few
assumptions are also made on the data for hydrological regression; for instance, that the data are
representative of the regression relationship to be developed and that the random errors are
homoscedastic (see section 3.3.1). The OLSR and GLSR based
regional regression model can be written in matrix notation as:
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (3.53)
where \mathbf{y} = (y_1, y_2, \ldots, y_n)^T is the response vector of flood quantiles or the flood statistic of
interest (the superscript 'T' denotes the transpose), \mathbf{X} = (x_{i,j}) (i = 1, 2, \ldots, n; j = 1, 2, \ldots, k) is
an [n \times k] matrix, \boldsymbol{\beta} is a k-dimensional vector of unknown regression coefficients to be
estimated, and \boldsymbol{\varepsilon} is an n \times 1 random error vector assumed to have mean zero and covariance
matrix defined by:
E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T) = \sigma^2 \, \Omega \qquad (3.54)
wherein \sigma^2 is the model error variance and \Omega is a positive definite symmetric matrix
(Johnston, 1972; Rencher, 2000; Koop, 2005). Different choices of the \Omega matrix allow one
to make different assumptions regarding the nature of the model errors. If \Omega is equal to the
identity matrix \mathbf{I}_n, the problem is homoscedastic, and the GLSR model reduces to OLSR.
In more general cases, when \Omega is defined to reflect heteroscedasticity and correlation among
residuals, GLSR is a more reasonable estimator.
Stedinger and Tasker (1985, 1986) developed a GLSR model for regional hydrologic
analysis. The important difference from OLSR is the development and partition of the
covariance matrix of the errors. The GLSR model assumes that the total error results from
two sources: model errors \delta_i that are assumed to be independently distributed with mean
zero, E(\delta_i) = 0, and a common variance:

\mathrm{Cov}(\delta_i, \delta_j) = \begin{cases} \sigma^2_\delta & i = j \\ 0 & i \neq j \end{cases} \qquad (3.55)
and sampling errors that arise due to the fact that the actual values of y_i are unknown and only
estimates of the quantities of interest are available.
Therefore, Equation 3.53 becomes (following Reis et al., 2005):
\hat{\mathbf{y}} = \mathbf{y} + \boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\delta} + \boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (3.56)
where \boldsymbol{\eta} is the sampling error in the sample estimators. Thus, the regression-model errors \varepsilon_i
are a combination of: (i) time-sampling error \eta_i in the sample estimators \hat{y}_i of y_i and (ii)
underlying model error \delta_i (lack of fit). The total error has mean zero and covariance
matrix:

E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T) = \Lambda = \sigma^2_\delta \mathbf{I} + \Sigma(\hat{\mathbf{y}}) \qquad (3.57)
where \Sigma(\hat{\mathbf{y}}) is the covariance matrix of the sampling errors in the sample estimators. Time-
sampling errors in estimators of the y_i's are usually correlated among sites because flows at
nearby sites are driven by similar hydrological mechanisms (e.g. meteorology). Reasonably
accurate estimation of the sampling covariance matrix is therefore vital to the solution of the
GLSR equations. More details about the construction of \Sigma(\hat{\mathbf{y}}) for flood quantiles and
statistics can be found in Stedinger and Tasker (1985, 1986), Reis et al. (2005), Griffis and
Stedinger (2007) and section 3.5.1 of this thesis.
In both regression approaches (OLSR and GLSR), the true values of the regression
coefficients are unknown. To be able to determine the best possible model, it is necessary
to decide which of the different \beta's should be included in the model. In typical ordinary
stepwise regression this is equivalent to selecting the best set of independent variables for a
regression model. Consider the case that uses GLSR (see Equation 3.56), where a more
parsimonious model may be true such that:
\mathbf{y} = \mathbf{X}_\alpha \boldsymbol{\beta}_\alpha + \boldsymbol{\varepsilon}_\alpha \qquad (3.58)
where \alpha is a subset of \{1, 2, \ldots, k\}, \mathbf{X}_\alpha denotes the matrix whose columns are those of
\mathbf{X} indexed by the integers in \alpha, \Lambda_{R,C} denotes the sampling covariance matrix
whose rows and columns are those of \Lambda indexed by \alpha, and \boldsymbol{\beta}_\alpha denotes the vector
whose components are those of \boldsymbol{\beta} indexed by the integers in \alpha. Hence there
are in total 2^k - 1 possible different models of the form represented by Equation 3.58. For the
model of the form of Equation 3.56, if \alpha is selected, the model is fitted based on Equation
3.58:
\hat{\boldsymbol{\beta}}_{GLSR} = \left( \mathbf{X}_\alpha^T \Lambda_{R,C}^{-1} \mathbf{X}_\alpha \right)^{-1} \mathbf{X}_\alpha^T \Lambda_{R,C}^{-1} \hat{\mathbf{y}} \qquad (3.59)
where \hat{\boldsymbol{\beta}}_{GLSR} is an estimate of \boldsymbol{\beta}_\alpha; when \Sigma(\hat{\mathbf{y}}) = 0, Equation 3.59 reduces to the OLSR
solution. Further information on this can be found in Stedinger and Tasker (1985, 1986).
Equation 3.59 is solved by employing an iterative procedure using a MOM estimator (see
Stedinger and Tasker, 1985 and section 3.3.1 of this thesis).
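A minimal sketch of this iterative procedure is given below, assuming a supplied (here diagonal) sampling covariance matrix and using a bisection search for the model error variance: the weighted residual sum of squares is driven to its expected value n − k, in the spirit of the MOM estimator of Stedinger and Tasker (1985). Function and variable names are illustrative, not the thesis implementation:

```python
import numpy as np

def glsr_fit(X, y, sampling_cov, tol=1e-8, max_iter=200):
    """Sketch of the iterative GLSR estimator.

    The model error variance sigma2 is found by a method-of-moments style
    search: it is adjusted until the weighted residual sum of squares
    equals its expected value, n - k."""
    n, k = X.shape

    def beta_for(sigma2):
        lam = sigma2 * np.eye(n) + sampling_cov          # Equation 3.57
        lam_inv = np.linalg.inv(lam)
        b = np.linalg.solve(X.T @ lam_inv @ X, X.T @ lam_inv @ y)  # Eq. 3.59
        r = y - X @ b
        return b, float(r @ lam_inv @ r)

    lo, hi = 0.0, float(np.var(y)) + 1.0
    b, wss = beta_for(lo)
    if wss <= n - k:          # sampling error alone explains the scatter
        return b, 0.0
    mid = hi
    for _ in range(max_iter):  # bisection on sigma2 (wss decreases in sigma2)
        mid = 0.5 * (lo + hi)
        b, wss = beta_for(mid)
        if abs(wss - (n - k)) < tol:
            break
        if wss > n - k:
            lo = mid
        else:
            hi = mid
    return b, mid
```

When the sampling covariance is set to zero the weights become equal and the fit collapses to OLSR, mirroring the remark after Equation 3.59.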
After determining the model for use in hydrological regression, the overall performance of
the model is then evaluated according to its prediction ability, e.g. how well the model can
predict flood quantiles for ungauged catchments. In most regression applications, the mean
squared error of prediction (MSEP) of a model represents its prediction ability; in practice,
the lower the MSEP, the better the prediction ability of the model.
3.11.2 MODEL SELECTION BY MONTE CARLO CROSS VALIDATION
In statistical inference the term cross validation is widely used with a broad meaning;
to avoid ambiguity, this study uses the general term 'validation' associated with
either LOO or MCCV. In general, validation attempts to select a model based on the
prediction ability of the model (Breiman et al., 1984; Zhang, 1993; Burman, 1989). For
general validation, when \alpha is selected, the n datasets (denoted by S) are split into two parts.
The first part (the calibration set), denoted by S_c (with corresponding submatrix \mathbf{X}_{S_c} and
subvector \mathbf{y}_{S_c}), contains n_c datasets for fitting the model.
The second part (the validation set), denoted by S_v (with corresponding submatrix \mathbf{X}_{S_v} and
subvector \mathbf{y}_{S_v}), contains n_v = n - n_c datasets for validating the model. There are in total
\binom{n}{n_v} different forms of split samples. For each of the split samples, the model is fitted to
the n_c datasets of the calibration set S_c (Equation 3.59) to obtain \hat{\boldsymbol{\beta}}_{S_c,GLSR}. The datasets in the
validation set (which are essentially gauged catchments) are treated as if they were
ungauged. The fitted model can then predict the response vector \mathbf{y}_{S_v}:

\hat{\mathbf{y}}_{S_v} = \mathbf{X}_{S_v} \hat{\boldsymbol{\beta}}_{S_c,GLSR} \qquad (3.60)
The average squared prediction error (ASPE) over all the datasets in the validation set is:

\mathrm{ASPE}(S_v; \alpha) = \frac{1}{n_v} \sum_{S_v} \left( \mathbf{y}_{S_v} - \hat{\mathbf{y}}_{S_v} \right)^2 \qquad (3.61)
Therefore, let S be the set whose elements are all of the validation sets corresponding
to the \binom{n}{n_v} different forms of sample split. The cross validation criterion with n_v
datasets left out for validation is defined as:

V_{n_v}(\alpha) = \frac{1}{\binom{n}{n_v}} \sum_{S_v \in S} \mathrm{ASPE}(S_v; \alpha) \qquad (3.62)
where V_{n_v}(\alpha) is calculated for every \alpha. Equation 3.62 serves as an approximation of
MSEP(\alpha) in the situation of finite samples. Although LOO validation can select a model
with bias b = 0 as n \to \infty, it can however include unnecessary additional independent
variables in the model. In this case the selected model is not the
most parsimonious, and over-fitting can increase the uncertainty in estimation. For general
validation it has been proven, under the conditions n_c \to \infty and n_v/n \to 1 (Shao, 1993), that
the probability for validation (with n_v datasets left out for validation) to choose the model
with the best prediction ability tends to one. In this framework V_{n_v}(\alpha) (Equation 3.62) is
asymptotically consistent; however, computing V_{n_v} over all \binom{n}{n_v} splits becomes infeasible for
large n_v. In such situations, MCCV is an easy and effective procedure.
For a selected \alpha, the dataset is randomly split into two parts, S_c(i) (of size n_c) and S_v(i) (of size
n_v), and the procedure is repeated N times. The repeated MCCV criterion is defined as:

\mathrm{MCCV}_{n_v}(\alpha) = \frac{1}{N n_v} \sum_{i=1}^{N} \left( \mathbf{y}_{S_v(i)} - \hat{\mathbf{y}}_{S_v(i)} \right)^2 \qquad (3.63)
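The MCCV criterion can be sketched as follows for an OLSR fit (the GLSR case would replace the least squares fit with Equation 3.59); the function and argument names are illustrative:

```python
import numpy as np

def mccv(X, y, nv, N=500, rng=None):
    """Monte Carlo cross validation (Equation 3.63) for an OLSR model.

    Repeatedly splits the n sites into a calibration set of size nc = n - nv
    and a validation set of size nv, fits by least squares on the calibration
    set, and averages the squared prediction error on the validation set."""
    rng = rng or np.random.default_rng()
    n = len(y)
    nc = n - nv
    total = 0.0
    for _ in range(N):
        idx = rng.permutation(n)
        cal, val = idx[:nc], idx[nc:]
        beta, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)
        resid = y[val] - X[val] @ beta
        total += resid @ resid
    return total / (N * nv)
```

In a model selection setting, the criterion is evaluated for each candidate subset of predictors and the subset with the smallest MCCV value is selected.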
3.11.3 ESTIMATING MSEP
In hydrological regression, the estimate of MSEP is generally based on a finite, often very
small, dataset. Here we mainly consider using the LOO or MCCV methods to
estimate MSEP. As noted in Efron (1986), an estimate of MSEP using the observed data
tends to underestimate the true MSEP for new future observations, since the data
have been used twice, both to fit the model and to check its accuracy. The results obtained
are therefore at best an optimistic estimate of the model's true prediction error. For
Equation 3.58 the LOO validation criterion is:
V_1(\alpha_1^*) = \min_\alpha V_1(\alpha) \qquad (3.64)

where V_1(\alpha) is obtained by Equation 3.62 (with n_v = 1) and \alpha_1^* denotes the optimal model index in
Equation 3.64. In all cases it should be mentioned that MSEP depends on the size of the
calibration data set. MCCV can also be utilised to make the prediction. However, since
MCCV uses only n_c datasets for calibration, it is not appropriate to use
\mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) to estimate the MSEP of the model fitted with all n datasets if n_v is large.
Let \alpha^*_{n_v} denote the optimal model index in Equation 3.63. The expected difference between
\mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) and the mean squared error of prediction for the selected model
is:
E\left[ \mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) \right] - \mathrm{MSEP}(\alpha^*_{n_v}) = O\!\left( \frac{n_v}{n \, n_c} \right) \qquad (3.65)
If a large portion of the dataset is left out for validation, the difference in Equation 3.65 need not
be small. In such cases, \mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) might be a poor estimate of \mathrm{MSEP}(\alpha^*_{n_v}) (Burman,
1989). In order to obtain a slight improvement in the accuracy of estimation, a correction term
is added to \mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) (Burman, 1989), given by:
\mathrm{CMCCV}_{n_v}(\alpha^*_{n_v}) = \mathrm{MCCV}_{n_v}(\alpha^*_{n_v}) + \frac{1}{n} \left( \mathbf{y} - \mathbf{X}_{\alpha^*} \hat{\boldsymbol{\beta}}_{n,GLSR} \right)^2 - \frac{1}{N n} \sum_{i=1}^{N} \left( \mathbf{y} - \mathbf{X}_{\alpha^*} \hat{\boldsymbol{\beta}}_{S_c(i),GLSR} \right)^2 \qquad (3.66)
where \hat{\boldsymbol{\beta}}_{n,GLSR} in the second term is estimated based on all n catchments and \hat{\boldsymbol{\beta}}_{S_c(i),GLSR} in the
third term is estimated based on the n_c catchments in S_c(i) (i = 1, 2, \ldots, N). \mathrm{MCCV}_{n_v}(\alpha^*_{n_v})
indicates the average prediction ability of the model with n_c catchments and, as stated
above, it overestimates the MSEP of the model with n catchments. The second term in
Equation 3.66 is the average residual sum of squares of the model with n catchments. The
third term in Equation 3.66 is the average residual sum of squares and prediction error
of the model based on n_c catchments. The latter two terms combine the effects of the model
both with n_c and with n catchments.
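Burman's correction can be sketched for the OLSR case; the three accumulated quantities below correspond to the three terms of Equation 3.66, and the function name is illustrative:

```python
import numpy as np

def cmccv(X, y, nv, N=500, rng=None):
    """Sketch of Burman's (1989) corrected MCCV (Equation 3.66), OLSR case."""
    rng = rng or np.random.default_rng()
    n = len(y)
    nc = n - nv
    # Second term: residuals of the model fitted to all n catchments
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    r_full = y - X @ beta_full
    mccv_term, third_term = 0.0, 0.0
    for _ in range(N):
        idx = rng.permutation(n)
        cal, val = idx[:nc], idx[nc:]
        beta_c, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)
        r_val = y[val] - X[val] @ beta_c
        mccv_term += r_val @ r_val          # first term (MCCV)
        r_all = y - X @ beta_c              # split-fitted model on all n sites
        third_term += r_all @ r_all         # third term
    return (mccv_term / (N * nv)
            + r_full @ r_full / n
            - third_term / (N * n))
```

The correction subtracts the average full-sample residual sum of squares of the split-fitted models and adds back that of the full-sample fit, shrinking the pessimism of MCCV towards the n-catchment model.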
3.11.4 APPLICATION – USING SIMULATED DATA
Two Monte Carlo experiments are reported in this thesis, comparing both OLSR and
GLSR using LOO and MCCV. In the Monte Carlo simulation the following model is
considered:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i \qquad (3.67)
where y_i is the dependent variable, taken as the 20 year ARI flood quantile; the simulated
values are assumed to be independent normally distributed random variables estimated at i
= 1, 2, \ldots, 50, representing 50 stations (8 sites with 50 years of data, 8 sites with 40
years of data, 12 sites with 32 years, 12 sites with 25 years of data and 10 sites with 15
years of data; this corresponds to an average record length of 33 years, which is the average
record length for most Australian catchments). Based on a previous analysis of Australian
data, in the Monte Carlo simulation for the GLSR model the total error
\varepsilon_i was generated by taking \sigma^2_\delta to represent low to high random errors in the range of 0 to 1, i.e. N(0, \sigma_\delta) where \sigma_\delta
is 0.25 and 0.95. For the GLSR estimator we also need an estimate of the diagonals of
\Sigma(\hat{\mathbf{y}}) (the covariance matrix of sampling errors); for normally distributed y_i (see Equations 5
and 6, p. 1422, of Stedinger and Tasker, 1985) are adopted to generate the sampling
variance for each site in \Sigma(\hat{\mathbf{y}}). Furthermore, to estimate the off-diagonal elements of \Sigma(\hat{\mathbf{y}})
we also require estimates of the cross correlations (\rho_{ij}) between concurrent records (i.e.
\hat{\rho}(\hat{y}_i, \hat{y}_j)) in the region. In the Monte Carlo simulation we generated cross correlated data
for \rho_{ij} = 0.30 (modest constant cross correlation between sites) and \rho_{ij} = 0.70 (medium to
high constant correlation between sites).
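Generating errors with a constant pairwise cross correlation can be sketched with a Cholesky factor of the correlation matrix; the function below is an illustrative construction, not the exact generator used in the thesis:

```python
import numpy as np

def correlated_errors(n_sites, n_years, rho, sd, rng):
    """Draw n_years rows of n_sites errors with constant pairwise
    correlation rho and standard deviation sd."""
    corr = np.full((n_sites, n_sites), rho)
    np.fill_diagonal(corr, 1.0)            # unit diagonal
    chol = np.linalg.cholesky(corr)        # lower-triangular factor
    z = rng.standard_normal((n_years, n_sites))
    return sd * (z @ chol.T)               # rows now have correlation rho
```

Setting `rho` to 0.30 or 0.70 reproduces the two inter-site correlation scenarios described above.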
For OLSR, the errors \varepsilon_i are drawn from N(0, \sigma), where \sigma is taken as 0.2 and 1,
representing low level (smaller spread) and high level (larger spread) random errors
respectively. Here x_{ik} is the ith value of the kth variable x_k, and the values of
x_{ik} (k = 1, 2, 3; i = 1, 2, \ldots, 50) represent three catchment descriptors (i.e. independent variables)
sampled randomly from uniform and normal distributions: U[5, 1000] for
x_1, N(x_1, 0.21) for x_2 and U[2, 20] for x_3. The logarithms (base 10) of these descriptor
variables are used in the simulation. To make the simulation more meaningful we also
explore the influence of correlated descriptors (i.e. collinearity), which is very common in
hydrological regression. Collinearity is explored using LOO and MCCV with both
OLSR and GLSR. In this study, we allow x_2 to have a high degree of collinearity with x_1
(the correlation coefficient between x_1 and x_2 is taken to be 0.90, see above). In
the Monte Carlo simulation, all different combinations of x_1, x_2 and x_3 are considered and
the model with the best prediction ability is selected. The resulting regional models for OLSR
and GLSR are respectively:
\hat{y}_{OLSR,i} = 2.96 + 0.1402\, x_{1i} \qquad (3.68)

\hat{y}_{GLSR,i} = 2.95 + 0.1517\, x_{1i} \qquad (3.69)
The size of the validation sets is taken to be n_v = 15, 20, 25, \ldots, 45, and the number of
simulations is 500. In order to assess the obtained model, a further 2,000 datasets are
generated using the above procedure for the purpose of prediction. These datasets are used
to calculate the MSEP for the models selected by LOO, MCCV and the assumed true
model.
3.11.5 OBSERVED REGIONAL FLOOD DATA FROM NSW, AUSTRALIA
A total of 96 unregulated rural catchments are selected from New South Wales (see more
details in Chapter 4). The geographical distributions of these catchments are shown in
Figure 5. The catchment areas are considered to be small to medium sized (I. E. Aust.,
1987) ranging from 3 to 1000 km2 (mean: 353 km2 and median: 267 km2). The annual
maximum flood series record lengths range from 25 to 75 years (mean: 37 years, median:
34 years and standard deviation: 11.4 years). More information regarding the preparation of
the streamflow data can be found in Haddad et al. (2010a) and also Chapter 4 of this thesis.
The annual maximum flood series are assumed to follow the LP3 distribution for two
reasons: (i) the LP3 distribution is currently the recommended at-site flood frequency
probability model in ARR (I. E. Aust., 1987); and (ii) it has shown consistently better
results in past studies for Australian catchments (see e.g. Haddad et al., 2010a; Haddad et
al., 2012a; Haddad and Rahman, 2012). The LP3 distribution is fitted using a Bayesian
parameter fitting procedure (Kuczera, 1999a) for quantiles of ARIs of 10 and 100 years.
These two ARIs are chosen because they cover both the high and low sides of the flood
distribution.
To apply the GLSR to regionalise the flood quantiles, the sampling covariance matrix \Sigma(\hat{\mathbf{y}})
of the LP3 distribution is required. Tasker and Stedinger (1989) and Griffis and Stedinger
(2007) (p. 84, Equation 4; see also Equation 3.16 in this thesis) provide an approximate
estimator of the components of the \Sigma(\hat{\mathbf{y}}) matrix for the LP3 distribution. It should be mentioned
here that other distributions such as the GEV could have been adopted; however, this is unlikely to
affect the outcomes of the analysis. Furthermore, the LP3 distribution has generally been found to
outperform the GEV distribution for eastern Australia (Zaman et al., 2012). The
skew and standard deviation in the \Sigma(\hat{\mathbf{y}}) matrix are subject to estimation uncertainty. In
this study, to avoid correlation between the residuals and the fitted quantiles, the following
procedures are adopted:
(i) the inter-site correlation between the concurrent annual maximum flood series
(\rho_{ij}) is estimated as a function of the distance between sites i and j;
(ii) the standard deviations (of the logarithms of annual maximum flood series) \sigma_i
and \sigma_j are estimated using a separate OLSR with the predictor variables
used in the study (see below); and
(iii) the regional skew (of the logarithms of annual maximum flood series) is used in
place of the population skew, as suggested by Tasker and Stedinger (1989).
This analysis uses the regional estimates of the standard deviation and
skew obtained from the GLSR. Detailed information on the covariance matrices
associated with the standard deviation and skew can be found in Reis et al.
(2005), Griffis and Stedinger (2007) and Equations 3.17 and 3.18 – 3.22.
Twelve climatic and catchment characteristics variables were selected (more information
regarding the extraction and preparation of the catchment characteristics can be found in
Chapter 4 of this thesis). The predictor variables were log-transformed (base 10) and
centered around the mean for the regression analysis.
3.12 SUMMARY
A number of statistical techniques and formulations to be used in this thesis have been
presented in this chapter.
At the onset of this chapter, fitting the LP3 distribution to the observed flood data using a
Bayesian parameter fitting procedure has been presented. The GLSR procedure has then
been discussed both in its classical application and in hydrologic regression context to
derive regional regression equations relating flood quantiles to catchment and climatic
characteristics using both a QRT and PRT framework. The Bayesian GLSR (BGLSR)
regression procedure was discussed in more detail. The setting up of the residual error
covariance matrices with the BGLSR approach has also been discussed. This chapter has
also discussed the formation of regions in RFFA which has included both the fixed and
region of influence approaches.
The second part of this chapter discussed the mathematical formulations used in the model
validation in the context of hydrologic regression analysis using OLSR and GLSR. The
statistical framework for the numerical experimentation and practical application
demonstrating the use of LOO and MCCV in hydrologic quantile regression analysis has
also been discussed.
The next chapter will discuss the study areas and the different aspects of streamflow and
catchment characteristics data collation and preparation.
CHAPTER 4: STUDY AREA AND PREPARATION OF
STREAMFLOW AND CATCHMENT CHARACTERISTICS DATA
4.1 GENERAL
The assembly and preparation of streamflow data is an important step in any regional flood
frequency analysis (RFFA) study. This chapter describes various aspects of the streamflow
data collation adopted for this work e.g. selection of the study area, selection of stream
gauging sites, checking annual maximum streamflow data, filling gaps in the streamflow
data series, checking rating curve extrapolation errors associated with the streamflow data
series, checking for outliers in the data series and testing for any significant trends that
could undermine the purpose of flood frequency analysis.
Because this study is primarily concerned with developing regional prediction equations for
design flood estimation using both a quantile and a parameter regression technique, an
elementary step in a regional study such as this involves obtaining both climatic and
catchment characteristics data. Identifying the most relevant catchment characteristics is
difficult as there is no objective method for doing so; moreover, many catchment characteristics
are highly correlated, so including many of them in the model can cause problems
with the statistical analysis (such as introducing multicollinearity) without providing
any extra useful information.
Rahman (1997) indicated that there is no objective method for selecting catchment
characteristics; thus an initial selection of candidate characteristics should be based on an
evaluation of the success of catchment characteristics used in past studies. Rahman (1997)
considered in detail all possible climatic/catchment characteristics (referred to as catchment
characteristics henceforth) from over 20 previous studies to develop a reasonable starting
point.
Nevertheless, no general inference about the significance of a particular catchment
characteristic can be made from the fact that an investigator has found it to be significant,
since in a regional study such as this the dominant characteristics may vary from region to
region.
In the second part of this chapter, the catchment characteristics to be used in this thesis are
selected with the aim of developing a working database of catchment characteristics.
Initially the selection of candidate catchment characteristics is described in sufficient detail
and aspects of data collation/collection are presented later.
4.1.1 PUBLICATIONS
A Journal paper (ERA, rank B) has been published on the materials presented in this
chapter. This journal paper is given in Appendix A. The following is the reference of the
paper.
Haddad, K., Rahman, A., Weinmann, P.E., Kuczera, G. and Ball, J.E. (2010a).
Streamflow data preparation for regional flood frequency analysis: Lessons from south-east
Australia. Australian Journal of Water Resources, 14 (1), 17-32.
4.2 STUDY AREA
For this study, the Australian continent is selected as the study area. For flood quantile
estimation in the range of 2 – 100 years average recurrence interval (ARI), the quantile
and parameter regression techniques (QRT and PRT) in Bayesian generalised least
squares regression (BGLSR), fixed region and region of influence (ROI) frameworks are applied
in the states of Queensland (QLD), New South Wales (NSW), Victoria (VIC) and
Tasmania (TAS). The model validation case study makes use of the data from NSW, while
for the large flood analysis using the large flood regionalisation model (LFRM), 626
stations are used from all over the Australian continent, excluding the arid and semi arid
regions. The selected study area is shown in Figure 5.
Figure 5 Plot of the selected study area (i.e. NSW, VIC, QLD and TAS)
4.3 SELECTION OF CANDIDATE CATCHMENTS
The following factors and criteria were considered in making the initial selection of the
study catchments.
Catchment area: The proposed regionalisation study aims at developing prediction
equations for flood estimation in small to medium sized ungauged catchments. Since the
flood frequency behaviour of large catchments has been shown to significantly differ from
smaller catchments, the proposed method should be based on small to medium sized
catchments. ARR (I. E Aust, 1987) suggests an upper limit of 1000 km2 for small to
medium sized catchments, which seems to be reasonable and is adopted here. For larger
catchments, the flood frequency curves are generally flatter compared to those of smaller
catchments. Since the focus of a RFFA technique is design flood estimation for small ungauged
catchments, the use of very large catchments in the development of RFFA techniques is not
justified as per ARR (I. E. Aust., 1987).
Record length: The streamflow record at a stream gauging location should be long enough
to characterise the underlying probability distribution with reasonable accuracy. In most
practical situations, streamflow records at many gauging stations in a given study area are
not long enough and hence a balancing act is required between obtaining a sufficient
number of stations (which captures greater spatial information) and a reasonably long
record length (which enhances accuracy of at-site flood quantile estimates. Selection of a
cut-off record length appears to be difficult as this can affect the total number of stations
available in a study area. However for this study, the stations having a minimum of 10
years of annual instantaneous maximum flow records are selected initially as ‘candidate
stations’. This is because that sample size smaller than 10 years may not be useful in RFFA
in Australia as this often suffers from long periods of droughts and flood quantile estimates
with smaller record lengths this may provide biased results. Here 10 years is the cut-off
record length; however, the adopted threshold was 24 years for most of the Australian
states as noted later in this chapter.
Regulation: Ideally, the selected streams should be unregulated, since major regulation
affects the rainfall-runoff relationship significantly (storage effects). Streams with minor
regulation, such as small farm dams, may be included because this type of regulation is
unlikely to have a significant effect on annual floods. Gauging stations subject to major
regulation are not included.
Urbanisation: Urbanisation can affect flood behaviour dramatically (e.g. decreased
infiltration losses and increased flow velocity). Therefore, catchments with more than 10%
of the area affected by urbanisation are not included in the study.
Landuse change: Major landuse changes, such as the clearing of forests or changing
agricultural practices modify the flood generation mechanisms and make streamflow
records heterogeneous over the period of record length. Catchments which have undergone
major land use changes over the period of streamflow records are not included in the data
set.
Quality of data: Most of the statistical analyses of flood flow data assume that the available
data are essentially error free; at some stations this assumption may be grossly violated.
Stations graded as 'poor quality' or with specific comments by the gauging authority
regarding the quality of the data were assessed in greater detail; if they were deemed 'low
quality' they were excluded. For example, if there were many missing data points, or the gauging
station location had been shifted a long way from the previous location, the station was excluded.
4.4 STREAMFLOW DATA PREPARATION
4.4.1 FILLING MISSING RECORDS IN ANNUAL MAXIMUM FLOOD SERIES
Missing observations in streamflow records at gauging locations are very common and one
of the elementary steps in any hydrological data analysis is to make decisions about dealing
with these missing data points. Missing records in the annual maximum flood series are in-
filled where the extra data points can be estimated with sufficient accuracy to contribute
additional information rather than ‘noise’. For this study, one of the following methods (a
or b) is applied, as documented in Rahman (1997) and Haddad et al. (2010a).
(a) Comparison of the monthly instantaneous maximum (IM) data with monthly
maximum mean daily (MMD) data at the same station for years with data gaps. If a
missing month of instantaneous maximum flow corresponds to a month of very low
maximum mean daily flow, then that is taken to indicate that the annual maximum
did not occur during that missing month.
(b) Application of a linear regression between the annual maximum mean daily flow
series and the annual instantaneous maximum series of the same station. Regression
equations developed are used for filling gaps in the IM record, but not to extend the
overall period of record of instantaneous flow data.
For in-filling the gaps, Method (a) is preferred over Method (b), as it is more directly
based on observed data for the missing month and involves fewer assumptions.
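Method (b) can be sketched as a simple least squares fit; the array names and the use of NaN to mark missing instantaneous maxima are illustrative assumptions, not the exact implementation used in this study:

```python
import numpy as np

def fill_im_gaps(ann_max_daily, ann_inst_max):
    """Sketch of Method (b): regress the annual instantaneous maximum (IM)
    series on the annual maximum mean daily (MMD) series using years where
    both exist, then estimate IM for years where only MMD is available.
    Arrays are aligned by year; np.nan marks missing IM values."""
    have = ~np.isnan(ann_inst_max)
    # Fit IM = intercept + slope * MMD on the jointly observed years
    slope, intercept = np.polyfit(ann_max_daily[have], ann_inst_max[have], 1)
    filled = ann_inst_max.copy()
    gaps = np.isnan(ann_inst_max)
    filled[gaps] = intercept + slope * ann_max_daily[gaps]
    return filled
```

Consistent with the text, such a regression is used only to fill gaps within the IM record, not to extend the overall period of record.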
4.4.2 TREND ANALYSIS
Hydrological data for any flood frequency analysis, be it at-site or regional, should be
stationary, consistent and homogeneous. The annual maximum flow series should not show
any time trend to satisfy the basic assumption of stationarity with traditional flood
frequency analyses methods. Thus, in this study, a trend analysis is carried out where
possible to identify stations showing significant trend and the stations which do not show
any trend are included in the primary data set for each Australian state.
Two tests are initially applied to detect time trends: the Mann–Kendall test (Kendall, 1970)
and the distribution-free CUSUM test (McGilchrist and Woodyer, 1975); both tests are
applied at the 5% significance level. The Mann-Kendall test is concerned with testing
whether there is an increase or decrease in a time series, whereas the CUSUM test
concentrates on whether the mean values in two parts of a record are significantly different.
As a useful guide and in addition to the trend tests, a simple time series plot and a
cumulative flow graph of the station are also used to detect shifts in the annual maximum
flood data.
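A minimal sketch of the Mann–Kendall test is given below; the tie correction is omitted for brevity, so this is not the exact implementation used in this study:

```python
import numpy as np
from scipy.stats import norm

def mann_kendall(series, alpha=0.05):
    """Sketch of the Mann-Kendall trend test (no tie correction).

    Returns the standardised statistic Z and whether a trend is detected
    at the given significance level."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    # S statistic: sum of signs over all pairs (j > i)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0   # variance in the no-ties case
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    return z, abs(z) > norm.ppf(1 - alpha / 2)
```

A significantly positive Z indicates an increasing trend and a significantly negative Z a decreasing trend; stations returning a detected trend would be excluded from the primary data set.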
4.4.3 RATING CURVE ERROR AND IDENTIFICATION
Most stream gauging authorities establish a network of streamflow gauging stations to
obtain continuous streamflow data. However, in most cases, these do not measure the
actual discharge directly. Rather it is the stage that is recorded, and subsequently
transformed to discharge by means of an estimated rating curve, which is constructed in
most cases by correlating measurements of discharge with the corresponding observations
of stage. However, the range of observed flood levels generally exceeds the range of
‘measured’ flows, thus requiring different degrees of extrapolation of well established
rating curves. Thus, most of the discharges calculated by rating curve are subject to
uncertainty. Different methods of rating curve extrapolation are associated with a range of
assumptions, from simple extension of fitted regression lines to hydraulic analysis methods
requiring additional data. The magnitude of rating curve extrapolation errors depends on
the stream and flood plain conditions near the gauging station, the strengths of the
assumptions made in extrapolation, and the degree of extrapolation beyond the range of
measured flows (Kuczera, 1999a).
Any rating curve extrapolation error is directly transferred into the largest observations in
the annual maximum flood series, and use of these extrapolated data in flood frequency
analysis can result in grossly inaccurate flood estimates, particularly for higher ARIs.
There are several studies that have examined the uncertainty of a single discharge estimate
due to rating curve variability using a regression-based approach, e.g., Venetis (1970),
Dymond and Christian (1982) and Reitan and Petersen-Øverleir (2008). On the other hand,
the impact of rating curve error and imprecision in the estimation of the flood quantile has
received less attention in hydrological literature (Petersen-Øverleir and Reitan, 2009).
Potter and Walker (1981), Rosso (1985), Shuzheng and Yinbo (1987) and Kuczera (1992,
1996) provided some insights into the problem by analysing a multiplicative error model.
Kuczera (1996) and Reis and Stedinger (2005) adopted a multiplicative error model in a
Bayesian framework to deal with rating curve error. From these studies, the main
conclusion to be drawn is that multiplicative measurement error introduces bias into
estimated flood quantiles.
In this study, the stations having annual maximum flood data associated with a high degree
of rating curve extrapolation are identified by introducing a 'rating ratio' (RR). The annual
maximum flood series data point for each year (estimated flow Q_E) is divided by the
maximum measured flow (Q_M) for that station to define the rating ratio (see Equation 4.1).
The rating ratio is thus based on the highest measured flow over the total period of
record, while the annual maximum flows are based on the gauging authorities' best estimate
of the rating curve applicable at the time of each flow event.
Rating Ratio (RR) = QE / QM        (4.1)
If the RR value is below or near 1, the corresponding annual maximum flow may be
considered to be free of rating curve extrapolation error. However, a RR value well above 1
indicates a rating curve error that can cause notable errors in flood frequency analysis.
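Equation 4.1 can be applied in bulk as sketched below; the flow values and the helper name are hypothetical:

```python
def rating_ratios(annual_max_flows, max_measured_flow):
    """Rating ratio RR = QE / QM (Equation 4.1) for each annual maximum
    flow QE, given the station's highest measured flow QM."""
    return [q / max_measured_flow for q in annual_max_flows]

# Hypothetical annual maxima; the station's highest gauged flow is 100.
rr = rating_ratios([120.0, 80.0, 450.0, 95.0], max_measured_flow=100.0)
suspect = [r > 1.0 for r in rr]  # flows likely affected by extrapolation
```

Points with RR well above 1 (here the third value, RR = 4.5) are the ones flagged as subject to possible rating curve extrapolation error.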
As an example, for Station 222202, there are 11 data points with RR values greater than 1
(27% of total data points) and the maximum value of RR is 5.5 (Figure 6). This large
degree of rating curve extrapolation is likely to affect flood frequency estimates at this
station, especially the higher ARI floods such as Q50 and Q100, unless appropriate measures
are taken. The application of RR is discussed further in the latter part of this chapter.
For any RFFA, a large number of stations with reasonably long record lengths are required
and hence a trade-off needs to be made between an extensive data set that includes stations
with very large RR values (and thus lower accuracy) and a smaller data set with RR values
restricted to what could be considered to be a “reasonable upper limit” of rating curve
errors.
A working method to decide on a cut-off RR value is determined by looking at the average
and the maximum RR values for each station in a region/state. Based on the results from
VIC and NSW, the RR values found to represent a reasonable compromise between
accuracy at individual sites and total size of the regional data set are an average of 4 and a
maximum of 20.
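This screen can be sketched as follows; the helper name and the stations' RR series are hypothetical:

```python
def passes_rr_screen(rr_series, mean_limit=4.0, max_limit=20.0):
    """Keep a station only if its average and maximum rating ratios fall
    within the adopted limits (average of 4 and maximum of 20)."""
    return (sum(rr_series) / len(rr_series) <= mean_limit
            and max(rr_series) <= max_limit)

# Hypothetical RR series for two stations
stations = {
    "A": [0.8, 1.1, 2.0, 3.5],   # within both limits: retained
    "B": [1.0, 2.0, 25.0, 3.0],  # maximum RR exceeds 20: rejected
}
kept = [s for s, rr in stations.items() if passes_rr_screen(rr)]
```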
Figure 6 Plot of rating ratios (RR = QE/QM) for station 222202; data points with RR > 1 are subject to possible rating curve errors
4.4.4 SENSITIVITY ANALYSIS AND IMPACT OF RATING CURVE EXTRAPOLATION ON FLOOD QUANTILE ESTIMATES
Error arising from rating curve extension is typically smooth and can therefore introduce
systematic over- or under-estimation of the true discharge. The coefficient of variation of
the rating curve extension error is not well known; however, Potter and Walker (1981)
suggest it could be as high as 30% in poor situations, such as in the extrapolation zone (see
Figure 7). In the interpolation zone, however, where the rating curve is well defined by
discharge-stage measurements, the error coefficient of variation
would be small, say 1% to 5% (Kuczera, 1996 and Reis and Stedinger, 2005). As noted by
Kuczera (1999a), there are two cases in which smooth rating curve extension can introduce
systematic error. Firstly an indirect estimate can be made for large floods well beyond the
measured flow; it is this estimate that is then subject to extreme uncertainty. In such cases
estimates that are well below the true discharge can cause significant underestimation in
flood frequency analysis and vice versa. Rating curves are also extended by the slope-
conveyance method, which mainly relies on extrapolation of gauged estimates of the
friction slope so that this slope converges to a constant value. This can cause considerable
systematic error which is difficult to quantify as compared to the log-log extrapolation. As
it is the most commonly employed approach for rating-curve extrapolation, log-log
extrapolation is explored in this study.
In log-log extrapolation, the systematic error can be seen as the likely divergence from the
true rating as the discharge increases. Thus, as the rating curve is extended from the true
rating curve an extension zone is introduced. This extension zone depends on the distance
from the anchor point and not from the origin. In this case the systematic error is
incremental, as it originates from the anchor point. In this study, to implement the concept
of systematic rating curve error, the flow that is closest to RR = 1 is used as the “anchor
point” in the FLIKE rating curve error model (Kuczera 1999b). The assumption is then
made that there is little error (1 to 5%) up to the anchor point (Figure 7). All discharge
estimates with RRs > 1 (this means the true flood discharge exceeds the anchor value) have
systematic error and deviate away from the anchor point. The application of the RR using a
cut-off point value is introduced in this study to remove stations which are likely to be
associated with high rating-curve-related errors. Further discussion is presented later in
this chapter, where the impacts of different rating curve errors and RR values on flood
quantile estimates are examined, to demonstrate the importance of accurate flood discharge
estimates.
Figure 7 Rating curve extension error
4.4.5 TESTS FOR OUTLIERS
In a set of annual maximum flood series there is a possibility of outliers being present. An
outlier is an observation that deviates significantly from the bulk of the data, which may be
due to errors in data collection or recording, or due to natural causes.
In this study, the Grubbs and Beck (1972) method is adopted for detecting high and low
outliers. This method was recommended in Bulletin 17B by the United States Water
Resources Council after large-scale testing of a wide variety of procedures. The method is
based on determining high outlier and low outlier thresholds by applying a one-sided 10%
significance level test that considers the sample size. The test was developed by Grubbs
and Beck (1972) for detecting single outliers from a normal distribution but (when applied
to the logs of a flood data series) has been shown to be also applicable to the log Pearson
type 3 (LP3) distribution. The method is simple to use and has been widely applied in
North America (Ng et al., 2007). Its application to dealing with low outliers is
straightforward. However, it should be noted here that special precaution is needed to treat
any detected high outlier, given that there is a 10% chance of the null hypothesis of no
outliers having been wrongly rejected. If not caused by data error, the 'high outlier' data
point contains very useful information regarding the frequency of large floods.
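The low-outlier threshold can be sketched with the Bulletin 17B form of the test; the K_N approximation below (K_N = -0.9043 + 3.345*sqrt(log10 N) - 0.4046*log10 N, for the one-sided 10% significance level) is the commonly cited fit to the tabulated statistics, and the 20-year series is hypothetical, not the study's actual data:

```python
import math

def grubbs_beck_low_threshold(flows):
    """One-sided 10% Grubbs-Beck low-outlier threshold, applied once to
    the base-10 logs of the annual maximum flows (Bulletin 17B style)."""
    n = len(flows)
    logs = [math.log10(q) for q in flows]
    m = sum(logs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in logs) / (n - 1))
    k_n = -0.9043 + 3.345 * math.sqrt(math.log10(n)) - 0.4046 * math.log10(n)
    return 10 ** (m - k_n * sd)

# Hypothetical 20-year series with one drought-year value far below the rest
flows = [210, 340, 190, 400, 275, 310, 150, 500, 260, 330,
         220, 290, 180, 360, 240, 310, 270, 200, 3, 420]
threshold = grubbs_beck_low_threshold(flows)
low_outliers = [q for q in flows if q < threshold]
```

The drought-year value of 3 falls below the computed threshold and would be treated as a low outlier (censored flow) in the subsequent frequency analysis.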
4.5 RESULTS OF STREAMFLOW DATA PREPARATION PROCESS
The methods described in section 4.4 are applied to gauged flood data across the entire
Australian continent. For the sake of simplicity, this section presents detailed results for
VIC and NSW only; further results are summarised in Rahman et al. (2009 and 2011a).
4.5.1 DATA PREPARATION FOR VICTORIA
Based on the selection criteria presented in section 4.3, a total of 415 stations are initially
selected as candidates from VIC, each having a minimum of 10 years of streamflow record.
For in-filling the gaps in the annual maximum flood series, Method (a) is preferred over
Method (b) (see section 4.4.1 for a description of these methods). The following points
summarise the results of the in-filling of the annual maximum flood series data: (i) 273 data
points from 187 stations are in-filled by Method (a); (ii) 60 data points from 44 stations are
in-filled by Method (b); (iii) regression equations used in gap filling have high R2 values
(range 0.82 – 0.99, mean = 0.93 and SD = 0.041); and (iv) 10% of stations do not have any
missing records.
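A minimal sketch of the regression-based in-fill idea, assuming Method (a) regresses the target station's annual maxima on concurrent values at a nearby station (all values hypothetical, pure least squares by hand):

```python
from statistics import mean

# Hypothetical concurrent annual maxima (ML/d) at the target station and
# a nearby donor station with an overlapping record.
target = [120.0, 300.0, 80.0, 210.0, 150.0]
donor = [100.0, 260.0, 70.0, 180.0, 130.0]

mx, my = mean(donor), mean(target)
sxy = sum((x - mx) * (y - my) for x, y in zip(donor, target))
sxx = sum((x - mx) ** 2 for x in donor)
syy = sum((y - my) ** 2 for y in target)

slope = sxy / sxx
intercept = my - slope * mx
r2 = sxy ** 2 / (sxx * syy)  # gap-filling equations in the study: R2 0.82-0.99

# Predict the target's annual maximum in a year where only the donor recorded
donor_in_gap_year = 200.0
infilled = slope * donor_in_gap_year + intercept
```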
After in-filling the gaps, the stations are checked for possible trends. Initially, the Mann-
Kendall test is applied to the annual maximum flood series of the candidate stations. The
results revealed that some 20% of the candidate stations exhibit a decreasing trend, a
somewhat surprising result. However, the record lengths of many of these stations are less
than 20 years, and, moreover, south-east Australia has experienced a severe drought since
the mid-1990s. To explore this issue further, time series plots and mass curves are
prepared for the stations showing a trend, to detect visually whether significant changes in
slope can be identified. Figure 8 (a) presents the results for Station 230210, which shows a
noticeable decrease in annual maximum flood data from the late 1980s, supporting the
results from the Mann-Kendall test. The CUSUM test produced similar results (see Figure
8 (b)), namely a downward shift in the mean from 1995 onwards.
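The Mann-Kendall test applied above can be sketched as follows (no correction for ties; the annual maximum series is hypothetical, mimicking a post-1990s decline):

```python
import math

def mann_kendall(x, z_crit=1.645):
    """Mann-Kendall trend test (no tie correction). Returns the S statistic,
    the standardised Z value and a trend label."""
    n = len(x)
    # S = sum over all pairs i < j of sign(x[j] - x[i])
    s = sum((x[j] > x[i]) - (x[j] < x[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    if z <= -z_crit:
        trend = "decreasing"
    elif z >= z_crit:
        trend = "increasing"
    else:
        trend = "no trend"
    return s, z, trend

# Hypothetical annual maxima declining in the later part of the record
series = [900, 700, 1100, 950, 820, 640, 400, 310, 250, 180, 150, 120]
s, z, trend = mann_kendall(series)
```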
These results suggest that flood data at many stations are not independently and identically
distributed from year to year. Thus there needs to be caution applied when using short
records in estimating long term flood risks. The fact that data starting in the 1990s
exhibited a significant downward trend for many stations in VIC makes the inclusion of
stations with short records in RFFA questionable. Most RFFA methods can compensate for
sampling variability but not for bias introduced by a drought-induced systematic downward
trend in a short record.
To overcome this problem, the introduction of a longer cut-off record length appears to be
appropriate. However, the selection of a cut-off record length involves a trade-off between
spatial coverage and bias. It is judged that a cut-off record length of 25 years is adequate
for the purpose of this study. Although this has removed more than half of the candidate
stations from VIC, the remaining stations would be less affected by bias and thus would
yield more representative RFFA assessments of long-term flood risk. The number of
eligible stations after the introduction of a cut-off length of 25 years dropped to 144, which
is only 35% of the initially selected 415 stations. This shows that the useful data set for
RFFA in a given region is likely to be substantially smaller than the primary data set.
Figure 8 (a) Time series plot showing significant trends after 1995 and (b) CUSUM test plot showing significant trends after 1995 for Station 230210. Here Vk is the CUSUM test statistic defined in McGilchrist and Woodyer (1975)
In the remaining data set of 144 stations, many had rating ratios (RR) considerably greater
than 1. From the histogram of RR values shown in Figure 9 it can be seen that 90% of the
RR values for all the recorded annual maxima lie between 1 and 20. A RR value
significantly greater than 1 could magnify the errors in flood frequency quantile estimates
but, on the other hand, rejecting all stations with a RR greater than one would reduce the
number of stations below the minimum required for a meaningful RFFA. Thus, it is
decided that a cut-off RR value of 20 would be reasonable, which has reduced the eligible
number of stations from 144 to 131 for VIC. The impacts of RR values on flood quantile
estimates are presented in section 4.5.3.
Figure 9 Histogram of rating ratios of annual maximum flood data in Victoria (stations with record lengths > 25 years); 90% of the rating ratios lie between 1 and 20
The results of the outlier detection procedure are summarised here: (a) Some 43% of the
stations are found to have low outliers. The maximum number of low outliers detected in a
data series is 5, and low outliers never exceed 19% of the total number of data points in a
series. (b) Most of the detected low outliers occur for stations which are located in low
rainfall areas, especially in the western part of VIC. (c) 31% of low outliers occurred in the
years 1982 and 1967. Severe droughts occurred during these years, with the maximum
annual flows in many rivers being baseflow rather than a flood. Similar results were
reported by Rahman
(1997). (d) 55% of the stations do not show any outliers. Even the values in the drought
years of 1967 and 1982 are not low enough to be treated as low outliers. The locations of
most of these stations are in the south-eastern part of Victoria. (e) Only 1 station shows a
high outlier. The detected low outliers are treated as censored flows in flood frequency
analysis using FLIKE (that is, the information that there is no flood in that year is taken
into account).
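The censored-flow treatment can be illustrated with a stand-in likelihood: an observed year contributes a density value, while a censored (low-outlier) year contributes only the probability of falling below the threshold. A normal distribution in log space is used here for brevity in place of the LP3 that FLIKE actually fits; all numbers are hypothetical:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_loglik(mu, sigma, observed_logs, n_censored, log_threshold):
    """Log-likelihood with censoring: observed log-flows contribute the
    density, each censored year contributes P(logQ < threshold)."""
    ll = sum(math.log(norm_pdf((x - mu) / sigma) / sigma)
             for x in observed_logs)
    ll += n_censored * math.log(norm_cdf((log_threshold - mu) / sigma))
    return ll

obs = [1.8, 2.1, 2.3]  # hypothetical log10 annual maxima above the threshold
with_censoring = censored_loglik(2.0, 0.3, obs, n_censored=2, log_threshold=1.5)
without = censored_loglik(2.0, 0.3, obs, n_censored=0, log_threshold=1.5)
```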
The final VIC database contains 131 stations whose record lengths range from 25 to 52
years (mean and median: 32 years and standard deviation: 5 years). Some 87% of the
stations have record lengths in the range 25-35 years, 8% of the stations in the range 35-45
years and 5% of the stations in the range 50-55 years. The catchment areas range from 3 to
997 km2 (mean: 321 km2 and median: 289 km2). Some 15 catchments (11%) are in the
range of 3 to 50 km2, 11 catchments (8%) are in the range of 51 to 100 km2, 78 catchments
(60%) are in the range of 101 to 500 km2; and 27 catchments (21%) are in the range of 501
to 997 km2. The histogram of streamflow record lengths of the 131 stations is shown in
Figure 10. The distribution of catchment areas is shown in Figure 11. The geographical
distribution of these stations is shown in Figure 16, which shows that there is no station in
north-western VIC that has passed the selection criteria. This region is indeed characterised
by very low runoff and ephemeral streams.
Figure 10 Distributions of streamflow record lengths of the selected 131 stations from Victoria
Figure 11 Distributions of catchment areas of the 131 catchments from Victoria
4.5.2 DATA PREPARATION FOR NSW AND ACT
Initially, a total of 635 stations are selected from NSW and the Australian Capital Territory
(ACT). After in-filling the gaps and using the selection criteria discussed in section 4.3,
only 294 stations are retained with a minimum of 10 years of annual maximum flood data.
The Mann-Kendall test, time series plot inspection and CUSUM test resulted in some 11%
of the stations (31 stations) being identified as having a decreasing trend, generally after
1990. As for Victoria, a cut-off record length of 25 years is adopted, which has reduced
the number of eligible stations to 106, only 17% of the initially selected 635 stations.
In the remaining data set of 106 stations from NSW, many had RR values considerably
greater than 1 – see Figure 12. As for the VIC data, a cut-off RR value of 20 is adopted,
which has reduced the eligible number of stations from 106 to 96.
Figure 12 Histogram of rating ratios for 106 stations from NSW; over 95% of the rating ratios lie between 1 and 20
Some 40% of the stations from NSW and ACT are found to have low outliers. The
maximum number of low outliers detected in a data series is 9, and low outliers never
exceed 21% of the total number of data points in a series. Most of these detected low outliers occur for
stations located in low rainfall areas, especially in the western parts of NSW. Some 31% of
low outliers occur in the years 1967, 1982 and 1994. About 47% of the stations do not
show any outliers. Only 5 stations have shown a high outlier. The record lengths of the 96
stations range from 25 to 74 years (mean: 34 years, median: 31 years and standard
deviation: 10 years). Some 77% of the stations have record lengths in the range 25-35
years, 18% in the range 40-55 years, and 5% in the range 60-75 years.
The catchment areas range from 8 to 1010 km2, with an average value of 353 km2, median
of 267 km2 and a standard deviation of 276 km2. Some 9 catchments (9%) are in the range
of 8 to 50 km2, 9 catchments (9%) are in the range of 51 to 100 km2, 52 catchments (54%)
are in the range of 101 to 500 km2 and 27 catchments (28%) are in the range of 501 to 1010
km2. The histogram of streamflow record lengths of the 96 stations is shown in Figure 13.
The distribution of catchment areas is shown in Figure 14. The geographical distribution of
the 96 stations is shown in Figure 16. There is no station in far western NSW that has
passed the selection criteria.
Figure 13 Distributions of streamflow record lengths of the selected 96 stations from NSW
Figure 14 Distributions of catchment areas of the 96 catchments from NSW
4.5.3 SENSITIVITY ANALYSIS - IMPACT OF RATING CURVE ERROR ON
FLOOD QUANTILE ESTIMATES
To assess the impact of rating curve error (expressed in terms of RR) on flood quantile
estimates, the FLIKE software, which implements the principles outlined in Kuczera
(1999a, b), is employed to fit the LP3 distribution using the Bayesian parameter fitting
procedure. In this application of FLIKE, no prior information is used with both the ‘no
rating curve error’ and the ‘rating curve error’ cases. The flow closest to RR = 1 is used as
the “anchor point” in the rating curve error model inbuilt in FLIKE. The flows greater than
RR = 1 are expected to be associated with measurement errors, i.e., the higher the RR value
for a data point, the greater the degree of rating curve extrapolation error associated with it
(see Figure 7). In the flood frequency analysis using FLIKE for the ‘rating error’ case, less
weight is assigned to the flow data points beyond the anchor point (which represents higher
flows).
Three cases are considered here for illustration purposes where flows in excess of the
anchor point are corrupted by a multiplicative error assumed to be log-normally distributed
with mean one and coefficient of variation (CV) equal to 10%, 20% and 30%. Also, four
different values of maximum RR are considered (5, 10, 20 and 40). Four stations from the
database for VIC and NSW are selected with maximum RR values in the range of 5-40:
Station 210040 (RR = 5), Station 222213 (RR = 10), Station 234209 (RR = 20) and Station
221201 (RR = 40).
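The multiplicative error model in this experiment can be sketched as follows; the conversion of the specified CV into log-normal parameters with mean one uses sigma^2 = ln(1 + CV^2) and mu = -sigma^2/2, and the flow values and anchor are hypothetical:

```python
import math
import random

def corrupt_flows(flows, anchor, cv, seed=1):
    """Corrupt flows above the anchor discharge with a multiplicative
    log-normal error of mean 1 and coefficient of variation cv; flows at
    or below the anchor are returned unchanged."""
    rng = random.Random(seed)
    sigma2 = math.log(1.0 + cv ** 2)  # log-normal parameters for mean 1
    mu = -0.5 * sigma2
    out = []
    for q in flows:
        if q > anchor:
            out.append(q * rng.lognormvariate(mu, math.sqrt(sigma2)))
        else:
            out.append(q)
    return out

# Hypothetical annual maxima; flows above the 100-unit anchor are corrupted
flows = [50.0, 120.0, 300.0, 80.0, 900.0]
noisy = corrupt_flows(flows, anchor=100.0, cv=0.30)
```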
Table 1 presents the flood quantile estimates using FLIKE for four scenarios where the
coefficient of variation of the multiplicative error (CV) equals 0%, 10%, 20% and 30%.
For each of these four scenarios, stations with maximum RR values of 5, 10, 20 and 40 are
analysed. The table gives the expected quantile and the lower and upper 95% confidence
limits for the 50- and 100-year floods. To assist interpretation, the results for the cases
where rating curve error is assumed present (i.e., CV > 0) are expressed as ratios of the
case CV > 0 to the case CV = 0.
Table 1 Flood quantile estimates and associated errors using ARR FLIKE with and without consideration of rating curve error (MMF = maximum measured flow)

Station  Max.  ARI of    Rating error CV = 0%      Ratio CV=10%/CV=0%      Ratio CV=20%/CV=0%      Ratio CV=30%/CV=0%
         RR    MMF (yr)  LL 95%  Expected  UL 95%  Expected %  CL width %  Expected %  CL width %  Expected %  CL width %

50-year flood quantile
210040   5     2.77      778     1567      4753    102.5       105.2       111.4       123.5       121.4       147.3
222213   10    1.80      101     175       416     101.6       109.0       103.4       118.3       104.2       133.1
234209   20    1.03      22      28        46      134.4       189.0       133.7       193.3       149.0       244.2
221201   40    3.77      281     397       693     108.7       126.8       120.0       167.0       138.1       224.7

100-year flood quantile
210040   5     2.77      1018    2270      8854    103.5       105.2       114.5       126.6       127.0       146.9
222213   10    1.80      123     235       682     102.2       107.8       104.1       116.7       105.2       132.1
234209   20    1.03      23      30        56      137.9       185.4       142.5       196.7       161.3       250.4
221201   40    3.77      321     465       912     111.4       129.9       126.1       176.4       149.3       249.0
The results show that the width of the 95% quantile confidence limits increases with
increasing rating curve error CV reflecting the fact that errors in estimating the bigger flood
flows reduce the information content of the higher flows. Indeed, in the worst case, the
confidence limit width increases to about 250% of its no-error value. Moreover, the bias in
quantile estimates increases with increasing CV, in some cases reaching 50% to 60%. This
confirms the soundness of eliminating stations judged to have poor quality ratings.
Of interest is the relationship of quantile bias and accuracy with maximum RR. It appears
that as the maximum RR increases, the bias and uncertainty in the quantiles tends to grow
for a given rating curve error CV. The trend is somewhat obscured by the fact that the ARI
of the maximum measured flow (i.e. the anchor point) varies. As the ARI of the anchor
point grows fewer flows are affected by rating curve errors; for example, if the ARI of the
anchor point is 2 years, then half of the data will lie below the anchor point, largely
unaffected by rating curve error. Thus one can see that Station 221201, which has a
maximum RR of 40 but an anchor point ARI of 3.77 years, has similar bias and accuracy to
Station 234209, which has a lower maximum RR of 20 but an anchor point ARI of 1.03 years.
Although this analysis is not conclusive, it does suggest that stations with high maximum
RR values are likely to be problematic unless some form of compensation for rating curve
error is made.
4.6 SUMMARY RESULTS OF STREAMFLOW DATA PREPARATION FOR
THE OTHER STATES
The methods applied in section 4.5 are also applied to gauged flood data across the rest of
the Australian continent. In this section we present the summary results for QLD, TAS, Northern
Territory (NT), Western Australia (WA) and South Australia (SA). Further results can be
found in Rahman et al. (2009 and 2011b). This section also presents a summary of the final
catchments adopted for this study.
4.6.1 TASMANIA
A total of 53 catchments have been selected from TAS. The record lengths of annual
maximum flood series of these 53 stations range from 19 to 74 years (mean: 30 years,
median: 28 years and standard deviation: 10.43 years). The catchment areas of the selected
53 catchments range from 1.3 km2 to 1900 km2 (mean: 323 km2 and median: 158 km2). The
geographical distribution of the selected 53 catchments is shown in Figure 16.
4.6.2 QUEENSLAND
A total of 172 catchments have been selected from QLD. The record lengths of annual
maximum flood series of these 172 stations range from 25 to 97 years (mean: 41 years,
median: 36 years and standard deviation: 15.2 years). The catchment areas of the selected
172 catchments range from 7 km2 to 963 km2 (mean: 325 km2, median: 254 km2). The
geographical distribution of the selected 172 catchments is shown in Figure 16.
4.6.3 SOUTH AUSTRALIA
A total of 29 catchments have been selected from SA. The record lengths of annual
maximum flood series of these 29 stations range from 18 to 67 years (mean: 36 years,
median: 34 years and standard deviation: 11.2 years). The catchment areas of the selected
29 catchments range from 0.6 km2 to 708 km2 (mean: 170 km2 and median: 76.5 km2). The
geographical distribution of the selected 29 catchments is shown in Figure 16.
4.6.4 NORTHERN TERRITORY
A total of 55 catchments have been selected from NT. The record lengths of annual
maximum flood series of these 55 stations range from 19 to 54 years (mean: 35 years,
median: 33 years and standard deviation: 11.30 years). The catchment areas of the selected
55 catchments range from 1.4 km2 to 4,325 km2 (mean: 682 km2 and median: 360 km2).
The geographical distribution of the selected 55 catchments is shown in Figure 16.
4.6.5 WESTERN AUSTRALIA
A total of 146 catchments have been selected from WA. The record lengths of annual
maximum flood series of these 146 stations range from 20 to 57 years (mean: 31 years,
median: 30 years and standard deviation: 8.02 years). The catchment areas of the selected
146 catchments range from 0.1 km2 to 7,405.7 km2 (mean: 323 km2 and median: 60 km2).
The geographical distribution of the selected 146 catchments is shown in Figure 16.
4.6.6 SUMMARY OF STREAMFLOW DATA AUSTRALIA WIDE
A total of 682 catchments have been selected from all over Australia. The record lengths of
the annual maximum flood series of these 682 stations range from 18 to 97 years (mean: 35
years, median: 33 years and standard deviation: 11.5 years). The distribution of record
lengths is shown in Figure 15 (a).
The catchment areas of the selected 682 catchments range from 0.1 km2 to 7,405.7 km2
(mean: 350 km2, median: 214 km2). The geographical distribution of the selected 682
catchments is shown in Figure 16. The distribution of catchment areas of these stations is
shown in Figure 15 (b).
Figure 15 (a) Distribution of annual maximum flood record lengths of 682 stations from all over Australia (b) Distribution of catchment areas of 682 stations from all over Australia
Figure 16 Geographical distributions of the selected 682 stations from all over Australia
The summary of all the Australian data prepared as a part of this study is provided in Table
2.
Table 2 Summary of selected stations Australia wide

State         No. of stations  Median streamflow record length (years)  Median catchment size (km2)
NSW and ACT   96               34                                       267
VIC           131              33                                       289
SA            29               34                                       76.5
TAS           53               28                                       158
QLD           172              36                                       254
WA            146              30                                       60
NT            55               33                                       360
Total         682              -                                        -
4.7 SELECTION AND ABSTRACTION OF CATCHMENT CHARACTERISTICS
Catchment characteristics used in many previous RFFA studies were summarised by
Rahman (1997). He grouped the catchment characteristics under the headings of climatic
characteristics, morphometric characteristics, catchment cover and land use characteristics,
geological and soil characteristics, catchment storage characteristics, and location
characteristics. Many catchment characteristics are highly correlated, and the inclusion of
strongly correlated variables in prediction equations does not add any new information; it
also causes problems in statistical analysis (e.g. multicollinearity). The following
guidelines can be useful in making a reasonable selection:
- The characteristics should have a plausible role in flood generation.
- They should be unambiguously defined.
- Characteristics should be easily obtainable. When a simpler characteristic and a complex one are correlated and have similar effects, the simpler characteristic should be chosen.
- If a derived/combined characteristic is used, it should have a simple physical interpretation.
- The characteristics in the selected set should not be highly correlated, because this results in unstable parameters in hydrologic regression analysis.
- The prediction performance of a characteristic in other regionalisation studies should be taken into account, as this can give some general idea regarding the importance of the characteristic.
Based on the hydrological significance, correlations and ease of the data abstraction, eight
catchment characteristics are included in this study as listed in Table 3, and described
below.
Catchment area: Catchment area is the main scaling factor in the flood process and
directly affects the potential flood magnitude from a given storm event. The total volume of
runoff (Q) is proportional to the catchment area (A), following the general form:

Q = cA^m        (4.2)
where the exponent m varies from 0.5 to 1.00.
Table 3 Catchment characteristics variables used in the study
Catchment characteristics
1. area: Catchment area (km2)
2. I: Design rainfall intensity (mm/h)
3. rain: Mean annual rainfall (mm)
4. evap: Mean annual areal potential evapotranspiration (mm)
5. S1085: Slope of the central 75% of mainstream (m/km)
6. sden: Stream density (km/km2)
7. forest: Fraction of catchment area under forest.
8. qsa: Fraction quaternary sediment area (VIC only).
Almost all of the reported RFFA studies have found catchment area to be very significant.
One of the reasons why the area variable has been so useful in statistical hydrology is its
association with other significant morphometric characteristics like slope, stream length
and stream order. Area was characterised by Anderson (1957) as the ‘devil’s own variable’,
because almost every watershed characteristic is correlated with it. As with area itself,
the mean annual flood is directly related to other morphometric characteristics, which
are in turn closely correlated with area.
In this study, catchment area is obtained from 1:100,000 topographic maps which are
readily available for large parts of Australia.
Rainfall intensity: Storm rainfall intensity (IARI,d), for an appropriate burst duration (d) and
average recurrence interval (ARI), has been found to be the most significant climatic
predictor in previous RFFA studies. This is to be expected given the strong causal link
between rainfall intensity and peak flow. Importantly, these data are simple to obtain from
published sources (e.g. ARR1987 Volume 2).
The use of rainfall intensity requires the selection of an appropriate storm burst duration
and ARI. It seems to be logical to use a design rainfall intensity with a duration equal to the
time of concentration (tc), as suggested in the probabilistic rational method (I.E. Aust.,
1987, 2001). This is because as catchment area gets bigger, tc gets longer, which results in
smaller average design rainfall intensity. However, there are different methods to estimate
tc e.g. Bransby Williams formula and Friend formula (I.E. Aust., 2001). For consistency,
and ease of application, the formula recommended in ARR 1987 for VIC and eastern NSW,
given by Equation 4.3, is adopted in this study.
tc = 0.76 A^0.38        (4.3)

where tc is the time of concentration in hours and A is the catchment area in km2.
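A minimal illustration of Equation 4.3 (the function name is illustrative):

```python
def time_of_concentration(area_km2):
    """ARR 1987 formula for VIC and eastern NSW (Equation 4.3):
    tc = 0.76 * A**0.38, with tc in hours and A in km2."""
    return 0.76 * area_km2 ** 0.38

tc = time_of_concentration(100.0)  # about 4.4 hours for a 100 km2 catchment
```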
In addition to the design rainfall intensity for a given ARI and tc (IARI,tc), rainfall intensities
with fixed durations and ARIs are also trialled, e.g. intensities for ARIs of 2 and 50 years
and durations of 1 and 12 hours.
The various design rainfall intensities data for the selected study catchments are obtained
using the intensity frequency duration (IFD) Calculator on the BOM website or the design
data in ARR Volume 2.
Mean annual rainfall: Mean annual rainfall has been used frequently in many previous
RFFA studies. It may not have a direct link with flood peak, but it acts as a surrogate for
some other characteristics (e.g. vegetation and wetness index) and is readily available.
Thus, mean annual rainfall is included as a predictor variable in this study. The data for the
mean annual rainfall for each catchment is extracted from the BOM Data CD of Annual
Rainfall.
Mean annual evaporation: This relates to the main loss component in the rainfall-runoff
process. It is readily available and thus is included in this study. The mean annual areal
potential evapotranspiration data for each catchment is extracted from the BOM Data CD
of Evaporation.
Slope: Slope is significant for any gravitational flow. With other catchment characteristics
held constant the steeper the slope the greater the velocity of flow. Both overland and
channel slope are important. Overland slope influences the velocity of shallow surface
flow; hence, it can be expected to be of more importance for smaller catchments where the
time spent in overland flow is a significant percentage of the total time needed for water to
reach the catchment outlet. For larger catchments, channel slope is relatively more
important than overland slope.
There are several measures of slope; the most common of these are:
Equal area slope: This is the slope of a straight line drawn on a profile of a stream such
that the line passes through the outlet and has the same area under and above the stream
profile.
Average slope: This is equal to the total relief of the main stream divided by its length.
S1085: This excludes the extremes of slope that can be found at either end of the
mainstream. It is the ratio of the difference in elevation of the stream bed at 85% and 10%
of its length from the catchment outlet to 75% of the main stream length.
Areal slope: This involves measuring the slope at a large number of points within a
catchment and then determining an average areal slope.
Taylor and Schwarz (1952) slope: This assumes that velocity in each reach of a subdivided
mainstream is related via the Manning’s equation to the square root of slope. This index is
equivalent to the slope of a uniform channel having the same length as the longest water
course and an equal time of travel.
In previous studies Strahler (1950) has shown that the overland slope and channel slope are
strongly correlated. Benson (1959) found that S1085 gave the best prediction of the mean
annual flood. The S1085 is closely correlated with the Taylor and Schwarz slope (NERC,
1975).
From the different measures of slope, S1085 is deemed adequate and the simplest to
estimate from 1:100,000 topographic maps and thus has been adopted in this study.
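The S1085 definition above can be expressed directly (a sketch; the units and function name are assumptions — elevations in metres and stream length in kilometres, giving a slope in m/km):

```python
def s1085(elev_85_m, elev_10_m, main_stream_length_km):
    """S1085: difference in stream-bed elevation between the points at 85%
    and 10% of the main stream length from the catchment outlet, divided
    by 75% of the main stream length."""
    if main_stream_length_km <= 0:
        raise ValueError("stream length must be positive")
    return (elev_85_m - elev_10_m) / (0.75 * main_stream_length_km)

# e.g. bed elevations of 450 m (at 85%) and 150 m (at 10%) on a 20 km stream:
# (450 - 150) / (0.75 * 20) = 20 m/km
```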
Stream density: This is directly related to drainage efficiency of a catchment, and has been
included in this study where possible. Stream density is defined as the total stream length,
taken as the sum of the lengths of all the blue lines in a catchment as shown on 1:100,000
topographic maps, divided by the catchment area. The length of the blue lines
can be measured by opisometer/electronic distance meter or can be obtained using GIS.
Stream density is not easy to measure and also the measured value depends on the map
scale used. It should be retained in the final prediction equation only if it delivers
significantly improved design flood estimates. Also, if it is used in final flood prediction
equations, the procedure should specify the map scale to be used in its measurement.
Forest area: The effect of vegetation on catchment response has been studied by many
researchers (e.g. Flavell and Belstead, 1986; Williamson and Van der Wel, 1991; Flavell,
1982). Forest reduces runoff by precipitation interception and transpiration. For a surface
without a canopy or leaf litter layer, the interception loss is lower and overland flow travels
more rapidly with less opportunity time for infiltration. Hence, Flavell (1982) found that
losses from rainfall decrease with increased clearing and that the runoff coefficient of the
rational method increases with increased clearing. Fraction forest cover has been included
in this study. The fraction of catchment covered by forest is estimated on 1:100,000
topographic maps by using a planimeter to measure the areas designated as dense and
medium forest, and dense and medium scrub.
Quaternary sediment area (VIC only): Storage directly affects the shape of the flood
hydrograph, however defining storage as a single parameter is difficult. Quaternary
sediment area appears to be an influential surrogate for storage, because it is a good
indicator of floodplain extent variability in a catchment. Values for quaternary sediment
area are determined from 1:250,000 geological maps.
4.8 SUMMARY
The first part of this chapter has examined various aspects of the streamflow data collation
adopted for this thesis. A total of 682 catchments have been selected from the continent of
Australia (excluding the arid region; see Figures 5 and 16). The annual instantaneous
maximum flood series of the stations have been collected, gaps filled, rating curve
extrapolation errors identified, trends and shifts in the data identified and outlier points
censored. A sensitivity analysis has also been undertaken to understand the impacts of
rating curve error on flood quantile estimation. The second part of this chapter has
examined the candidate catchment characteristics for this study; a brief explanation has
been given of each variable and how these data have been obtained. All the variables
listed in Table 3 are used in the analyses presented in the subsequent chapters of this thesis.
CHAPTER 5: RESULTS – RFFA BASED ON FIXED REGIONS AND
REGION OF INFLUENCE APPROACHES UNDER THE QUANTILE
AND PARAMETER REGRESSION FRAMEWORKS
5.1 GENERAL
This chapter develops flood prediction equations (for 6 average recurrence intervals
(ARIs), which are 2, 5, 10, 20, 50 and 100 years) using both a fixed region and region of
influence (ROI) approach in a quantile regression technique (QRT) and parameter
regression technique (PRT) framework. The ROI approach is adopted to reduce the degree
of heterogeneity present in Australian annual maximum flood regions to enhance the
accuracy in design flood estimates. The Bayesian generalised least squares regression
(BGLSR) technique is adopted for the parameter estimation which explicitly accounts for
the inter-station correlation present in the annual maximum flood series (AMFS) data and it
distinguishes between the sampling and model errors in regression analysis. The developed
prediction equations allow for design flood or flood statistic estimates to be made at an
ungauged catchment given the relevant catchment characteristics data. To assess the
performances of the developed prediction equations, a Leave-one-out (LOO) validation
procedure is adopted. The basic theory and assumptions associated with the QRT and PRT
in a ROI BGLSR framework have been discussed in Chapter 3.
5.1.1 PUBLICATIONS
Four journal papers (ERA, ranks A*, A, B and B) have been published based on the results
presented in this chapter. These journal papers are given in Appendix A and noted below:
Haddad, K. and Rahman, A. (2012). Regional flood frequency analysis in eastern
Australia: Bayesian GLS regression-based methods within fixed region and ROI
framework: Quantile Regression vs. Parameter Regression Technique. Journal of
Hydrology, 430-431, 142-161.
Haddad, K., Rahman, A. and Stedinger, J. R. (2012). Regional Flood Frequency Analysis
using Bayesian Generalized Least Squares: A Comparison between Quantile and Parameter
Regression Techniques. Hydrological Processes, 25, 1-14.
Haddad, K., Rahman, A. and Kuczera, G. (2011). Comparison of Ordinary and
Generalised Least Squares Regression Models in Regional Flood Frequency Analysis: A
Case Study for New South Wales. Australian Journal of Water Resources, 15(2), 1-12.
Haddad, K., Zaman, M. and Rahman, A. (2010b). Regionalisation of skew for flood
frequency analysis: a case study for eastern NSW. Australian Journal of Water Resources,
14(1), 33-41.
5.2 RESULTS FOR TASMANIA
5.2.1 SELECTING PREDICTOR VARIABLES WITH QRT AND PRT
A total of 53 catchments were used from Tasmania for the analyses presented here. The
locations of these catchments are shown in Figure 16. The AMFS record lengths of these
53 stations range from 19 to 74 years (mean 30 years, median 28 years and standard
deviation 10 years). The catchment areas of these 53 stations range from 1.3 to 1,900 km2
(mean 323 km2, median 158 km2 and standard deviation 417 km2).
In the fixed region approach, all the 53 catchments were considered to form one
region; however, one catchment was left out for cross-validation and the procedure was
repeated 53 times to implement the LOO validation. Hence, the model data set contained
52 catchments in each iteration step. In the ROI approach, an optimum region was formed
for each of the 53 catchments by starting with 15 stations in the first proposed region and
then consecutively adding 1 station at each iteration step.
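The ROI search described above (start with the 15 nearest stations, then enlarge the candidate region one station at a time and keep the size that minimises the regional model error variance) can be sketched generically; `distance` and `model_error_variance` are caller-supplied stand-ins for the BGLSR machinery of Chapter 3, not functions from the thesis:

```python
def select_roi(target, sites, distance, model_error_variance, start=15, step=1):
    """Return the candidate region (list of sites) with the smallest model
    error variance (MEV), grown outwards from `target` by nearness."""
    # Rank all other sites by their distance to the target site.
    ranked = sorted((s for s in sites if s != target),
                    key=lambda s: distance(target, s))
    best_region = ranked[:start]
    best_mev = model_error_variance(best_region)
    size = start
    # Enlarge the region stepwise, keeping the size with the lowest MEV.
    while size + step <= len(ranked):
        size += step
        candidate = ranked[:size]
        mev = model_error_variance(candidate)
        if mev < best_mev:
            best_region, best_mev = candidate, mev
    return best_region, best_mev
```

With a toy distance and a MEV surrogate that is minimised at six sites, the search recovers a six-site region.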
Table 4 shows the different combinations of predictor variables for the Q10 QRT model and
the models for the first three parameters of the log Pearson Type 3 (LP3) distribution.
Figures 17 and 18 show example plots of the statistics used in selecting the best set of
predictor variables for the Q10 and skew models. According to the model error variance
(MEV), combinations 6, 16, 18, 20, 17, 19 and 4 were potential sets of predictor variables
for the Q10 model. Combinations 16, 18, 20, 17, 19 and 4 contained 3 to 4 predictor
variables, while combinations 6 and 4 contained 2 predictor variables. Indeed, combination
6 with the 2 predictor variables (area and design rainfall intensity 50I12) showed the lowest
MEV and the highest pseudo coefficient of determination (R2GLS). The average variance of
prediction old (AVPO), average variance of prediction new (AVPN), Akaike information
criterion (AIC) and Bayesian information criterion (BIC) values favour combination 6 as well.
Combination 6 was compared to combination 10 (the latter also contains 2 predictor
variables, area and design rainfall intensity Itc,10). Combination 6 had a smaller MEV while
also showing the regression coefficient for variable 50I12 to be 5.5 times the posterior
standard deviation away from zero, as compared to 4 times for Itc,10. Hence, combination 6
was finally selected as the best set of predictor variables for the Q10 model.
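The exhaustive comparison of predictor combinations can be sketched with an ordinary least squares stand-in for the BGLSR-based statistics (MEV, AVPO/AVPN, AIC, BIC); the function names are illustrative and BIC is used here as the single selection criterion:

```python
import itertools
import math
import numpy as np

def ols_aic_bic(X, y):
    """AIC and BIC for an OLS fit, from the Gaussian log-likelihood."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / n
    loglik = -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)
    return 2.0 * k - 2.0 * loglik, k * math.log(n) - 2.0 * loglik

def best_combination(candidates, y):
    """Try every subset of candidate predictors (always with a constant)
    and return (BIC, names) for the lowest-BIC combination."""
    n = len(y)
    best = (math.inf, ())
    for r in range(len(candidates) + 1):
        for combo in itertools.combinations(candidates, r):
            X = np.column_stack([np.ones(n)] + [candidates[c] for c in combo])
            _, bic = ols_aic_bic(X, y)
            if bic < best[0]:
                best = (bic, combo)
    return best
```

On synthetic data where only one candidate carries signal, the BIC search keeps that predictor and penalises the noise variable.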
For the skew model, combination 4 showed the lowest MEV (0.034) and the highest R2GLS
(52%) (Figure 18), as well as the lowest AIC and BIC. Combination 1 without any
explanatory variables ranked 13 out of the 16 possible combinations (MEV of 0.045); it
also showed higher AVPO and AVPN as compared to combination 4, hence combination 4
was finally selected.
A similar procedure was adopted in selecting the best set of predictor values for other
models with the QRT and PRT. The sets of predictor variables selected as above were used
in the LOO validation with fixed regions and ROI approaches.
The Bayesian plausibility values (BPV) for the regression coefficients associated with the
QRT over all the ARIs were between 2% and 8% for the variable area and 0.000% for
design rainfall intensity 50I12. This justifies the inclusion of predictor variables area and
50I12 in the prediction equations for QRT. The BPVs for the skew model were 23% and
11% for area and 50I1, respectively, indicating these variables are not very good predictors
for skew. The BPVs for the mean model were close to 1% for both the predictor variables.
For the standard deviation model, the BPV for the predictor variable rain was 1%.
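Assuming an approximately normal posterior, a BPV like those quoted above can be approximated as the two-sided tail probability that the coefficient's sign differs from its posterior mean (a sketch, not the exact Bayesian computation of Chapter 3):

```python
import math

def bayesian_plausibility_value(post_mean, post_sd):
    """Two-sided tail probability 2*(1 - Phi(|mean|/sd)) for a coefficient
    with an approximately normal posterior; small values support keeping
    the predictor in the regression."""
    if post_sd <= 0:
        raise ValueError("posterior standard deviation must be positive")
    z = abs(post_mean) / post_sd
    # Identity: 2*(1 - Phi(z)) = erfc(z / sqrt(2))
    return math.erfc(z / math.sqrt(2.0))

# A coefficient 5.5 posterior standard deviations from zero (as reported for
# 50I12 in the Q10 model) gives a BPV of effectively 0%.
```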
Regression equations developed for the QRT and PRT for the fixed region are given by
Equations 5.1 to 5.9:
ln(Q2) = 4.18 + 0.91(area) + 3.35(50I12) (5.1)
ln(Q5) = 4.59+ 0.89(area) + 2.80(50I12) (5.2)
ln(Q10) = 4.87 + 0.85(area) + 2.57(50I12) (5.3)
ln(Q20) = 5.09 + 0.84(area) + 2.39(50I12) (5.4)
ln(Q50) = 5.45 + 0.84(area) + 2.23(50I12) (5.5)
ln(Q100) = 5.48 + 0.82(area) + 2.02(50I12) (5.6)
ln(Q̄) = 4.00 + 0.90(area) + 3.85(2I12) (5.7)
stdev = 0.64 + 0.55(rain) (5.8)
skew = – 0.05 + 0.07(area) + 1.20(50I1) (5.9)
It is reassuring to observe that the regression coefficients in the QRT set of equations vary
in a regular fashion with increasing ARI.
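Under the PRT, the regional estimates of the mean, standard deviation and skew of the log flows (Equations 5.7 to 5.9) must still be converted into a flood quantile. A sketch using the Wilson–Hilferty approximation to the LP3 frequency factor (an approximation chosen here for brevity; the thesis obtains quantiles from the fitted LP3 distribution as described in Chapter 3):

```python
import math
from statistics import NormalDist

def lp3_quantile(mean_ln, std_ln, skew, ari):
    """Flood quantile Q_ARI from the first three LP3 parameters of ln(Q),
    using the Wilson-Hilferty approximation for the frequency factor K."""
    p = 1.0 - 1.0 / ari              # annual non-exceedance probability
    z = NormalDist().inv_cdf(p)      # standard normal quantile
    if abs(skew) < 1e-8:
        k = z                        # zero skew: LP3 reduces to lognormal
    else:
        k = (2.0 / skew) * ((1.0 + skew * z / 6.0 - skew ** 2 / 36.0) ** 3 - 1.0)
    return math.exp(mean_ln + k * std_ln)
```

For zero skew and ARI = 2 years the frequency factor vanishes and the quantile is simply exp(mean of log flows), a useful sanity check.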
Table 4 Different combinations of predictor variables considered for the QRT models and
the parameters of the LP3 distribution (QRT and PRT fixed region Tasmania)
Combination   Combinations for mean, standard deviation & skew models   Combinations for flood quantile models
1 Const Const
2 Const, area Const, area
3 Const, area, (2I1) Const, area, 2I1
4 Const, area, (50I1) Const, area, 2I12
5 Const, area, (2I12) Const, area, 50I1
6 Const, area, (50I12) Const, area, 50I12
7 Const, area, rain Const, area, rain
8 Const, area, forest Const, area, forest
9 Const, area, evap Const, area, forest, evap
10 Const, area, S1085 Const, area, Itc,ARI
11 Const, area, sden Const, area, evap
12 Const, sden, rain Const, area, S1085
13 Const, forest, rain Const, area, sden
14 Const, S1085, forest Const, sden, rain
15 Const, evap Const, forest, rain
16 Const, rain, evap Const, area, 50I12, rain
17 Const, rain Const, area, 50I12, sden
18 - Const, area, 50I12, rain, evap
19 - Const, area, 50I12, Itc,ARI, evap
20 - Const, area, 50I12, Itc,ARI, rain, evap
21 - Const, area, 50I12, Itc,ARI, sden
22 - Const, area, 50I12, Itc,ARI, S1085
23 - Const, area, Itc,ARI, evap
24 - Const, area, Itc,ARI, rain
25 - Const, area, 2I1, Itc,ARI
Figure 17 Selection of predictor variables for the BGLSR model for Q10 (QRT, fixed region Tasmania).
MEV = model error variance, AVPO = average variance of prediction (old), AVPN = average variance
of prediction (new), AIC = Akaike information criterion, BIC = Bayesian information criterion; note
R2GLS uses the right-hand axis
Figure 18 Selection of predictor variables for the BGLSR model for skew
5.2.2 PSEUDO ANOVA WITH QRT AND PRT MODELS FOR THE FIXED AND
ROI REGIONS
The pseudo analysis of variance (ANOVA) tables for the Q20 and Q100 models and the
parameters of the LP3 distribution are presented in Tables 5 – 9 for the fixed regions and
ROI. This is an extension of the ANOVA in ordinary least squares regression (OLSR)
which does not recognise and correct for the expected sampling variance (Reis et al., 2005).
For the LP3 parameters, the sampling error increases as the order of moment increases i.e.
the error variance ratio (EVR) increases with the order of the moments. An EVR of greater
than 0.20 may indicate that the sampling variance is not negligible when compared to the
model error variance, which suggests the need for a GLSR analysis (Gruber et al., 2007).
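The EVR itself is just the ratio of the sampling-error to the model-error sum of squares from the pseudo ANOVA table; the Q20 figures in Table 5 reproduce the quoted values:

```python
def error_variance_ratio(sampling_error_ss, model_error_ss):
    """EVR from a pseudo ANOVA table. Values above about 0.20 indicate
    that sampling variance is not negligible relative to model error
    variance, favouring a GLSR analysis over OLSR."""
    if model_error_ss <= 0:
        raise ValueError("model error sum of squares must be positive")
    return sampling_error_ss / model_error_ss

# Table 5 (Q20): fixed region 2.08/15.5 -> 0.13; ROI 1.99/12.2 -> 0.16
```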
The ROI shows a reduced MEV (i.e. a reduced heterogeneity) as compared to the fixed
regions, as fewer sites have been used. The model error dominates the regional analysis for
the mean flood and the standard deviation models for both the fixed regions and the ROI.
However, the ROI shows a higher EVR than the fixed region case, e.g. for the mean flood
model the EVR is 0.20 for the ROI and 0.06 for the fixed region (Table 7). For the standard
deviation model the EVR is 0.66 for the ROI and 0.54 for the fixed region, which is a 12%
increase in EVR (Table 8). This shows that the ROI indeed deals better with heterogeneity,
even if only slightly.
The EVR values for the skew model are 9 and 9.3 for the fixed regions and ROI
respectively (Table 9), which are much higher than the recommended limit of 0.20. Again
the GLSR should be the preferred modelling choice over the OLSR. Given that the skew
model has a high sampling error component, an OLSR model would give misleading
results. The advantage of GLSR is that it can distinguish between the variance due to
model error and sampling error as explained in Chapter 2. Importantly, the Bayesian
procedure adds another dimension to the analysis, by computing expectations over the
entire posterior distribution. It has provided a more reasonable estimate of the MEV where
the method of moments estimator would have grossly underestimated the model error
variance, as the sampling error overwhelms the analysis.
concerned, there is little change in the EVR as compared to the fixed region, as the skew
model tends to include more stations in the regional analysis.
Pseudo ANOVA tables were also prepared for the flood quantile models. For example,
Tables 5 and 6 show the results for the Q20 and Q100 models, respectively. Here the ROI
shows a higher EVR than the fixed region. This suggests that the BGLSR should be used
with ROI in developing the flood quantile models, especially as the ARI increases.
Table 5 Pseudo ANOVA table for Q20 model for Tasmania (QRT, fixed region and ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 3             k = 3               n(σ0² − σ²) = 34.3        37.5
Model error       n − k − 1 = 48    n − k − 1 = 30      nσ² = 15.5                12.2
Sampling error    N = 52            N = 34              tr[Λ(ŷ)] = 2.08           1.99
Total             2n − 1 = 103      2n − 1 = 67         Sum of the above = 51.9   51.7
EVR                                                     0.13                      0.16
Table 6 Pseudo ANOVA table for Q100 model for Tasmania (QRT, fixed region and ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 3             k = 3               30.7                      34.1
Model error       n − k − 1 = 48    n − k − 1 = 20      19.0                      15.7
Sampling error    N = 52            N = 52              3.3                       3.13
Total             2n − 1 = 103      2n − 1 = 103        Sum of the above = 53.0   52.9
EVR                                                     0.17                      0.2

Table 7 Pseudo ANOVA table for the mean flood model for Tasmania (PRT, fixed region
and ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 3             k = 3               n(σ0² − σ²) = 30.5        54.6
Model error       n − k − 1 = 48    n − k − 1 = 24      nσ² = 17.8                7.1
Sampling error    N = 52            N = 28              tr[Λ(ŷ)] = 1.13           1.02
Total             2n − 1 = 103      2n − 1 = 55         Sum of the above = 49.4   63
EVR                                                     0.06                      0.2

Table 8 Pseudo ANOVA table for the standard deviation model for Tasmania (PRT, fixed
region and ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 2             k = 2               3.6                       3.5
Model error       n − k − 1 = 49    n − k − 1 = 33      3.6                       3.3
Sampling error    N = 52            N = 52              1.9                       2.2
Total             2n − 1 = 103      2n − 1 = 103        Sum of the above = 9.1    9.0
EVR                                                     0.54                      0.66
Table 9 Pseudo ANOVA table for the skew model for Tasmania (PRT, fixed region and
ROI)
Source            Degrees of freedom                    Sum of squares
                  Fixed region      ROI                 Fixed region              ROI
Model             k = 3             k = 3               0.62                      1.80
Model error       n − k − 1 = 48    n − k − 1 = 46      1.74                      1.54
Sampling error    N = 52            N = 50              15.5                      14.4
Total             2n − 1 = 103      2n − 1 = 99         Sum of the above = 17.8   17.7
EVR                                                     9.0                       9.3

5.2.3 ASSESSMENT OF MODEL ASSUMPTIONS AND REGRESSION
DIAGNOSTICS

To assess the underlying model assumptions (i.e. the normality of residuals), the plots of
the standardised residuals vs. predicted values were examined. The predicted values were
obtained from LOO validation. Figures 19 and 20 show the plots for the flood quantile Q20
for the fixed region and ROI using the QRT and PRT framework. The underlying model
assumptions are satisfied to a large extent, as 95% of the standardised residual values fall
between the limits of ±2. The ROI shows standardised residuals closer to the ±2 limits.
The results in Figures 19 and 20 reveal that the developed equations satisfy the normality
of residuals assumption quite satisfactorily. Also, no specific pattern (heteroscedasticity)
can be identified, with the standardised values being almost equally distributed below and
above zero. Similar results were obtained for the skew, standard deviation and other flood
quantile models, which are not shown in this thesis due to space constraints.
Figure 19 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, fixed
region, Tasmania)
Figure 20 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, ROI,
Tasmania)
The QQ-plots of the standardised residuals (Equation 3.42) vs. normal score (Equation
3.43) for the fixed region (based on LOO validation) and ROI were also examined. Figures
21 and 22 present results for the Q20 flood quantile model, which show that all the points
closely follow a straight line. This indicates that the assumptions of normality and
homogeneity of variance of the standardised residuals have largely been satisfied. For
standardised residuals that are normally and independently distributed N(0,1), the slope of
the best-fit line in the QQ-plot, which can be interpreted as the standard deviation of the
normal scores (Z scores), should approach 1, and the intercept, which is the mean of the
normal scores, should approach 0, as the number of sites increases. It can be observed from
Figures 21 and 22 that the fitted lines
for the developed models pass through the origin (0, 0) and have a slope approximately
equal to one. The ROI approach approximates the normality of the residuals slightly better
(i.e. a better match with the fitted line) than the fixed region approach. Similar results were
also found for the mean, standard deviation, skew and other flood quantile models, which
are not shown in this thesis due to space constraints.
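The slope/intercept check described above can be automated (a sketch; the Blom-type plotting position used here is an assumption — the thesis computes the standardised residuals and normal scores from Equations 3.42 and 3.43):

```python
import numpy as np
from statistics import NormalDist

def qq_fit(standardised_residuals):
    """Fit a straight line to normal scores vs sorted standardised
    residuals; slope near 1 and intercept near 0 support the N(0,1)
    assumption."""
    r = np.sort(np.asarray(standardised_residuals, dtype=float))
    n = r.size
    pp = (np.arange(1, n + 1) - 0.375) / (n + 0.25)   # Blom plotting position
    z = np.array([NormalDist().inv_cdf(p) for p in pp])
    slope, intercept = np.polyfit(r, z, 1)
    return float(slope), float(intercept)
```

Applied to a genuinely standard normal sample, the fitted slope is close to 1 and the intercept close to 0, as the text argues.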
Figure 21 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, fixed
region, Tasmania)
Figure 22 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, ROI,
Tasmania)
To assess the adequacy of the BGLSR models, Cook’s distance values were also
calculated. No outlier/influential sites were found for the mean, standard deviation and
flood quantile models. For the skew model (Figure 23), sites 8 and 50 were above the
threshold value of 0.076 (i.e. 4/53, where 53 is the total number of sites). Site 8 showed the
largest standardised residual value. The flow data, site history and flood frequency plots of
these two sites were examined. It was found that site 8 had a record length of 33 years (in
the top 20%) and a very small annual maximum flow value in 1968, which was not
surprising as this was a drought year. This small flow caused a high negative skew of -1.60
for the site. Site 50 had a record length of 46 years (the 5th largest) and a skew
value of 1.15, and it showed the largest influence value (Figure 23). The regression analysis
was repeated by removing these two sites. Indeed, site 8 did influence the analysis, with a
notable decrease in the expected MEV (σ²) from 0.052 to 0.034. The AVPO and AVPN
dropped notably from 0.073 and 0.067 to 0.053 and 0.049, respectively. The R2GLS also
increased from 36% to 53%, which is a remarkable increase. The effective record length
based on the AVPN of 0.049 in this case is 122 years, nearly 4 times the average record
length for Tasmania. Site 8 therefore influenced the results notably and was removed from
the database in subsequent analyses. The removal of site 50 resulted in little improvement
in the skew model, with a negligible increase in R2GLS (to 55%) and a slightly smaller σ²
(0.032), so site 50 was retained.
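Cook's distance and the 4/n flagging rule can be sketched for an OLS fit (an OLS stand-in for the GLSR-based influence diagnostic used in the thesis):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance D_i for each site of an OLS fit, plus a boolean
    flag for D_i > 4/n (the threshold used for the skew model, 4/53)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    hat = X @ np.linalg.inv(X.T @ X) @ X.T       # hat (projection) matrix
    leverage = np.diag(hat)
    resid = y - hat @ y
    s2 = float(resid @ resid) / (n - k)          # residual variance
    d = resid ** 2 / (k * s2) * leverage / (1.0 - leverage) ** 2
    return d, d > 4.0 / n
```

A site that is both far from the bulk of the predictor values and off the regional trend is flagged, which mirrors how sites 8 and 50 were identified.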
Figure 23 Cook’s distance (Di) for locating outlier sites for skew model based on variable combination 4
The summary of various regression diagnostics (the relevant equations are described in
section 3.8) is provided in Table 10. This shows that for the mean flood model, the MEV
and average standard error of prediction (SEP) are much higher than those of the standard
deviation and skew models. This indicates that the mean flood model exhibits a higher
degree of heterogeneity than the standard deviation and skew models; this result is also
supported by the ANOVA analysis. Indeed, the issue here is that the sampling error becomes
larger as the order of the moment increases; therefore, in the case of the skew, the spatial
variation is a second-order effect that is not readily detectable. For the mean flood model, the
ROI shows a MEV which is 11% smaller than for the fixed region. Also, the R2GLS value for
the mean flood model with the ROI is 2% higher than for the fixed region. The reasonable
reduction in MEV alone indicates that the ROI should be preferred over the fixed region
analysis for developing the mean flood model. For the standard deviation model, ROI also
shows 8% smaller SEP and 5% higher R2GLS values. This indicates that the ROI is preferable
to the fixed region for the standard deviation model. What is also noteworthy (as seen from
Table 10) is that the SEP% for the skew model is slightly larger for the ROI than the fixed
region analysis. This may be due to the fact that, if the number of sites is reduced (smaller
ROI), the predictive variance may be slightly inflated in the skew region. The R2GLS values
for the skew models are similar for the fixed region and ROI, with the latter providing only
a 2% increase.
One can see from Table 10 that the SEP values for all the flood quantile models are 2% to
11% smaller for the ROI cases than the fixed region; the best result is obtained for ARI = 2
years. Also, the R2GLS values for ROI cases are 3% to 6% higher than the fixed region. These
results show that the ROI generally outperforms the fixed region approach.
Table 10 Regression diagnostics for fixed region and ROI for Tasmania
Model   Fixed region                            ROI
        MEV     AVP     SEP (%)   R2GLS (%)     MEV     AVP     SEP (%)   R2GLS (%)
Mean    0.35    0.37    67        86            0.24    0.27    56        88
Stdev   0.071   0.076   28        51            0.042   0.046   20        56
Skew    0.034   0.050   22        52            0.031   0.050   23        54
Q2      0.55    0.59    83        76            0.38    0.419   72        79
Q5      0.33    0.36    61        82            0.25    0.28    57        86
Q10     0.30    0.32    58        84            0.23    0.26    54        87
Q20     0.30    0.33    58        83            0.23    0.26    55        87
Q50     0.34    0.37    62        82            0.27    0.30    60        86
Q100    0.37    0.40    66        79            0.30    0.34    64        85
5.2.4 POSSIBLE SUBREGIONS IN TASMANIA
Table 11 shows the number of sites and associated MEVs for the ROI and fixed region
models. This shows that the ROI mean flood model has fewer sites on average (28 out of
52 sites i.e. 54%) than the standard deviation and skew models. The ROI skew model has
the highest number of sites which includes nearly all the sites in Tasmania (50 out of 52 i.e.
96%). The MEVs for all the ROI models (except the skew model) are smaller than the
fixed region models. This shows that the fixed region models experience a greater
heterogeneity than the ROI. If the fixed regions are made too large, the model error will be
inflated by heterogeneity that will go unaccounted for by the catchment characteristics.
Figure 24 shows the resulting sub-regions in Tasmania (with minimum MEVs) for the ROI
mean flood and skew models. For the mean flood and skew models, there are two distinct
sub-regions. The regions can be classified as east and west Tasmania for which there are
two distinct types of rainfall regimes and districts. The significance of this is that if spatial
variations do exist in the hydrological statistic of interest, they are most likely to be
captured by the ROI, as has been the case in this study for Tasmania. The results of this
analysis concur with previous studies (McConachy et al., 2003; Gamble et al., 1998;
Xuereb et al., 2001) which showed that large rainfalls over Tasmania are not
meteorologically homogeneous. In the east of the state, the largest rainfall events occur in
the warmer spring and summer months when low pressure systems in the Tasman Sea can
direct an easterly onshore air flow over Tasmania. The heaviest rainfalls in the west of the
state are due to the passage of fronts, sometimes associated with an intense extratropical
cyclone with a westerly or southwesterly airstream (Xuereb et al., 2001).
Table 11 Model error variances associated with fixed region and ROI for Tasmania (n =
number of sites in the region)
Parameter/quantile   Mean   Stdev   Skew    Q2     Q5     Q10    Q20    Q50    Q100
ROI n                28     36      50      30     35     35     34     33     33
ROI σ²               0.24   0.042   0.031   0.38   0.25   0.23   0.23   0.27   0.30
Fixed region n       52     52      52      52     52     52     52     52     52
Fixed region σ²      0.35   0.067   0.034   0.55   0.33   0.30   0.30   0.34   0.37
Figure 24 Spatial variations of the grouped minimum model error variances for Tasmania (a) mean
flood model and (b) skew model
5.2.5 EVALUATION STATISTICS
Table 12 presents the relative root mean square error (RMSEr) (Equation 3.45) and relative
error (REr) (Equation 3.44) values for the PRT and QRT models with both the fixed region
and ROI. In terms of RMSEr, ROI clearly gives smaller values than the fixed regions for all
the ARIs. The PRT-ROI shows smaller RMSEr values than the QRT-ROI for all the ARIs,
however for ARIs of 5, 10 and 20 years, the increase is noticeable (i.e. 20 to 30 %). In
terms of REr, ROI gives up to 9% smaller values than the fixed regions. The PRT-ROI
gives larger values of REr (by 13%) for both the 50 and 100 years ARIs. For ARIs of 2 to
20 years, the QRT-ROI gives smaller REr values (by 1% to 13%) than the PRT-ROI.
Finally, the results of counting the Qpred/Qobs (rr) ratios for the QRT and PRT for the ROI
and fixed regions are provided in Tables 13 and 14. The QRT-ROI has 85% of the rr values
in the desirable range, compared to 81% for the QRT-fixed region. The PRT-ROI has 78%
of the rr values in the desirable range, compared to 74% for the PRT-fixed region. These
results show that ROI performs better than the fixed regions with both the QRT and PRT.
The PRT-ROI shows 16% underestimation as compared to 8% for the QRT-ROI. The cases
with overestimation were very similar for both the methods.
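The evaluation statistics reported in Tables 12 to 14 can be computed as follows (a sketch; the limits of the "desirable" Qpred/Qobs band here are illustrative — the thesis defines the ranges in Chapter 3):

```python
import numpy as np

def evaluation_stats(q_pred, q_obs, lower=0.5, upper=2.0):
    """Relative RMSE (%), median absolute relative error (%) and the
    percentage of Qpred/Qobs ratios falling in the desirable band."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_obs = np.asarray(q_obs, dtype=float)
    rel = (q_pred - q_obs) / q_obs
    rmse_r = 100.0 * float(np.sqrt(np.mean(rel ** 2)))
    re_r = 100.0 * float(np.median(np.abs(rel)))
    rr = q_pred / q_obs
    in_band = 100.0 * float(np.mean((rr >= lower) & (rr <= upper)))
    return rmse_r, re_r, in_band
```

The median-based REr is robust to a single gross over- or underestimate, whereas the relative RMSE is dominated by it, which is why the two statistics (and the rr counts) are reported side by side.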
Table 12 Evaluation statistics (RMSEr and REr) from leave-one-out (LOO) validation for
Tasmania
Model   RMSEr (%)                                REr (%)
        PRT                 QRT                  PRT                 QRT
        Fixed      ROI      Fixed      ROI       Fixed      ROI      Fixed      ROI
        region              region               region              region
Q2      110        100      160        120       33         31       38         30
Q5      90         70       110        80        35         30       34         25
Q10     100        70       110        80        34         37       30         24
Q20     100        70       130        90        36         37       27         27
Q50     110        70       130        100       39         41       29         28
Q100    120        70       130        100       49         42       33         29
Table 13 Summary of counts/percentages based on the rr values for QRT and PRT for
Tasmania (fixed region). “U” = gross underestimation, “D” = desirable range and “O” =
gross overestimation
Model         Count (QRT)      Percent (QRT)    Count (PRT)      Percent (PRT)
              U    D    O      U    D    O      U    D    O      U    D    O
Q2            2    41   9      4    79   17     5    41   6      10   79   12
Q5            2    44   6      4    85   12     6    41   5      12   79   10
Q10           3    46   3      6    88   6      6    41   5      12   79   10
Q20           4    45   3      8    87   6      9    37   6      17   71   12
Q50           6    40   6      12   77   12     10   36   6      19   69   12
Q100          9    38   5      17   73   10     10   36   6      19   69   12
Sum/average   26   254  32     8    81   10     46   232  34     15   74   11
Table 14 Summary of counts/percentages based on the rr values for QRT and PRT for
Tasmania (ROI). “U” = gross underestimation, “D” = desirable range and “O” = gross
overestimation
ARI (years)   Count (QRT)      Percent (QRT)    Count (PRT)      Percent (PRT)
              U    D    O      U    D    O      U    D    O      U    D    O
2             3    45   4      6    87   8      6    43   3      12   83   6
5             2    45   5      4    87   10     7    42   3      13   81   6
10            3    45   4      6    87   8      9    41   2      17   79   4
20            4    45   3      8    87   6      9    40   3      17   77   6
50            6    42   4      12   81   8      9    39   4      17   75   8
100           6    42   4      12   81   8      9    39   4      17   75   8
Sum/average   24   264  24     8    85   8      49   244  19     16   78   6
5.3 SECTION SUMMARY
This section of the thesis has compared the fixed region and ROI approaches for the state
of Tasmania. A BGLSR approach was used to develop prediction equations for flood
quantiles of ARIs of 2 to 100 years (for QRT) and the first three parameters of the LP3
distribution (for PRT). It has been found that area and design rainfall intensity are
significant predictors for both the QRT and PRT based prediction equations. When
compared to the fixed region approach, the ROI with both QRT and PRT shows
improvements by reducing the negative influence of regional heterogeneity, with a
decrease in the model error variance, average standard error of prediction and an increase
in the average pseudo R²GLS. Both the standardised residual and QQ-plots of the ROI
approach satisfy the underlying model assumptions slightly better than those of the fixed
region. It has also been observed that both the QRT-ROI and PRT-ROI produce similar
average root mean square error, median relative error and median Qpred/Qobs ratio values.
Overall, the PRT-ROI and QRT-ROI have performed very similarly for Tasmania. The
ROI approach outperforms the fixed region approach for Tasmania.
5.4 RESULTS FOR NEW SOUTH WALES, VICTORIA AND QUEENSLAND
The analysis undertaken in this section makes use of observed AMFS data of catchments
ranging in areas from 3 to 1010 km2. The finally selected data set consists of n = 399
catchments (Figure 16) with AMFS record lengths ranging from 25 to 94 years (maximum
record length for New South Wales (NSW): 75 years, mean and standard deviation: 37 and
11 years, respectively; maximum record length for Victoria (VIC): 52 years, mean and
standard deviation: 33 and 5 years, respectively and maximum record length for
Queensland (QLD): 94 years, mean and standard deviation: 40 and 15 years, respectively).
In the fixed region approach, all the catchments within a state boundary were considered to
have formed one region; however, one catchment was left out for cross-validation and the
procedure was repeated n times to implement the LOO validation scheme. In the ROI
approach, an optimum region was formed for each of the n catchments by starting with 15
stations and then consecutively adding 5 stations at each iteration (see section 3.7 for more
details).
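The LOO scheme and the ROI region-growing search described above can be sketched as follows. This is an illustrative outline only, not the thesis code: `score` stands in for the BGLSR fit that returns the model error variance for a candidate region, and `dist_to_target` for whatever proximity measure ranks candidate sites; the starting size of 15 and the step of 5 follow the text.

```python
def loo_splits(sites):
    """Leave-one-out validation: each site in turn is withheld and the
    model is developed on the remaining sites (repeated n times)."""
    for i, test_site in enumerate(sites):
        train = sites[:i] + sites[i + 1:]
        yield test_site, train

def grow_roi(candidates, score, start=15, step=5):
    """Region of influence: rank candidate sites by proximity to the
    target site, start with `start` sites and add `step` sites at each
    iteration, keeping the region that minimises `score` (e.g. the
    model error variance of the fitted BGLSR model)."""
    ranked = sorted(candidates, key=lambda s: s["dist_to_target"])
    best_region, best_score = None, float("inf")
    n = start
    while n <= len(ranked):
        region = ranked[:n]
        s = score(region)
        if s < best_score:
            best_region, best_score = region, s
        n += step
    return best_region, best_score
```

In use, `score` would refit the regional regression for each trial region, so the optimum region size differs from site to site.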
5.4.1 SELECTING PREDICTOR VARIABLES WITH QRT AND PRT
The stepwise procedure for selecting the best set of catchment characteristics predictors
resulted in the following equations for the LP3 mean (μ), standard deviation (σ), skewness
(γ) and the flood quantiles (QARI) for each of the states of NSW, VIC and QLD. The
regression equations are presented in general form below, while the final results of the
equations for NSW are provided in Table 15. The final results for VIC and QLD can be seen
in Appendix B.

μ = β0 + β1(area) + β2(2I12)   for NSW, VIC and QLD   (5.10)
σ = β0 − β1(rain) − β2(S1085)   for NSW   (5.11)
γ = −β0 − β1(area) − β2(forest)   for NSW   (5.12)
σ = β0 − β1(rain) + β2(evap)   for VIC   (5.13)
γ = −β0 + β1(rain) − β2(evap)   for VIC   (5.14)
σ = β0 − β1(area) − β2(2I1)   for QLD   (5.15)
γ = −β0 − β1(50I72) + β2(rain)   for QLD   (5.16)
ln(QARI) = β0 + β1(area) + β2(Itc,ARI)   for NSW, VIC and QLD   (5.17)
Tables 16 and 17 summarise the model error variance (MEV) as expressed by its posterior
mean value, for the regional models of the three LP3 parameters and the flood quantiles Q2,
Q10 and Q100 for each of the selected combinations of catchment characteristics for NSW.
Table 15 Summary of the final BGLSR results for NSW

BGLSR model (NSW)        Regression coefficient   Posterior mean   Posterior standard deviation
Mean (μ)                 σ²                       0.29             0.051
                         β0 (constant)            4.09             0.092
                         β1 (area)                0.67             0.053
                         β2 (2I12)                2.31             0.21
Standard deviation (σ)   σ²                       0.067            0.013
                         β0 (constant)            1.25             0.12
                         β1 (rain)                -0.61            0.11
                         β2 (S1085)               -0.13            0.040
Skewness (γ)             σ²                       0.0125           0.012
                         β0 (constant)            -0.42            0.072
                         β1 (area)                -0.092           0.048
                         β2 (forest)              -0.094           0.053
Flood quantiles
QARI=2                   σ²                       0.31             0.055
                         β0 (constant)            4.06             0.13
                         β1 (area)                1.26             0.086
                         β2 (Itc,ARI=2)           2.42             0.24
QARI=5                   σ²                       0.23             0.042
                         β0 (constant)            5.11             0.092
                         β1 (area)                1.19             0.072
                         β2 (Itc,ARI=5)           2.08             0.20
QARI=10                  σ²                       0.23             0.045
                         β0 (constant)            5.56             0.10
                         β1 (area)                1.14             0.074
                         β2 (Itc,ARI=10)          1.93             0.21
QARI=20                  σ²                       0.25             0.050
                         β0 (constant)            5.91             0.11
                         β1 (area)                1.09             0.078
                         β2 (Itc,ARI=20)          1.79             0.22
QARI=50                  σ²                       0.35             0.060
                         β0 (constant)            6.55             0.13
                         β1 (area)                1.01             0.081
                         β2 (Itc,ARI=50)          1.73             0.24
QARI=100                 σ²                       0.35             0.075
                         β0 (constant)            6.47             0.34
                         β1 (area)                0.97             0.12
                         β2 (Itc,ARI=100)         1.50             0.29
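As an illustration of how the Table 15 posterior means would be applied in Equation (5.17), the sketch below evaluates the Q20 prediction equation for a hypothetical catchment. It assumes, as is common in this style of RFFA model though not stated explicitly in this extract, that area and Itc,ARI enter the equation as natural logarithms and that the log-space prediction is back-transformed by exponentiation; the catchment values used are invented for illustration only.

```python
import math

# Posterior mean coefficients for Q20 (Table 15, NSW, fixed region)
B0, B1, B2 = 5.91, 1.09, 1.79

def predict_q(area_km2, itc, b0=B0, b1=B1, b2=B2):
    """Evaluate ln(Q_ARI) = b0 + b1*ln(area) + b2*ln(Itc,ARI) and
    back-transform to Q. Log-transformed predictors are an assumption
    made for this illustration."""
    return math.exp(b0 + b1 * math.log(area_km2) + b2 * math.log(itc))

q_small = predict_q(area_km2=50.0, itc=2.0)   # hypothetical catchment
q_large = predict_q(area_km2=200.0, itc=2.0)  # larger catchment, same intensity
```

Because β1 and β2 are both positive in Table 15, the predicted quantile increases with both catchment area and design rainfall intensity.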
Also provided in Tables 16 and 17 is a summary of the statistical measures used (i.e.
AVPO, AVPN, AIC, BIC, BPV and pseudo R²GLS) to assess the best combination of
catchment characteristics to predict the three parameters and flood quantiles of the LP3
distribution. Figure 25 shows the MEV, the standard error of the MEV and the R²GLS values
for the skew model. Combination 9, with a constant and the two predictor variables area
and forest, showed the lowest MEV and the highest R²GLS, as well as the lowest AIC and
BIC values. However, the lowest AVPO and AVPN values were found for combination 1 (a
constant only, representing the intercept term in the regression model - see Figure 25).
The BPV values were used to carry out a hypothesis test (at the 5% significance level) on
the predictors of combination 9. The BPVs were found to be 6% and 7% for area and
forest, respectively, indicating that these variables are not significant, although these values
are not notably high. Both the posterior coefficients β1 and β2 were smaller than two
posterior standard deviations (for the respective case) away from zero, supporting the
result of the BPV test that these variables are not really significant.
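The "two posterior standard deviations" check applied above can be written compactly. A minimal sketch of the check as described in the text, using the Table 15 posterior moments: for the skew model, area (mean -0.092, sd 0.048) and forest (mean -0.094, sd 0.053) both fail the check, consistent with the BPV result, while area in the mean model (0.67, sd 0.053) passes it easily.

```python
def well_established(post_mean, post_sd, k=2.0):
    """True if the posterior mean lies more than k posterior standard
    deviations away from zero (the significance check used in the text)."""
    return abs(post_mean) > k * post_sd

# Table 15 posterior moments (NSW)
skew_area_ok = well_established(-0.092, 0.048)    # |-0.092| < 2*0.048 -> False
skew_forest_ok = well_established(-0.094, 0.053)  # |-0.094| < 2*0.053 -> False
mean_area_ok = well_established(0.67, 0.053)      # 0.67 > 2*0.053 -> True
```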
In this case, it may be possible to adopt a regional average skew value for the entire NSW
state without using any prediction equation/predictor variable in the regression equation.
This finding is consistent with Gruber and Stedinger (2008) who found that a constant
model for a regional skewness was the best model for a large region in the southeastern
part of the United States. This is also supported by the fact that there was only a modest
difference in the MEV values. Combinations 9 and 1, however, were both adopted and
tested in this study with the PRT approach.
A similar outcome was observed for the standard deviation model, where the MEVs were
very similar for combinations 12 and 1 (figure not shown due to space constraint).
Combination 12, which had slope (S1085) and rain as predictor variables, was adopted.
Indeed, the AVPO, AVPN, BIC and AIC values were the lowest for this combination. Both
the posterior coefficients β1 and β2 were well established in the regression equations, being
more than two times the respective posterior standard deviation away from zero. The BPVs
were 2%, indicating the relatively higher significance of these two variables.
For the mean flood, combination 6 (constant, area and 2I12) had the smallest MEV. The
posterior coefficients β1 and β2 in this combination were at least 5 and 11 times the
respective posterior standard deviation away from zero, which shows that β1 and β2 are well
established in the prediction equation. Indeed, all the statistical criteria were found to be in
favour of combination 6.
Figure 25 Selection of predictor variables for the BGLSR model for the skew (note that R²GLS uses the
right-hand axis; left panel: MEV, standard error of MEV and R²GLS; right panel: AVPO, AVPN, AIC and BIC)
Figure 26 shows an example plot of the statistics used in selecting the best set of predictor
variables for the fixed region flood quantile (QRT) models. According to the MEV values,
combinations 19, 18, 20, 23, 16, 6, 4, 25 and 10 were potential sets of predictor variables
for the Q10 model. Combinations 18, 19, 20 and 23 contained 3 to 4 predictor variables,
while combinations 16, 6, 4, 25 and 10 contained 2 predictor variables with similar MEVs
and R²GLS values.
The AVPO, AVPN, AIC and BIC values all favoured combination 10, and hence this was
finally selected as the best set of predictor variables for the Q10 model, which includes area
and design rainfall intensity Itc,10. Both posterior coefficients β1 and β2 were found to be 9
times the respective posterior standard deviation away from zero, suggesting that these two
variables are well established in the prediction equation. Indeed, based on similar findings,
combination 10 was selected for all the flood quantile prediction equations (ARIs = 2 – 100
years). The BPVs for the regression coefficients associated with the variables area and
design rainfall intensity Itc,ARI for the QRT over all the ARIs were found to be significant,
with values smaller than 0.01.
Figure 26 Selection of predictor variables for the BGLSR model for the Q10 model (note that R²GLS uses
the right-hand axis) (QRT, fixed region, NSW). MEV = model error variance, AVPO = average variance of
prediction (old), AVPN = average variance of prediction (new), AIC = Akaike information criterion, BIC
= Bayesian information criterion
Table 16 Summary of the catchment characteristics and statistical measures used in the stepwise regression for the parameters of the LP3 distribution for NSW
(Columns for each LP3 parameter: σ² | AVPO | AVPN | AIC | BIC | BPV (%) | R²GLS)

Mean (μ):
Comb.  Catchment characteristics^a   σ²     AVPO   AVPN   AIC    BIC    BPV (%)      R²GLS
1      Const                         0.92   0.94   0.92   1.22   1.22   0            0%
2      Const, area                   0.69   0.71   0.68   0.76   0.78   0, 0         39%
3      Const, area, 2I1              0.36   0.38   0.35   0.34   0.35   0, 0, 0      74%
4      Const, area, 50I1             0.34   0.36   0.34   0.38   0.40   0, 0, 0      70%
5      Const, area, 50I12            0.30   0.31   0.29   0.32   0.34   0, 0, 0      75%
6      Const, area, 2I12             0.28   0.30   0.28   0.31   0.32   0, 0, 0      76%
7      Const, area, S1085            0.63   0.66   0.62   0.70   0.74   0, 0, 0.4    45%
8      Const, area, sden             0.60   0.63   0.59   0.54   0.57   0, 0, 0.6    58%
9      Const, area, forest           0.69   0.72   0.68   0.78   0.82   0, 0, 60     39%
10     Const, area, evap             0.34   0.35   0.33   0.39   0.41   0, 0, 0.1    69%
11     Const, area, rain             0.29   0.31   0.29   0.31   0.33   0, 0, 0.1    76%
12     Const, rain, S1085            0.92   0.96   0.90   1.24   1.31   0, 37, 16    2%
13     Const, sden, S1085            0.91   0.94   0.89   1.15   1.21   0, 0.8, 82   9%
14     Const, evap, sden             0.88   0.92   0.86   1.05   1.11   0, 0.1, 36   18%
15     Const, forest                 0.91   0.94   0.90   1.17   1.21   0, 3         6%
16     Const, S1085, forest          0.91   0.95   0.89   1.18   1.24   0, 17, 2     7%

Standard deviation (σ):
Comb.  Catchment characteristics^a   σ²      AVPO   AVPN   AIC    BIC    BPV (%)      R²GLS
1      Const                         0.099   0.10   0.10   0.13   0.13   0            0%
2      Const, area                   0.098   0.10   0.10   0.13   0.13   0, 10        4%
3      Const, area, 2I1              0.097   0.10   0.10   0.13   0.13   0, 13, 19    6%
4      Const, area, 50I1             0.096   0.10   0.10   0.13   0.13   0, 10, 20    6%
5      Const, area, 50I12            0.094   0.10   0.09   0.12   0.13   0, 13, 10    8%
6      Const, area, 2I12             0.091   0.10   0.09   0.12   0.13   0, 14, 6     10%
7      Const, area, S1085            0.091   0.10   0.09   0.12   0.13   0, 29, 8     8%
8      Const, area, sden             0.099   0.10   0.10   0.13   0.14   0, 14, 58    4%
9      Const, area, forest           0.091   0.10   0.09   0.12   0.13   0, 5, 7      9%
10     Const, area, evap             0.098   0.10   0.10   0.13   0.13   0, 14, 26    6%
11     Const, area, rain             0.078   0.08   0.08   0.10   0.10   0, 40, 1     26%
12     Const, rain, S1085            0.066   0.07   0.07   0.09   0.09   0, 2, 1      35%
13     Const, sden, S1085            0.090   0.09   0.09   0.12   0.13   0, 60, 5     8%
14     Const, evap, sden             0.098   0.10   0.10   0.13   0.14   0, 27, 61    3%
15     Const, forest                 0.093   0.10   0.09   0.13   0.13   0, 11        4%
16     Const, S1085, forest          0.088   0.09   0.09   0.12   0.13   0, 7, 32     9%

Skewness (γ):
Comb.  Catchment characteristics^a   σ²       AVPO    AVPN    AIC     BIC     BPV (%)        R²GLS
1      Const                         0.0135   0.019   0.018   0.156   0.156   <0.1           0%
2      Const, area                   0.0132   0.021   0.021   0.080   0.082   <0.1, 3        50%
3      Const, area, 2I1              0.0131   0.025   0.024   0.079   0.083   <0.1, 3, 68    52%
4      Const, area, 50I1             0.0131   0.025   0.024   0.079   0.083   <0.1, 3, 72    52%
5      Const, area, 50I12            0.0132   0.025   0.024   0.080   0.084   <0.1, 3, 72    51%
6      Const, area, 2I12             0.0133   0.025   0.024   0.082   0.086   <0.1, 3, 86    50%
7      Const, area, S1085            0.0135   0.024   0.023   0.083   0.087   <0.1, 4, 92    49%
8      Const, area, sden             0.0134   0.024   0.023   0.083   0.088   <0.1, 4, 81    49%
9      Const, area, forest           0.0126   0.024   0.023   0.057   0.060   <0.1, 6, 7     65%
10     Const, area, evap             0.0133   0.026   0.025   0.076   0.080   <0.1, 2, 49    53%
11     Const, area, rain             0.0134   0.025   0.024   0.082   0.087   <0.1, 2, 87    49%
12     Const, rain, S1085            0.0140   0.025   0.025   0.148   0.156   0, 74, 87      10%
13     Const, sden, S1085            0.0139   0.025   0.024   0.140   0.148   0, 74, 51      14%
14     Const, evap, sden             0.0137   0.026   0.025   0.135   0.143   0, 50, 38      17%
15     Const, forest                 0.0127   0.021   0.020   0.078   0.080   0, 4           51%
16     Const, S1085, forest          0.0127   0.024   0.023   0.065   0.069   0, 17, 2       60%

a Const is a constant term. Refer to the text in Chapter 4 for a full description of the catchment characteristics predictor variables.
Table 17 Summary of the catchment characteristics and statistical measures used in the forward stepwise regression for the flood quantiles of the LP3
distribution (ARIs = 2, 10 and 100 years) for NSW
(Columns for each ARI: σ² | AVPO | AVPN | AIC | BIC | BPV (%) | R²GLS)

ARI = 2 years:
Comb.  Catchment characteristics^a                 σ²     AVPO   AVPN   AIC    BIC    BPV (%)               R²GLS
1      Const                                       0.94   0.96   0.94   1.26   1.26   0                     0%
2      Const, area                                 0.73   0.75   0.72   0.78   0.81   0, 0, 0               39%
3      Const, area, 2I1                            0.35   0.37   0.34   0.38   0.40   0, 0, 0               71%
4      Const, area, 2I12                           0.31   0.33   0.31   0.33   0.35   0, 0, 0               75%
5      Const, area, 50I1                           0.34   0.36   0.34   0.36   0.38   0, 0, 0               73%
6      Const, area, 50I12                          0.31   0.33   0.31   0.33   0.35   0, 0, 0               75%
7      Const, area, S1085                          0.74   0.77   0.73   0.80   0.85   0, 0, 69              39%
8      Const, area, sden                           0.66   0.69   0.65   0.72   0.76   0, 0, 0.3             45%
9      Const, area, sden, forest                   0.65   0.68   0.63   0.72   0.78   0, 0, 1, 9            46%
10     Const, area, Itc,ARI                        0.29   0.33   0.31   0.33   0.35   0, 0, 0               75%
11     Const, area, forest                         0.69   0.72   0.67   0.76   0.80   0, 0, 2               42%
12     Const, area, evap                           0.61   0.64   0.60   0.65   0.69   0, 0, 0.2             50%
13     Const, area, rain                           0.34   0.36   0.34   0.36   0.38   0, 0, 0.2             73%
14     Const, rain, S1085                          0.90   0.94   0.88   1.06   1.12   0, 0, 4               19%
15     Const, sden, S1085                          0.93   0.97   0.91   1.21   1.28   0, 15, 2              8%
16     Const, area, 50I12, S1085                   0.37   0.39   0.36   0.23   0.25   0, 0, 0, 40           83%
17     Const, area, 50I12, rain                    0.29   0.31   0.29   0.32   0.35   0, 0, 0, 0.4          76%
18     Const, area, 50I12, S1085, forest           0.37   0.39   0.36   0.25   0.28   0, 0, 0, 48, 79       72%
19     Const, area, 50I12, Itc,ARI, forest         0.37   0.39   0.35   0.22   0.25   0, 0, 15, 16, 70      74%
20     Const, area, 50I12, Itc,ARI, S1085, forest  0.37   0.40   0.35   0.24   0.28   0, 0, 15, 18, 70, 78  73%
21     Const, area, Itc,ARI, rain                  0.30   0.32   0.29   0.32   0.35   0, 0, 0, 2            76%
22     Const, area, Itc,ARI, evap                  0.32   0.34   0.31   0.34   0.37   0, 0, 0, 86           74%
23     Const, area, Itc,ARI, forest                0.37   0.39   0.36   0.23   0.25   0, 0, 0, 98           73%
24     Const, area, Itc,ARI, S1085                 0.37   0.39   0.36   0.23   0.25   0, 0, 0, 92           73%
25     Const, area, 2I1, Itc,ARI                   0.32   0.34   0.31   0.35   0.38   0, 0, 46, 0           74%

ARI = 10 years:
Comb.  Catchment characteristics^a                 σ²     AVPO   AVPN   AIC    BIC    BPV (%)               R²GLS
1      Const                                       0.89   0.91   0.89   1.16   1.16   0                     0%
2      Const, area                                 0.54   0.56   0.53   0.52   0.53   0, 0, 0               56%
3      Const, area, 2I1                            0.23   0.25   0.24   0.26   0.28   0, 0, 0               78%
4      Const, area, 2I12                           0.23   0.24   0.23   0.26   0.27   0, 0, 0               78%
5      Const, area, 50I1                           0.25   0.27   0.25   0.28   0.29   0, 0, 0               77%
6      Const, area, 50I12                          0.22   0.24   0.23   0.25   0.27   0, 0, 0               79%
7      Const, area, S1085                          0.54   0.57   0.53   0.52   0.55   0, 0, 34              56%
8      Const, area, sden                           0.46   0.49   0.46   0.55   0.58   0, 0, 0.2             55%
9      Const, area, sden, forest                   0.48   0.51   0.47   0.56   0.61   0, 0, 1, 90           54%
10     Const, area, Itc,ARI                        0.23   0.24   0.23   0.26   0.27   0, 0, 0               79%
11     Const, area, forest                         0.54   0.57   0.54   0.51   0.54   0, 0, 40              57%
12     Const, area, evap                           0.38   0.40   0.38   0.38   0.40   0, 0, 0               69%
13     Const, area, rain                           0.35   0.37   0.35   0.43   0.45   0, 0, 0               64%
14     Const, rain, S1085                          0.86   0.90   0.85   1.07   1.12   0, 6, 1               11%
15     Const, sden, S1085                          0.88   0.91   0.86   1.10   1.16   0, 25, 0.1            9%
16     Const, area, 50I12, S1085                   0.22   0.24   0.22   0.26   0.28   0, 0, 0, 35           79%
17     Const, area, 50I12, rain                    0.23   0.25   0.23   0.26   0.28   0, 0, 0, 22           79%
18     Const, area, 50I12, S1085, forest           0.21   0.24   0.22   0.25   0.28   0, 0, 0, 55, 75       80%
19     Const, area, 50I12, Itc,ARI, forest         0.21   0.24   0.21   0.25   0.28   0, 0, 22, 43, 70      80%
20     Const, area, 50I12, Itc,ARI, S1085, forest  0.22   0.24   0.22   0.26   0.30   0, 0, 23, 44, 95, 90  80%
21     Const, area, Itc,ARI, rain                  0.23   0.25   0.23   0.26   0.29   0, 0, 0, 76           78%
22     Const, area, Itc,ARI, evap                  0.23   0.25   0.23   0.26   0.29   0, 0, 0, 80           79%
23     Const, area, Itc,ARI, forest                0.22   0.24   0.22   0.25   0.27   0, 0, 0, 8            79%
24     Const, area, Itc,ARI, S1085                 0.23   0.25   0.23   0.26   0.29   0, 0, 0, 50           79%
25     Const, area, 2I1, Itc,ARI                   0.23   0.25   0.23   0.26   0.28   0, 0, 59, 1           79%

ARI = 100 years:
Comb.  Catchment characteristics^a                 σ²     AVPO   AVPN   AIC    BIC    BPV (%)               R²GLS
1      Const                                       0.87   0.89   0.87   1.21   1.21   0                     0%
2      Const, area                                 0.52   0.54   0.52   0.64   0.66   0, 0, 0               48%
3      Const, area, 2I1                            0.35   0.38   0.36   0.42   0.45   0, 0, 0               67%
4      Const, area, 2I12                           0.35   0.37   0.35   0.36   0.38   0, 0, 0               72%
5      Const, area, 50I1                           0.35   0.38   0.36   0.42   0.44   0, 0, 0               67%
6      Const, area, 50I12                          0.35   0.38   0.36   0.41   0.43   0, 0, 0               68%
7      Const, area, S1085                          0.52   0.55   0.52   0.65   0.69   0, 0, 63              48%
8      Const, area, sden                           0.49   0.52   0.49   0.63   0.66   0, 0, 0.5             50%
9      Const, area, sden, forest                   0.49   0.52   0.48   0.63   0.69   0, 0, 1, 20           51%
10     Const, area, Itc,ARI                        0.35   0.38   0.36   0.44   0.46   0, 0, 0               65%
11     Const, area, forest                         0.53   0.56   0.52   0.65   0.69   0, 0, 59              48%
12     Const, area, evap                           0.45   0.48   0.45   0.59   0.63   0, 0, 0.4             53%
13     Const, area, rain                           0.40   0.43   0.41   0.50   0.53   0, 0, 0.1             61%
14     Const, rain, S1085                          0.85   0.89   0.83   1.17   1.23   0, 36, 0.7            8%
15     Const, sden, S1085                          0.85   0.89   0.84   1.16   1.22   0, 27, 0.1            8%
16     Const, area, 50I12, S1085                   0.35   0.38   0.35   0.42   0.46   0, 0, 0, 62           67%
17     Const, area, 50I12, rain                    0.35   0.38   0.35   0.42   0.46   0, 0, 0, 28           67%
18     Const, area, 50I12, S1085, forest           0.35   0.38   0.35   0.33   0.37   0, 0, 0, 55, 79       75%
19     Const, area, 50I12, Itc,ARI, forest         0.34   0.38   0.35   0.33   0.37   0, 0, 10, 80, 90      75%
20     Const, area, 50I12, Itc,ARI, S1085, forest  0.35   0.39   0.35   0.36   0.42   0, 0, 27, 90, 95, 90  73%
21     Const, area, Itc,ARI, rain                  0.35   0.38   0.35   0.44   0.48   0, 0, 0, 81           66%
22     Const, area, Itc,ARI, evap                  0.35   0.39   0.36   0.45   0.49   0, 0, 0, 95           65%
23     Const, area, Itc,ARI, forest                0.35   0.38   0.35   0.40   0.43   0, 0, 0, 98           69%
24     Const, area, Itc,ARI, S1085                 0.35   0.38   0.35   0.45   0.49   0, 0, 0, 95           65%
25     Const, area, 2I1, Itc,ARI                   0.35   0.38   0.35   0.43   0.47   0, 0, 49, 0           67%
aConst is a constant term. Refer to text in Chapter 4 for a full description of the catchment characteristics predictor variables.
5.5 REGION OF INFLUENCE VS. FIXED REGIONS FOR PARAMETER
AND QUANTILE REGRESSION TECHNIQUES
5.5.1 REGRESSION DIAGNOSTICS – PSEUDO ANALYSIS OF VARIANCE
The pseudo analysis of variance (ANOVA) tables for the Q20 model and the parameters of
the LP3 distribution (only the mean and skew are shown due to space constraint) are presented
in Tables 18 to 20 for the fixed and ROI regions for NSW, VIC and QLD. The pseudo ANOVA
table describes how the total variation among the ŷi values (predicted values) can be
apportioned between that explained by the model error and the sampling error. This is an
extension of the ANOVA for OLSR, which does not recognise and correct for the expected
sampling variance (Reis et al., 2005). An error variance ratio (EVR) is used in the pseudo
ANOVA, defined as the ratio of the sampling error variance to the model error variance. An
EVR greater than 0.20 may indicate that the sampling variance is not negligible compared with
the model error variance, which suggests the need for a GLSR analysis (Gruber et al., 2007).
For the LP3 parameters, the sampling error (i.e. the EVR) increases as the order of the moment
increases, as can be clearly seen for all three states in Tables 18 and 19. For example,
for NSW the EVR for the mean flood model for the ROI is 0.3 (i.e. the sampling error is only 0.3
times the model error) (see Table 18), while the corresponding EVR value for the skew model
(Table 19) is 18 (i.e. the sampling error is 18 times the model error). The ROI shows a
reduced model error variance for all three states (i.e. a reduced heterogeneity), in
particular for the mean flood model, as compared to the fixed regions. For example, for the
NSW state (Table 18) the model error variances for the fixed region and ROI are 27.7 and 16.5,
respectively. It was found that the model error dominated the regional analysis for the mean
flood and the standard deviation models (results not shown) for both the fixed regions and
the ROI for all the states.
For the ROI, the mean flood model also shows a much higher model error variance than
those of the standard deviation and skew models. These results, based on the model error
variance alone, indicate that the mean flood has a greater level of heterogeneity associated
with its regionalisation as compared to the standard deviation and skew. The ROI, however,
shows a higher EVR than the fixed regions; e.g. for the mean flood model for NSW, the EVR
is 0.30 for the ROI and 0.17 for the fixed region (see Table 18). Table 18 also provides the
EVR results for the VIC and QLD states, which show a similar outcome to NSW. For the
standard deviation model for NSW the EVR is 0.77 for the ROI and 0.35 for the fixed region;
again, similar results were found for the VIC and QLD states.
The EVR values for the skew models of NSW, VIC and QLD are shown in Table 19. It can
be observed that these EVR values range from 8.4 to 19 for the fixed regions and from 9.5 to
19 for the ROI, which are much higher than the recommended limit of 0.20. In this regard,
two important points may be noted:
(i) This result clearly indicates that the GLSR is the preferred modelling option over
the OLSR for the skew model. An OLSR model for the skew would have clearly
given misleading results, as it does not distinguish between the model and
sampling errors, as found in similar previous studies (e.g. Reis et al., 2005 and
Haddad et al., 2010b).
(ii) Importantly, if a method of moments estimator had been used to estimate the
model error variance (σ²) for the skew model, the model error variance would
have been grossly underestimated, as the sampling error heavily dominated the
regional analysis. A more reasonable estimate of the model error variance has
been achieved with the Bayesian procedure, as it represents the values of σ² by
computing expectations over the entire posterior distribution. Similar results
were found by Reis et al. (2005), Gruber and Stedinger (2008) and Haddad et al.
(2010b). As far as the ROI approach is concerned, there is little change in the
EVR values as compared to the fixed region approach for all three states, as the
skew model tends to include more stations in the regional analysis.
Table 18 Pseudo ANOVA table for the mean flood model (PRT, fixed region and ROI,
NSW, VIC and QLD states). Here n = number of sites in the region, k = number of predictors
in the regression equation, EVR = error variance ratio, σ²(0) = model error variance when no
predictor variable is used in the regression model, σ² = model error variance when predictor
variables are used in the regression model and tr[Λ(ŷ)] = sum of the diagonals of the
sampling covariance matrix.

NSW
Source           DF (fixed)        DF (ROI)          Sum of squares     Fixed region   ROI
Model            k = 3             k = 3             n(σ²(0) − σ²)      61.5           61.2
Model error      n − k − 1 = 92    n − k − 1 = 32    n(σ²)              27.7           16.5
Sampling error   n = 96            n = 36            tr[Λ(ŷ)]           5              4.5
Total            2n − 1 = 191      2n − 1 = 71       sum of the above   94             83
EVR                                                                     0.17           0.3

VIC
Model            k = 3             k = 3                                46             45
Model error      n − k − 1 = 127   n − k − 1 = 39                       37.5           28
Sampling error   n = 131           n = 43                               6.1            6
Total            2n − 1 = 261      2n − 1 = 85       sum of the above   90             79
EVR                                                                     0.16           0.2

QLD
Model            k = 3             k = 3                                105            102
Model error      n − k − 1 = 168   n − k − 1 = 34                       39             22
Sampling error   n = 172           n = 38                               10.2           9
Total            2n − 1 = 343      2n − 1 = 75       sum of the above   155            133
EVR                                                                     0.26           0.40

Table 19 Pseudo ANOVA table for the skew model (PRT, fixed region and ROI, NSW, VIC
and QLD states) (variables are explained in the Table 18 caption)

NSW
Source           DF (fixed)        DF (ROI)          Fixed region   ROI
Model            k = 3             k = 3             0.1            0.1
Model error      n − k − 1 = 92    n − k − 1 = 91    1.22           1.21
Sampling error   n = 96            n = 95            24             23
Total            2n − 1 = 191      2n − 1 = 189      25             23
EVR                                                  19             18

VIC
Model            k = 3             k = 3             6.5            7.3
Model error      n − k − 1 = 127   n − k − 1 = 113   4.5            3.7
Sampling error   n = 131           n = 117           38             35
Total            2n − 1 = 261      2n − 1 = 233      49             48
EVR                                                  8.4            9.5

QLD
Model            k = 3             k = 3             0.11           0.65
Model error      n − k − 1 = 168   n − k − 1 = 146   2.6            2.1
Sampling error   n = 172           n = 150           45             40
Total            2n − 1 = 343      2n − 1 = 299      48             43
EVR                                                  17             19
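The EVR reported in Tables 18 and 19 can be reproduced directly from the pseudo ANOVA sums of squares. The sketch below is illustrative only, not the thesis code; with the NSW mean-flood entries of Table 18, the fixed-region sums of squares of 5 (sampling) and 27.7 (model error) give an EVR of about 0.18, matching the tabulated 0.17 to within rounding.

```python
def error_variance_ratio(sampling_ss, model_error_ss):
    """EVR = sampling error variance / model error variance, computed
    from the pseudo ANOVA sums of squares tr[Lambda(y_hat)] and n*sigma^2."""
    return sampling_ss / model_error_ss

# NSW mean flood model (Table 18)
evr_fixed = error_variance_ratio(5.0, 27.7)  # ~0.18 (tabulated: 0.17)
evr_roi = error_variance_ratio(4.5, 16.5)    # ~0.27 (tabulated: 0.3)

# An EVR above ~0.20 signals that the sampling variance is not
# negligible, favouring a GLSR analysis over OLSR
needs_glsr = evr_roi > 0.20
```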
The pseudo ANOVA tables were also prepared for all the flood quantile models (i.e. the
QRT models). The results for Q20 for all three states are shown in Table 20. Here the ROI
shows higher EVR values than the fixed region. Also, the sampling error generally
increases with increasing ARI. The reduction in the model error variance seen in Table 20
for all three states is due to the fact that the ROI found an optimum number of sites based
on the minimum model error variance, which generally uses fewer sites than the fixed
region approach. This indeed suggests that sub-regions may exist within the larger states.
The flood quantile Q2 was found to have the lowest EVR values for NSW and QLD for
both the fixed region and ROI as compared to the Q20 and Q100 model results. This reflects
the much greater spatial variability of the mean, which is dominated by local catchment
factors (as compared to the higher moments); this is reflected in the Q2 flood as it is very
close to the mean flood magnitude. The Q20 model shows an EVR of 0.43, 0.30 and 0.97
for NSW, VIC and QLD, respectively (see Table 20) for the ROI approach, which suggests
that the BGLSR combined with the ROI should be the preferred option when modelling the
larger ARI quantiles, even though in this particular case the ROI has been impacted by the
relatively large model error variances that have dominated the regional flood quantile
modelling results.
Table 20 Pseudo ANOVA table for the Q20 model (QRT, fixed region and ROI for NSW, VIC
and QLD states) (variables are explained in the Table 18 caption)

NSW
Source           DF (fixed)        DF (ROI)          Fixed region   ROI
Model            k = 3             k = 3             61.1           61.1
Model error      n − k − 1 = 92    n − k − 1 = 48    23.5           17.3
Sampling error   n = 96            n = 52            7.6            7.0
Total            2n − 1 = 191      2n − 1 = 103      92             86
EVR                                                  0.32           0.43

VIC
Model            k = 3             k = 3             45.2           45.2
Model error      n − k − 1 = 127   n − k − 1 = 48    55.2           24.4
Sampling error   n = 131           n = 52            7.4            7.2
Total            2n − 1 = 261      2n − 1 = 103      108            77
EVR                                                  0.13           0.30

QLD
Model            k = 3             k = 3             59             46
Model error      n − k − 1 = 168   n − k − 1 = 77    25             12
Sampling error   n = 172           n = 81            13             12
Total            2n − 1 = 343      2n − 1 = 161      97             70
EVR                                                  0.53           0.97

5.5.2 REGRESSION DIAGNOSTICS – MODEL ADEQUACY AND OUTLIER
ANALYSIS
To assess the underlying model assumptions (i.e. the normality of the residuals), the plots of
the standardised residuals [Equation (3.42)] vs. fitted quantiles were examined for all the flood
quantiles (estimated from the QRT and PRT) and the parameters of the LP3 distribution for all
three states. The predicted values were obtained from the LOO validation procedure.
Figure 27 shows the plot for the Q20 model for the state of NSW.

Figure 27 Plots of the standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT,
fixed region and ROI, NSW)
If the underlying model assumptions are satisfied to a large extent, the standardised residual
values should not exceed the ± 2 limits; in practice, 95% of the standardised residuals should
fall between ± 2. The result in Figure 27 reveals that the flood quantiles developed from the
prediction equations satisfy the normality-of-residuals assumption quite satisfactorily for both
the fixed region and ROI approaches. Also, no specific pattern (heteroscedasticity) can be
identified in the standardised residuals, which are almost equally distributed below and above
zero. What is noteworthy is that the ROI clearly produces fewer genuine outliers, for the
quantiles estimated by both the QRT and PRT methods, than the fixed region approach. This
indeed demonstrates the superiority of the ROI over the fixed region approach. Similar
results were observed for the states of VIC and QLD. The figures associated with VIC and
QLD can be seen in Appendix B.
The QQ-plots of the standardised residuals [Equation (3.42)] vs. normal score [Equation
(3.43)] for the fixed region (based on LOO validation) and ROI were then examined. The
results for the Q20 model for NSW are shown in Figure 28, which reveals that all the points
closely follow a straight line; this is especially noticeable for the ROI approach for both the
QRT and PRT methods. This indicates that the assumption of normality and the homogeneity
of variance of the standardised residuals are better approximated with the ROI approach.
Overall, no genuine outliers can be detected for the flood quantiles estimated by the QRT and
PRT on a regional scale.
Figure 28 QQ-plot of the standardised residuals vs. normal (Z) score for ARI of 20 years (QRT and PRT,
fixed region and ROI, NSW)
If the standardised residuals are indeed normally and independently distributed N(0, 1), with
mean 0 and variance 1, then the slope of the best-fit line in the QQ-plot, which can be
interpreted as the standard deviation of the normal score (Z score) of the quantile, should
approach 1, and the intercept, which is the mean of the normal score of the quantile, should
approach 0 as the number of sites increases. Figure 28 indeed shows that the fitted lines for
the developed models pass approximately through the origin (0, 0) and have a slope
approximately equal to one. It can be seen that the results of the ROI approach satisfy the
model assumptions relatively better than those of the fixed region approach; the superiority
of the ROI approach is again demonstrated here. Similar results were observed for the VIC
and QLD states; the figures associated with VIC and QLD can be seen in Appendix B. The
assumption of the normality of the residuals for all three states (NSW, VIC and QLD)
could not be rejected at the 10% level of significance using the Anderson-Darling and
Kolmogorov-Smirnov tests for normality.
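The slope/intercept check described above is straightforward to carry out numerically. The sketch below is illustrative only, using synthetic N(0, 1) residuals in place of the thesis data; the normal scores are computed with the Blom plotting position, one common choice.

```python
import random
from statistics import NormalDist, mean

def qq_line(residuals):
    """Fit normal score = a + b * standardised residual by least squares.
    Normal scores use the Blom plotting position (i - 3/8) / (n + 1/4)."""
    n = len(residuals)
    xs = sorted(residuals)
    nd = NormalDist()
    ys = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

random.seed(1)
resid = [random.gauss(0.0, 1.0) for _ in range(400)]
intercept, slope = qq_line(resid)  # expect intercept near 0, slope near 1
```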
Presented below are the residual analysis results of the ROI method for the PRT using
weighted regional average standard deviation and skew values, which are weighted by the
error covariance matrix (i.e. no predictor variables are considered in the regression equation
in this case), for the state of NSW (as an example). The main aim of this analysis is to
determine whether there is any appreciable loss in accuracy and efficiency, especially in the
flood quantile estimation of the mid to higher ARIs (i.e. 20 to 100 years), when using a
weighted regional average standard deviation and skew (obtained as above) as compared to
ones with predictor variables. It should be stressed here that this weighted regional average
standard deviation and skew do vary from site to site, as each site has a unique ROI.
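A minimal sketch of the weighted regional averaging described above, under the simplifying assumption of a diagonal error covariance matrix (so each at-site estimate is weighted by the inverse of its error variance; the thesis uses the full covariance matrix, and the numbers here are invented):

```python
def inverse_variance_average(estimates, error_variances):
    """Weighted regional average with weights 1/var_i (a diagonal
    approximation of weighting by the error covariance matrix)."""
    weights = [1.0 / v for v in error_variances]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

# Sites with smaller error variance pull the average towards themselves
avg_skew = inverse_variance_average([-0.2, -0.5], [0.01, 0.04])  # -> -0.26
```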
The standardised residuals vs. fitted quantile plot for Q20 is shown in Figure 29, which
superimposes the estimates made by the QRT-ROI, the PRT-ROI and the PRT-ROI that uses a
weighted regional average standard deviation and skew estimate. Indeed, one can observe
that the PRT-ROI estimate of Q20 with the weighted regional average standard deviation and
skew performs equally as well as the competing models. Nearly all the standardised residuals
fall within the ± 2 limits, suggesting that the use of predictor variables in the estimation of the
standard deviation and skew does not really add much meaningful information to the
analysis. The QQ-plot (Figure 30) of the competing models shows that the use of a weighted
regional average standard deviation and skew does not result in any major gross errors in the
final quantile estimates. The residual analysis also reveals that the major assumptions of the
regression have been largely satisfied (i.e. normality of the residuals). The results based on
the evaluation statistics are given in section 5.5.4.
Figure 29 Plots of the standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT,
ROI and PRT-ROI with weighted average standard deviation and skew, NSW)
Figure 30 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, ROI, and
PRT ROI with weighted average standard deviation and skew, NSW)
5.5.3 DIAGNOSTIC STATISTICS
The summary of the various regression diagnostics (as described in section 3.8 and Equation
(3.41)) is provided in Table 21 for the NSW, VIC and QLD states. This shows that, for the
mean flood model (for all three states), the MEV and SEP are much higher than those of the
standard deviation and skew models. This indicates that the mean flood model exhibits a
higher degree of heterogeneity than the standard deviation and skew models, which supports
the pseudo ANOVA results. Indeed, the issue here is that the sampling error becomes larger
as the order of the moment increases; therefore, in the case of the skew, the spatial variation
is a second-order effect (compared to the sampling variability) that is not really detectable.
This is apparent in both the fixed region and ROI cases.
Table 21 Regression diagnostics for the fixed region and ROI for NSW, VIC and QLD
Model   Fixed region                           ROI
        MEV     AVP     SEP (%)   R²GLS (%)    MEV     AVP     SEP (%)   R²GLS (%)

NSW
Mean    0.29    0.31    60        76           0.19    0.23    51        84
Stdev   0.058   0.062   25        37           0.046   0.054   23        46
Skew    0.013   0.024   16        65           0.013   0.023   16        65
Q2      0.31    0.33    63        77           0.20    0.24    52        84
Q5      0.23    0.24    52        79           0.16    0.20    47        85
Q10     0.23    0.24    52        79           0.16    0.20    46        85
Q20     0.25    0.27    55        76           0.18    0.22    49        83
Q50     0.35    0.37    66        70           0.25    0.28    56        74
Q100    0.35    0.38    68        65           0.29    0.34    63        70

VIC
Mean    0.29    0.31    60        62           0.21    0.23    46        63
Stdev   0.044   0.049   22        65           0.041   0.050   21        65
Skew    0.034   0.040   20        70           0.028   0.037   19        73
Q2      0.27    0.28    57        63           0.20    0.23    51        65
Q5      0.29    0.31    60        61           0.20    0.23    50        64
Q10     0.35    0.37    67        57           0.23    0.26    54        61
Q20     0.35    0.37    67        57           0.19    0.22    48        66
Q50     0.47    0.49    80        49           0.27    0.32    61        61
Q100    0.59    0.60    91        45           0.29    0.35    64        54

QLD
Mean    0.23    0.24    52        77           0.14    0.15    40        78
Stdev   0.13    0.14    38        34           0.056   0.061   24        46
Skew    0.015   0.024   16        44           0.014   0.026   16        44
Q2      0.26    0.27    56        75           0.15    0.18    43        79
Q5      0.17    0.18    44        79           0.08    0.11    34        83
Q10     0.18    0.19    45        74           0.07    0.11    33        79
Q20     0.15    0.16    41        77           0.07    0.13    36        80
Q50     0.17    0.19    45        72           0.10    0.14    39        77
Q100    0.20    0.22    49        72           0.12    0.16    40        73
For the mean flood model (all three states), the ROI shows a smaller MEV than the fixed region analysis. The lower MEV in turn also yields the lower AVP values, as can be seen in Table 21. Also, the R²GLSR values for the mean flood model with the ROI case are 8%, 1% and 1% higher than the fixed region for NSW, VIC and QLD, respectively. These results indicate that the ROI should be preferred over the fixed region for developing the mean flood model.
For the standard deviation model, the ROI shows a 2% smaller SEP and a 9% higher R²GLSR value for NSW. The best result is found for QLD, where the ROI shows a 14% smaller SEP and a 12% higher R²GLSR value. This indicates that the ROI is preferable to the fixed region for the standard deviation model. The SEP and R²GLSR values for the skew model are the same for the fixed region and ROI for NSW and QLD (see Table 21). This can be explained by the fact that the number of sites for the skew model in the ROI approach was very close to that of the fixed region approach.
Interestingly, Table 21 shows that the SEP values for all the flood quantile models for NSW, VIC and QLD are 5% to 11%, 6% to 27% and 5% to 13% smaller, respectively, for the ROI case than for the fixed region one. Also, the R²GLSR values for the ROI case for NSW, VIC and QLD are 4% to 7%, 2% to 12% and 1% to 5% higher, respectively, than for the fixed region case. These results show the relative advantage of the ROI approach coupled with BGLSR over a fixed region BGLSR, with further improvements achieved overall.
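The AVP and SEP values reported in Table 21 are directly linked. A minimal sketch is given below, assuming the relation SEP = 100·√(exp(AVP) − 1), which is commonly used in GLSR-based RFFA studies for moments estimated in natural-log space; it reproduces the Table 21 values, suggesting it is the form behind the diagnostics of section 3.8 (the function name is illustrative):

```python
import math

def sep_percent(avp: float) -> float:
    """Standard error of prediction (%) from the average variance of
    prediction (AVP) of a natural-log-space regression:
    SEP = 100 * sqrt(exp(AVP) - 1)."""
    return 100.0 * math.sqrt(math.exp(avp) - 1.0)

# Spot-check against Table 21 (NSW, fixed region): AVP -> SEP (%)
for avp, sep_table in [(0.31, 60), (0.062, 25), (0.024, 16)]:
    print(f"AVP = {avp:5.3f}  ->  SEP = {sep_percent(avp):4.1f}%  (table: {sep_table}%)")
```

Running this reproduces the tabulated SEP values to the nearest percent, which is a useful consistency check when transcribing regression diagnostics.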
Table 22 shows the number of sites in a region, the associated MEVs and their percentage (%) differences for the ROI against the fixed region models for NSW, VIC and QLD. The ROI mean flood model for all three states uses fewer sites on average (36 out of 96, i.e. 37% of the available sites for NSW, 32% for VIC and 24% for QLD) than the standard deviation and skew models. The ROI skew model for each state has the highest number of sites, including nearly all the sites in the respective states. The MEVs for all the flood quantile ROI models are smaller than those of the fixed region models, with differences of up to about 60%. This shows that the fixed region models experience greater heterogeneity than the ROI. If the fixed region models are made too big, the model error is likely to be inflated by heterogeneity unaccounted for by the catchment characteristics predictor variables. Two important points should be noted here: spatial proximity (physical distance) may become a surrogate for unknown processes in regional flood frequency analysis (RFFA), and the catchment characteristics variables available at the regional scale may not always be sufficient indicators of regional flood behaviour. In fact, these regional models are simplistic in their form, predictor variables and data representation; there is a good deal of lumping and approximation involved, along with many simplifying assumptions. Hence, regional flood models can never be highly accurate within the current modelling and data regime.
Table 22 Model error variances (σ̂²) associated with the fixed region and ROI for NSW, VIC and QLD (n = number of sites needed for the LP3 parameters and flood quantiles)

NSW                   Mean    Stdev   Skew    Q2      Q5      Q10     Q20     Q50     Q100
ROI n                 36      47      95      31      42      48      52      53      55
ROI σ̂²                0.19    0.046   0.013   0.20    0.16    0.16    0.18    0.25    0.29
Fixed region n        96      96      96      96      96      96      96      96      96
Fixed region σ̂²       0.29    0.058   0.013   0.21    0.23    0.23    0.25    0.35    0.35
(%) diff in σ̂²        34%     21%     0%      5%      30%     30%     28%     29%     17%

VIC
ROI n                 43      83      117     41      45      52      52      57      57
ROI σ̂²                0.21    0.041   0.028   0.20    0.20    0.23    0.19    0.27    0.29
Fixed region n        131     131     131     131     131     131     131     131     131
Fixed region σ̂²       0.29    0.044   0.034   0.27    0.29    0.35    0.35    0.47    0.59
(%) diff in σ̂²        28%     7%      18%     26%     31%     34%     46%     43%     51%

QLD
ROI n                 42      65      150     60      65      74      80      88      90
ROI σ̂²                0.15    0.056   0.014   0.14    0.08    0.07    0.07    0.10    0.12
Fixed region n        172     172     172     172     172     172     172     172     172
Fixed region σ̂²       0.23    0.14    0.015   0.26    0.17    0.18    0.15    0.17    0.20
(%) diff in σ̂²        35%     60%     7%      46%     53%     61%     53%     41%     40%
Figure 31 plots the spatial variation of the MEVs (grouped in classes according to the numerical values specified in the legend) for the mean flood model (Figure 31a), and how the MEV varies with the number of sites within the ROI for a typical site (Figure 31b), for the state of NSW. The plot reveals the relative advantage of the ROI approach: there are distinct spatial variations illustrating the heterogeneity of the mean flood model that would often be ignored in a fixed region approach. Similar results were observed in both the VIC and QLD states.
The spatial variation in the model error for the skew model mostly covers the entire study area (figure not shown) for NSW, VIC and QLD. Similar results were found by Hackelbusch et al. (2009). The significance of this finding is that if any spatial variations exist in the hydrologic statistic of interest, they are most likely to be captured by the ROI.
Figure 31 Spatial variations of the grouped minimum model error variances for (a) mean flood model
and (b) number of sites which produced the lowest predictive variance for the mean flood model
5.5.4 EVALUATION STATISTICS
An objective assessment of the developed models can be made using the numerical evaluation statistics given in Equations (3.45) and (3.44), in which RMSEr is the relative root mean squared error and REr is the absolute median relative error. The RMSEr is associated with the predictive error variance, whereas REr relates mostly to prediction bias. Using the model predicted flood quantiles (estimated by QRT and PRT, with fixed and ROI regions) from the LOO validation, the evaluation statistics were calculated. These are given in Table 23.
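In terms of computation, the two statistics can be sketched as follows (a minimal illustration assuming the usual forms behind Equations (3.44) and (3.45) — relative errors of predicted quantiles with respect to observed at-site quantiles; function and variable names are illustrative):

```python
import numpy as np

def rmse_r(q_pred, q_obs):
    """Relative root mean squared error (%): sensitive to predictive variance."""
    rel = (np.asarray(q_pred) - np.asarray(q_obs)) / np.asarray(q_obs)
    return 100.0 * np.sqrt(np.mean(rel ** 2))

def re_r(q_pred, q_obs):
    """Absolute median relative error (%): mostly reflects prediction bias."""
    rel = np.abs(np.asarray(q_pred) - np.asarray(q_obs)) / np.asarray(q_obs)
    return 100.0 * np.median(rel)

# Toy example: predictions that are uniformly 20% high
q_obs = np.array([100.0, 250.0, 400.0])
q_pred = np.array([120.0, 300.0, 480.0])
print(rmse_r(q_pred, q_obs), re_r(q_pred, q_obs))  # both ≈ 20%
```

Because REr uses the median of absolute relative errors, it is less affected by a few badly predicted sites than RMSEr, which is why the two statistics separate bias from accuracy in Table 23.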
Numerical values of these statistics show the relative advantage of the ROI approach (for both the QRT and PRT) for all three states (i.e. NSW, VIC and QLD). The flood quantile estimates obtained from the fixed regions (QRT and PRT) are more biased (i.e. higher REr) and of lower accuracy (i.e. higher RMSEr) in all three states.
[Figure 31 map panels (a) and (b) cover New South Wales, the Australian Capital Territory and Victoria; the legend groups the minimum model error variances into the classes MEV = 0–0.11, 0.12–0.16, 0.17–0.19, 0.20–0.21 and ≥ 0.24.]
Table 23 Evaluation statistics (RMSEr and REr) from LOO validation for NSW (results for PRT using the weighted regional average standard deviation and skew models, i.e. no predictor variables, given in brackets), VIC and QLD

                 RMSEr (%)                          REr (%)
         PRT                QRT                PRT                QRT
         Fixed    ROI       Fixed    ROI       Fixed    ROI       Fixed    ROI
NSW
Q2       73       62 (63)   68       59        46       38 (37)   44       40
Q5       65       54 (59)   70       59        37       30 (32)   38       36
Q10      67       56 (60)   74       55        37       29 (33)   37       36
Q20      72       57 (63)   83       53        36       34 (34)   35       31
Q50      81       70 (77)   100      67        38       34 (35)   36       32
Q100     90       75 (85)   100      72        40       36 (39)   38       35
VIC
Q2       56       55        77       68        38       37        37       37
Q5       69       68        87       68        38       36        35       35
Q10      82       80        107      69        37       37        36       35
Q20      96       92        112      74        41       40        38       33
Q50      115      110       113      95        41       40        41       40
Q100     130      127       140      120       46       45        44       44
QLD
Q2       82       69        61       56        39       35        39       39
Q5       68       60        48       44        33       34        34       32
Q10      69       60        52       47        34       30        32       31
Q20      72       65        50       44        35       33        31       29
Q50      78       68        53       49        37       36        32       31
Q100     85       79        58       53        41       40        36       31
For the QRT and PRT (fixed region), Table 23 shows that there is not much difference in accuracy (RMSEr) for the NSW, VIC and QLD states. In relation to bias (REr), the QRT and PRT fixed region models were also found to be very similar for the three states.
For the QRT and PRT (ROI region), a similar result was found: there was no notable difference in accuracy (RMSEr) between the competing models, and both the QRT-ROI and PRT-ROI models achieved very similar bias (REr) values, as seen in Table 23. While Table 23 does show slightly better accuracy and bias for the QRT over the PRT, a point needs to be brought out to clarify this result. There is some underlying bias involved in the validation of the QRT (fixed and ROI) in that the predicted quantiles are being compared to the quantiles used in the regression analysis as dependent variables. Thus the result mostly appears to be slightly in favour of the QRT (see Table 23). How to compensate for this bias in the validation process needs further effort, which has not been attempted in this thesis. On the other hand, the validation procedure for the PRT is more stringent in that the parameters of the distribution are used in the regression and the quantiles are then independently estimated and compared to the at-site flood quantiles. The results from the evaluation statistics therefore indicate that the PRT is indeed a viable approach for RFFA, as an alternative to the commonly applied QRT method, in the ungauged catchment application.
Below, the results based on the evaluation statistics (i.e. Equations (3.45) and (3.44)) are presented to compare the flood quantiles from the PRT-ROI using a weighted regional average standard deviation and skew with those from the PRT-ROI using a standard deviation and skew expressed as a function of predictor variables, for the state of NSW. The evaluation statistics (see Table 23, values in brackets) from the validation reveal that there is no real loss of accuracy (as compared to at-site flood quantiles) if a weighted regional average standard deviation and skew model is adopted to estimate the flood quantiles up to the 20 years ARI.
The results at the higher ARIs (50 and 100 years) show that using a weighted regional average standard deviation and skew may slightly degrade the outcome of the analysis (i.e. lower accuracy and greater bias). Estimation at the larger ARIs may require further information, which may be provided by including predictor variables (such as catchment area, design rainfall intensity, forest and mean annual rainfall) in the standard deviation model, as found in this study. This issue deserves further investigation before larger ARI flood quantiles are estimated from weighted average standard deviation and skew estimates that do not use any predictor variables.
The evaluation statistics presented above relate to a particular aspect of the model validation over all six ARIs for all three states. It is now worth looking at the overall performance of the different models (QRT and PRT, with fixed and ROI regions) based on a ratio statistic and a 'case score analysis'. The ratio is defined as Qpred/Qobs (i.e. rr) and gives an indication of the degree of bias (i.e. systematic over- or underestimation), where a value of 1 indicates good 'average' agreement between Qpred and Qobs. Here the Qpred values were
obtained from the LOO validation (fixed and ROI) using the developed QRT or PRT model. The distributions of the Qpred/Qobs ratio values for the state of NSW are shown in Figure 32 for the 5, 20 and 100 years ARIs. For the 5 years ARI, the PRT-ROI shows the best results, as the median ratio is the closest to the line corresponding to Qpred/Qobs = 1 (the 1-line) and the overall spread of the ratio values is the smallest. For the 20 years ARI, the QRT-ROI median ratio is closer to the 1-line than the PRT-ROI; however, the overall spread of the ratio values for the QRT-ROI and PRT-ROI is very similar. For the 100 years ARI, the QRT-ROI shows noticeable overestimation and the PRT-ROI shows some underestimation, as its median ratio value is located just below the 1-line.
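The ratio diagnostic summarised in Figure 32 can be sketched as follows (illustrative only — in the thesis the Qpred values come from the LOO validation of the QRT or PRT model; function name and toy values are invented):

```python
import numpy as np

def ratio_summary(q_pred, q_obs):
    """Summarise the rr = Qpred/Qobs ratios: a median near 1 indicates low
    bias on average; the interquartile range (IQR) measures the spread of
    the ratios (the 'box' of the boxplot)."""
    rr = np.asarray(q_pred) / np.asarray(q_obs)
    q25, q50, q75 = np.percentile(rr, [25, 50, 75])
    return {"median": q50, "iqr": q75 - q25}

# Toy example: four sites with predictions close to the observed quantiles
rr_stats = ratio_summary([95.0, 110.0, 210.0, 390.0],
                         [100.0, 100.0, 200.0, 400.0])
print(rr_stats)
```

A full boxplot per ARI and per model, as in Figure 32, is simply this summary (plus whiskers and outliers) computed over all validation catchments.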
Figure 32 Boxplots of Qpred/Qobs ratios for NSW for QRT and PRT, with fixed and ROI regions
Considering all three states, a case score analysis of the Qpred/Qobs ratio values is presented below. The criteria for the case score analysis are given in Chapter 3, section 3.9. The models are assessed on the basis of which one receives the most desirable estimation on average over all the cases (i.e. 6 ARIs and 399 catchments, giving 2394 cases in total for each of the PRT and QRT, combining NSW, VIC and QLD). Based on the criteria set out in section 3.9, out of the 2394 cases, the QRT and PRT with fixed region produce 1881 and 1829 cases respectively with a 'desirable estimation', equivalent to 78% and 76% of the cases. The QRT and PRT fixed region show a 'gross underestimation' in 11% and 13% of cases, respectively, while a 'gross overestimation' occurs in 11% of the cases for each.
The QRT-ROI and PRT-ROI methods provide 83% and 80% of cases with a 'desirable estimation'. A 'gross underestimation' is associated with 9% of cases for both the QRT and PRT, while a 'gross overestimation' occurs for 8% and 11% of the cases for the QRT-ROI and PRT-ROI, respectively. It can be seen that in both the fixed and ROI regions there are cases where the results do not have a very high degree of accuracy. Such results are typical of RFFA methods (see Rahman, 2005) and are somewhat expected given the simplistic nature of RFFA models, which involve many simplified assumptions. For example, the addition of a greater number of predictor variables and/or the use of a more complex model form may increase accuracy marginally, but such gains are generally not significant as far as the practical application of RFFA methods is concerned (e.g. see Rahman et al., 1999a). Also, the error in the at-site flood frequency analysis estimates (which form the base case for comparison) needs to be kept in perspective. While improvements are seen in the ROI approach for the QRT and PRT, there remain a few cases where the estimates are not of high accuracy. Further investigation is needed to identify the reasons for such a high degree of error, which has not been done in this thesis. On average, however, only modest differences can be found between the QRT-ROI and PRT-ROI estimates for the majority of the cases (see Table 23).
In examining the cases where most of the 'gross overestimation' and 'gross underestimation' occurred, it was found that the PRT in some cases underestimated the at-site flood quantiles for the larger ARIs (50 and 100 years). Interestingly, it was also found that the QRT in many cases overestimated the lower ARI (2 and 5 years) at-site flood quantiles. These results were found for a range of catchment sizes across all the states.
What can be concluded overall from this evaluation is that the PRT does not provide less accurate estimates than the commonly applied QRT method. In fact, the PRT is a useful way to check the results from the QRT to make sure the estimates make sense, especially where the QRT results may not increase smoothly with ARI.
5.6 SECTION SUMMARY
The main objective of sections 5.4 and 5.5 was to compare BGLSR approaches in fixed region and ROI frameworks that seek to minimise the Bayesian model error variance (predictive uncertainty). For this purpose, data from 452 small to medium sized catchments in eastern Australia (covering the Tasmania, VIC, NSW and QLD states) were used. Prediction equations were developed for the flood quantiles of ARIs of 2 to 100 years using the QRT
and for the first three moments of the LP3 distribution (i.e. PRT). Using a method similar to
forward stepwise regression and adopting a number of statistical selection criteria it was
possible to identify the optimal regression models to use in the ROI approach.
It was found that area and design rainfall intensity were significant predictors for the
estimation of the flood quantiles in these states using QRT, while area, design rainfall
intensity, mean annual evaporation, mean annual rainfall, main stream slope and forest were
relatively significant in the estimation of the second and third parameters of the LP3
distribution. LOO validation indicated that the ROI approach, based on the minimisation of the predictive uncertainty, leads to more efficient and accurate flood quantile estimates for both the QRT and PRT. The regression diagnostics revealed that the catchment variables alone may not capture all the heterogeneity in the regional model. Both the BGLSR QRT-ROI and BGLSR PRT-ROI showed improvements in regional heterogeneity, with an increase in the average pseudo coefficient of determination and a decrease in the model error variance, average variance of prediction and average standard error of prediction.
Both the standardised residual and QQ-plots for the ROI approach satisfied the underlying regression model assumptions better than those for the fixed region. It was shown that both the BGLSR
QRT-ROI and BGLSR PRT-ROI produce smaller average RMSEr and REr values when
compared to the fixed region regression approach. Based on the evaluation statistics overall it
was found that there are only modest differences between the BGLSR QRT-ROI and BGLSR
PRT-ROI which suggests that the PRT is a viable alternative to QRT in RFFA.
The RFFA methods developed in this study were based on the database available in eastern Australia. It is expected that the availability of a more comprehensive database (in terms of both quality and quantity) will further improve the predictive performance of both the fixed region and ROI based RFFA methods presented in this study; this should be investigated in future when such a database becomes available.
5.7 UNCERTAINTY ESTIMATION FOR NEW SOUTH WALES, VICTORIA,
QUEENSLAND AND TASMANIA IN A ROI-PRT FRAMEWORK
Here, uncertainty in design flood estimation is examined in a BGLSR multivariate normal
distribution framework, in that the posterior variance of each flood statistic (i.e. mean,
standard deviation, and skew) was combined and the correlation structure between statistics
was preserved to assess the uncertainty associated with the flood quantiles (see section 3.10,
Equations 3.50 to 3.52 and Figure 4). It should be noted that this method only considers the
uncertainty arising from the estimation of the flood statistics i.e. sampling errors and inter-
site correlation (as mentioned in section 3.5.3 and Equation 3.31). Other uncertainties were
not considered, such as measurement errors and uncertainty about the choice of distribution.
This method was applied to all the six ARIs and selected sites in the study regions for NSW,
VIC, QLD and TAS. As an example, the results are shown for four catchments, one from each of the four states, with varying record lengths (i.e. NSW = 29 years, VIC = 41 years, QLD
= 62 years and TAS = 24 years). Figure 33 plots the 95% confidence bands from the Monte
Carlo simulation with 10,000 simulation runs (and the FLIKE at-site confidence bands) along
with the at-site and regional estimation. Figure 33 shows that the predicted (expected) quantiles (blue triangles) generally match the observed at-site FFA estimates (black circles) well; however, the result for TAS is relatively poor. It is also reassuring that the quantiles increase with increasing ARI. Taking the case of site 203012 for NSW and ARI = 100 years, the confidence interval ranges from 303 m3/s to 1597 m3/s, which indicates a medium to large uncertainty. However, the result may not be considered poor, as it matches up reasonably well with the FLIKE at-site confidence limit values (409 m3/s to 2513 m3/s). Overall, the uncertainty bands estimated for the regional approach were larger than the at-site ones, which is as expected. This may be because the BGLSR model corrects for sampling variability and because there is generally more uncertainty associated with regional estimation. Finally, it can also be seen that the uncertainty increases considerably with increasing ARI. In any case, the framework presented here provides a relatively reliable basis for uncertainty analysis, which would be of great benefit in real-world applications.
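As a sketch of the simulation step described above (an illustration only, with invented posterior values): the three LP3 statistics in log space are drawn jointly from a multivariate normal posterior, so the correlation between them is preserved; each draw is converted to a quantile and the 2.5/97.5 percentiles of the draws form the confidence band. The Wilson–Hilferty approximation is used here for the LP3 frequency factor, which is an assumption — the thesis may evaluate the LP3 quantile exactly.

```python
import numpy as np
from statistics import NormalDist

def lp3_quantile(m, s, g, ari):
    """LP3 quantile from log-space mean m, std s and skew g, using the
    Wilson-Hilferty frequency-factor approximation."""
    z = NormalDist().inv_cdf(1.0 - 1.0 / ari)
    if abs(g) < 1e-8:
        k = z
    else:
        k = (2.0 / g) * ((1.0 + g * z / 6.0 - g * g / 36.0) ** 3 - 1.0)
    return np.exp(m + k * s)

def mc_confidence_band(mu, cov, ari, n_sim=10_000, seed=42):
    """Draw (mean, std, skew) jointly from a multivariate normal posterior
    and return the 2.5th, 50th and 97.5th percentiles of the quantile."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mu, cov, size=n_sim)
    # abs(s) guards against the rare negative std draw from the normal
    q = np.array([lp3_quantile(m, abs(s), g, ari) for m, s, g in draws])
    return np.percentile(q, [2.5, 50.0, 97.5])

# Invented posterior for illustration (log-space mean, std, skew)
mu = np.array([5.5, 0.9, -0.2])
cov = np.array([[0.020, 0.004, 0.001],
                [0.004, 0.010, 0.002],
                [0.001, 0.002, 0.050]])
lo, med, hi = mc_confidence_band(mu, cov, ari=100)
print(f"Q100 ~ {med:.0f} m3/s (95% CI {lo:.0f} - {hi:.0f})")
```

Keeping the off-diagonal covariance terms is the key design point: ignoring the correlation between the mean, standard deviation and skew would misstate the width of the quantile confidence band.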
Figure 33 Design flood quantile estimation and confidence limits curves for ARIs of 2 to 100 years
5.8 SUMMARY
This chapter has developed and compared flood prediction equations for the states of New
South Wales, Victoria, Queensland and Tasmania (for 6 ARIs, Q2 to Q100). Both fixed
regions and ROI approaches in a QRT and PRT framework were used, where the quantiles
and parameters (i.e. mean, standard deviation and skew) of the LP3 distribution were
regressed against catchment characteristics predictor variables. The BGLSR procedure was
adopted for the estimation of the regression model coefficients. To assess the performances
of the developed prediction equations a LOO validation procedure was adopted. Overall, it
was found that the QRT-ROI and PRT-ROI perform very similarly and that the PRT is a viable alternative for design flood estimation in ungauged catchments. The developed prediction equations allow design flood or flood statistic estimates, along with their associated uncertainty (in the form of confidence limits), to be made at any ungauged catchment given the relevant catchment characteristics data.
CHAPTER 6: RESULTS - MODEL VALIDATION USING LOO AND
MCCV
6.1 GENERAL
This chapter presents the results of the comparison of the leave-one-out (LOO) and Monte Carlo cross validation (MCCV) techniques in a hydrological regression framework. Both ordinary least squares regression (OLSR) and generalised least squares regression (GLSR) are applied to the experimental and real datasets. This chapter aims to outline the overall advantages and disadvantages of the proposed methods for model selection and validation.
The basic theory and assumptions associated with the LOO and MCCV, in both an OLSR and a GLSR framework, are discussed in Chapter 3.
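The two validation schemes differ only in how the data are split: LOO leaves each site out exactly once, whereas MCCV repeatedly holds out a random validation set of nv sites. A minimal OLSR sketch follows (illustrative only — the thesis applies the same idea with GLSR weighting; the synthetic data and function names are invented):

```python
import numpy as np

def ols_fit_predict(x_cal, y_cal, x_val):
    """Fit OLS (with intercept) by least squares and predict at x_val."""
    a_cal = np.column_stack([np.ones(len(x_cal)), x_cal])
    beta, *_ = np.linalg.lstsq(a_cal, y_cal, rcond=None)
    a_val = np.column_stack([np.ones(len(x_val)), x_val])
    return a_val @ beta

def loo_msep(x, y):
    """Leave-one-out: n calibration runs, each leaving one site out."""
    errs = [y[i] - ols_fit_predict(np.delete(x, i, axis=0),
                                   np.delete(y, i), x[i:i + 1])[0]
            for i in range(len(y))]
    return float(np.mean(np.square(errs)))

def mccv_msep(x, y, nv, n_splits=200, seed=0):
    """Monte Carlo CV: repeatedly hold out a random set of nv sites."""
    rng = np.random.default_rng(seed)
    sq = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        val, cal = idx[:nv], idx[nv:]
        pred = ols_fit_predict(x[cal], y[cal], x[val])
        sq.extend((y[val] - pred) ** 2)
    return float(np.mean(sq))

# Synthetic example: one informative predictor plus one irrelevant one
rng = np.random.default_rng(1)
x = rng.normal(size=(60, 2))        # x[:, 1] is irrelevant
y = 2.0 + 1.5 * x[:, 0] + rng.normal(scale=0.5, size=60)
print(loo_msep(x, y), mccv_msep(x, y, nv=20))
```

Run over candidate predictor subsets, the subset minimising the cross-validated MSEP is selected; the results below show how this choice differs between LOO and MCCV.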
6.1.1 PUBLICATIONS
A journal paper (ERA rank A*) based on this chapter has been accepted for publication. The paper is reproduced in Appendix A and is referenced as follows.
Haddad, K., Rahman, A., Zaman, M. and Shrestha, S. (2013). Applicability of Monte Carlo
Cross Validation Technique for Model Development and Validation Using Generalised Least
Squares Regression. Journal of Hydrology, doi.org/10.1016/j.jhydrol.2012.12.041.
6.2 RESULTS
6.2.1 PREDICTORS USED
The summary statistics of the predictor variables used in this analysis are provided in Table 24, while Table 25 presents the correlations between the log-transformed predictor variables. It can be seen that there is significant collinearity and multicollinearity between the design rainfall intensities (correlations ranging from 0.73 to 0.94), medium correlation between rain and evap (0.52) and between evap and the design rainfall intensities (ranging from 0.40 to 0.58), and modest correlation between sden and rain (0.27) and between sden and evap (0.36).
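Such a correlation screen is produced directly from the matrix of log-transformed predictors. A minimal sketch with invented data (the column names echo Table 25 but the values are synthetic, constructed so that the two intensity columns move together):

```python
import numpy as np

# Illustrative log-transformed predictor columns (rows = catchments)
rng = np.random.default_rng(7)
base = rng.normal(size=100)
predictors = {
    "2I1":  base + 0.1 * rng.normal(size=100),   # intensities share a driver
    "2I12": base + 0.1 * rng.normal(size=100),
    "rain": 0.6 * base + rng.normal(size=100),
    "sden": rng.normal(size=100),                # largely unrelated
}
names = list(predictors)
mat = np.vstack([predictors[n] for n in names])  # variables as rows
corr = np.corrcoef(mat)
for i, n in enumerate(names):
    print(n, np.round(corr[i], 2))
```

The two intensity columns come out highly correlated, mimicking the collinearity among design rainfall intensities seen in Table 25; such pairs should not both enter a regression model.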
Table 24 Summary of predictor variables (here log10 is used)

Predictor variable                        Minimum   Maximum   Mean     Standard deviation
log(area) (km2)                           2.08      6.92      5.43     1.12
log(2I12) (mm/h)                          1.29      2.49      1.77     0.30
log(2I1) (mm/h)                           2.97      3.91      3.33     0.23
log(50I12) (mm/h)                         1.94      3.27      2.46     0.36
log(50I1) (mm/h)                          1.62      1.97      1.76     0.10
log(Itc,ARI), ARI = 10-year (mm/h)        1.94      3.58      2.58     0.42
log(Itc,ARI), ARI = 100-year (mm/h)       2.35      3.97      3.02     0.43
log(evap) (mm)                            6.89      7.34      7.11     0.10
log(rain) (mm)                            6.23      7.58      6.87     0.28
log(sden) (km/km2)                        -0.66     1.70      0.92     0.47
log(S1085) (m/km)                         0         3.91      2.2      0.81
log(forest) (fraction)                    -4.61     0         -1.01    1.08
Table 25 Correlation between the log10 predictor variables used in the analysis
            area   2I1    2I12   50I1   50I12  Itc,ARI=10  Itc,ARI=100  rain   evap   sden   S1085  forest
area 1.00
2I1 -0.08 1.00
2I12 -0.09 0.94 1.00
50I1 0.02 0.94 0.88 1.00
50I12 -0.07 0.92 0.97 0.90 1.00
Itc, ARI=10 -0.70 0.73 0.76 0.65 0.75 1.00
Itc, ARI=100 -0.67 0.73 0.75 0.67 0.77 0.99 1.00
rain -0.24 0.68 0.77 0.54 0.71 0.66 0.63 1.00
evap -0.13 0.58 0.53 0.40 0.49 0.43 0.40 0.52 1.00
sden -0.19 0.31 0.30 0.22 0.26 0.29 0.28 0.27 0.36 1.00
S1085 -0.28 -0.09 -0.02 -0.02 0.02 0.17 0.19 -0.07 -0.27 0.07 1.00
forest 0.15 0.20 0.32 0.23 0.34 0.12 0.14 0.27 -0.07 0.20 0.31 1.00
6.2.2 SIMULATED DATA
A number of simulation runs were undertaken on different models with varying random errors. Here we discuss the simulation based on the model given by Equations (3.68) and (3.69). The results for the OLSR are summarised in Tables 26 and 27, while the results for the GLSR are provided in Tables 28 and 29. The summary tables also provide the results based on the true-model MSEP for both the OLSR and GLSR models.
For the LOO (i.e. nv = 1), the selected model tends to include a greater number of predictor variables than required, as evidenced by the inclusion of many more predictor variables than at the higher nv values. This feature is evident for both the OLSR and GLSR techniques. As an example, in Tables 26 and 27, for the OLSR LOO (where nv = 1) with σ = 1, x1 alone is selected in only 42% (210/500) of the cases, while for σ = 0.2, x1 is selected in 51% (253/500) of the cases. The GLSR results also suffer from overfitting (see Tables 28 and 29); however, the chances of selecting the right model do increase with the GLSR. As an example, for σδ = 0.95, x1 is selected in 53% (263/500) of the cases, while for σδ = 0.25, x1 is selected in 64% (318/500) of the cases. Another important aspect of the LOO, for both the OLSR and GLSR, is that it tends to underestimate the MSEP of the true model and calibration data set as compared to the higher nv values. Figure 34 illustrates this: as nv increases, the MSEP also increases. It is thus evident that the LOO lends itself to overfitting the selected regional regression model.
For the MCCV case, when nv = 45, x1 is included in 475 and 492 instances for the OLSR when σ = 1 and σ = 0.2, respectively. This gives MSEPs of 3.50 and 1.49 for the CMCCV case (see Tables 26 and 27), as compared to 491 and 499 instances for the GLSR for σδ = 0.95 and σδ = 0.25, respectively, with CMCCV MSEPs of 1.61 and 0.52 (see Tables 28 and 29). For nv = 1, for both the OLSR and GLSR, the MSEPs are 1.68, 0.61, 0.27 and 0.050 (see Tables 26 to 29), which are relatively smaller than the LOO values for the calibration data set (i.e. 2.02, 0.77, 0.41 and 0.11). This implies that the LOO, particularly with the OLSR, has a much higher chance of selecting a larger model (i.e. a model with a higher number of predictor variables). From Tables 26 to 29 it can be seen that the MSEP values based on the model selected by the LOO are always greater than the true MSEPs (e.g. by 8%, i.e. (2.02 − 1.87)/1.87, in Table 26 for the OLSR when σ = 1).
Tables 26 and 27 also reveal that collinearity (i.e. between variables x1 and x2) is more prominent for the OLSR LOO case, especially when the random errors are highly spread (σ = 1). This can also be seen in Figure 34 for nv = 1, where the combined variables x1 and x2 have MSEP values relatively close to that of variable x1 alone. For the GLSR, collinearity is not a major issue for either σδ = 0.95 or σδ = 0.25 (see Tables 28 and 29) and the varying cross correlation between sites.
For example, x1 and x2 (which were made highly correlated; see Chapter 3, section 3.11.4) appear in the model many more times in the OLSR (e.g. 155, 187, ... times in Table 26) than in the GLSR (e.g. 93, 105, ... times in Table 28). Since the GLSR analysis recognises the sampling error as a component separate from the total error, it seems that the GLSR can distinguish between the predictor variables much better than the OLSR. Because sampling error and model error are lumped together in the OLSR, the OLSR model pushes for more predictor variables to compensate for the higher model uncertainty. From these results, it can be seen that the GLSR, with a relatively high spread of error (e.g. σδ = 0.95) and modest correlation between sites, provides reasonable results with the LOO validation as compared to the OLSR LOO case. Hence, it may be concluded that the LOO is better suited to the GLSR than to the OLSR in regional hydrologic regression.
From Tables 26 to 29 the following points may be noted. The chance for the MCCV to
select the true model (that includes only x1 as predictor) increases with increasing nv. This
can be observed with both the OLSR and GLSR models; however, the results for the GLSR
are slightly better. Uncertainty is therefore reduced for the model selected by the MCCV
(i.e. decrease in over fitting). What is also noticeable from Tables 26, 27, 28 and 29 is that
as nv increases some of the predictor variable combinations are not selected at all (i.e.
shown as zero in the table). This illustrates that in most cases the MCCV method would
choose the best model. Looking at Figure 34 for nv = 25 and 35, it is evident that both the
OLSR and GLSR MCCV select the predictor variable x1 consistently better than any other
variable. This is especially true for the GLSR MCCV, as it has the smaller MSEPs.
When the MSEP (i.e. predictive variance) is smaller and when there is medium to high
correlation between sites, the GLSR MCCV should be the preferred option for validation
CHAPTER 6
166
(as evident in Figure 34). The GLSR with modest cross correlation and larger random
errors also provides relatively better results in most cases. In addition, the collinearity
seems to have no major influence in choosing the correct predictor variable for the MCCV
case (see Figure 34, i.e. nv = 15, 25 and 35). Furthermore, the GLSR appears to be the
superior regression approach when the model errors are modest and when there is
reasonable sampling uncertainty from site to site.
In all cases, the MSEP values for the MCCV depend significantly on nv. From Tables 26 to 29, it is clear that using the MCCV to estimate the MSEP of the selected model when nv > 25 may not be appropriate, as the MSEP increases with nv (the calibration set shrinks as nv increases). In nearly all cases for the OLSR and GLSR, across the varying random errors and cross correlations, the MCCV estimates the MSEP of the selected model with a level of accuracy similar to that of the CMCCV (for Equations 3.68 and 3.69), and the CMCCV stays within acceptable limits of the true MSEP up to nv = 25. The CMCCV may therefore be a good candidate for estimating the overall prediction ability of the selected model, as it tends to stay within acceptable limits around the MSEP of the selected model. Thus, with nv = 15 to 25 (representing 30% to 50% of the catchments), the MCCV and CMCCV estimate the MSEP with reasonable accuracy.
Table 26 Results from simulated data, OLSR when σ² = 1 (model as in Equation (3.68))

     Frequencies of variables being selected        Values of optimal MSEP
nv   x1    x1,x3   x2,x3   x1,x2   x2 or x3         LOO/MCCV   CMCCV   TMSEP
1    210   83      17      155     35               1.68       2.02    1.87
15   290   25      0       187     3                2.48       2.38    2.54
20   410   37      0       53      0                2.03       1.94    2.15
25   410   5       0       85      0                1.99       1.91    1.97
30   418   30      0       53      0                2.49       2.40    2.34
35   423   5       0       73      0                2.99       2.89    2.80
40   445   3       0       55      0                3.60       3.47    3.51
45   475   0       0       25      0                3.74       3.50    3.58
Figure 34 The mean squared error of prediction (MSEP) associated with LOO and MCCV for OLSR and GLSR simulations
Table 27 Results from simulated data, OLSR when σ² = 0.04 (model as in Equation (3.68))

     Frequencies of variables being selected        Values of optimal MSEP
nv   x1    x1,x3   x2,x3   x1,x2   x2 or x3         LOO/MCCV   CMCCV   TMSEP
1    253   48      33      98      68               0.61       0.77    0.74
15   280   25      17      163     16               0.72       0.63    0.76
20   365   60      0       75      0                1.28       1.19    1.39
25   393   55      0       52      0                1.16       1.08    1.27
30   445   25      0       30      0                1.31       1.22    1.26
35   469   15      0       16      0                1.42       1.32    1.38
40   481   8       0       11      0                1.59       1.46    1.51
45   492   3       0       5       0                1.72       1.49    1.65
Table 28 Results from simulated data, GLSR when σ² = 0.903 and average cross-correlation ρ̂(ŷi, ŷj) = 0.30

Frequencies of variables being selected and values of optimal MSEP (based on Eq. 3.69):

nv   x1    x1,x3   x2,x3   x1,x2   x2 or x3   LOO    MCCV   CMCCV
1    263   3       88      93      53         0.27   0.41   0.36
15   350   28      15      105     3          0.40   0.30   0.47
20   370   83      35      13      0          0.54   0.45   0.64
25   397   80      23      0       0          0.94   0.86   1.10
30   420   65      5       10      0          1.44   1.35   1.42
35   460   25      5       10      0          1.53   1.43   1.50
40   483   15      0       2       0          1.71   1.58   1.65
45   491   8       0       1       0          1.80   1.61   1.72
Table 29 Results from simulated data, GLSR when σ² = 0.063 and average cross-correlation ρ̂(ŷi, ŷj) = 0.70

Frequencies of variables being selected and values of optimal MSEP (based on Eq. 3.69):

nv   x1    x1,x3   x2,x3   x1,x2   x2 or x3   LOO     MCCV    CMCCV
1    318   68      35      45      35         0.050   0.11    0.095
15   470   0       0       10      20         0.081   0.065   0.088
20   475   10      2       5       8          0.12    0.10    0.136
25   480   10      0       10      0          0.12    0.10    0.132
30   480   15      0       5       0          0.21    0.13    0.18
35   488   5       5       3       0          0.22    0.13    0.20
40   491   3       3       3       0          0.32    0.22    0.27
45   499   1       0       0       0          0.63    0.52    0.54
6.2.3 APPLICATION WITH OBSERVED REGIONAL FLOOD DATA IN NSW
Of the 12 predictor variables shown in Table 24, some may have only minor effects
on the estimation of the 10-year and 100-year average recurrence interval (ARI) flood
quantiles (Q10, Q100). In order to select the best set of predictor variables for the regression
models, LOO and MCCV in the OLSR and GLSR frameworks were initially applied to the
calibration data set (60 sites were selected randomly out of the 96 as the calibration data
set). The results are listed in Tables 30 and 31. The optimal OLSR and GLSR LOO both
select three predictor variables. The obtained models along with some summary statistics
are provided in Table 32.
In the MCCV (considering nv = 15, 20, 25 and 30 catchments during the validation and
undertaking 500 simulations), the optimal OLSR and GLSR MCCV each select two
predictor variables as shown in Table 32.
From a goodness-of-fit perspective, there is no notable difference between the
models presented in Table 32: the coefficients of the regression equations are very
similar, and the summary statistics (i.e. R2/R2GLSR and the standard error of prediction,
SEP (%)) also show some resemblance between the OLSR and GLSR. However, when
comparing the performances of the four different models from Tables 30 and 31 on the
prediction data sets, the differences can be clearly illustrated. Initially from Tables 30 and
31, it can be clearly seen that the GLSR models provide the lowest MSEPs for the LOO
and the MCCV suggesting that the sampling errors have had a relatively notable impact in
the analysis. From Table 30, the OLSR LOO provides an MSEP of 0.11, which is
significantly larger than 0.042, the MSEP based on the OLSR MCCV (for nv = 25). From
Table 31, the GLSR LOO provides an MSEP of 0.092, which is also significantly larger
than 0.016, the MSEP based on the GLSR MCCV case (for nv = 20, 25 and 30).
Tables 30 and 31 clearly indicate that the LOO validation (for both the OLSR and GLSR)
has included one additional, unnecessary predictor variable in the Q10 model. What is also
striking is that the OLSR and GLSR at nv ≥ 20 both select the same predictor variables
even though there was considerable multicollinearity between the potential predictor
variables (as shown in Table 25). This shows that the MCCV is not adversely affected by
multicollinearity and that MCCV would most often provide the best model when significant
multicollinearity is present. From Tables 30 and 31, the MSEP values can be considered to
be relatively smaller for both the OLSR and GLSR; this is however more noticeable for the
GLSR, which again reiterates the fact that when the random errors are relatively smaller,
the MCCV is likely to provide the best results for both the OLSR and GLSR cases.
What is noteworthy is the relatively better correction given by the CMCCV to estimate the
MSEP when nv = 20 for both the OLSR and GLSR (see Tables 30 and 31). As nv increases,
the reliability of the CMCCV is also reasonable even though there are fewer sites for model
building and the error for the CMCCV to estimate MSEP may increase a little in this
situation. It is thus found that the MCCV selects a better model (with smaller number of
predictor variables) than the LOO for both the OLSR and GLSR cases. The results in Tables
30 and 31 are mostly in agreement with the results from the numerical experiments.
Figure 35 shows the graphical results of the prediction errors (i.e. predicted - observed) of
the predicted flood quantile obtained by the regression equations in Table 32 for the 36
validation catchments against at-site flood frequency estimates. Clearly the prediction
errors are smaller for the GLSR LOO and GLSR MCCV cases. The prediction performance
is better for the MCCV models in both cases. This shows the typical manifestation of
over-fitting often caused by the LOO validation approach. Typically, the results look good
for the LOO for the calibration data set; however, when one needs to predict future samples
(i.e. ungauged catchment prediction) MCCV should be used in selecting the optimal
hydrologic regression models. This would lead to less uncertainty in regional flood quantile
estimation. These results also suggest that the GLSR MCCV provides the
best model and validation procedure as compared to the OLSR.
Table 30 OLSR analysis, MSEP values for calibration and validation data sets (observed
data from NSW); log10 is used throughout

                         MSEP on calibration set    MSEP on validation set
nv   Model variables*    LOO     MCCV    CMCCV      Model by LOO   Model by MCCV
1    1, 5, 8             0.048                      0.11
15   1, 5, 7             0.050   0.045   0.048
20   1, 5                0.048   0.044   0.041
25   1, 5                0.049   0.045   0.042
30   1, 5                0.049   0.045   0.042

*Corresponding predictor variables: 1. log(area); 2. log(2I12); 3. log(2I1); 4. log(50I12); 5. log(Itc,ARI); 6. log(evap); 7. log(rain); 8. log(sden); 9. log(S1085); 10. log(forest).
Table 31 GLSR analysis, MSEP values for calibration and validation data sets (observed
data from NSW); log10 is used throughout

                         MSEP on calibration set    MSEP on validation set
nv   Model variables*    LOO     MCCV    CMCCV      Model by LOO   Model by MCCV
1    1, 5, 8             0.019                      0.092
15   1, 5, 6             0.020   0.017   0.021
20   1, 5                0.018   0.016   0.016
25   1, 5                0.018   0.016   0.016
30   1, 5                0.019   0.017   0.016

*Corresponding predictor variables: 1. log(area); 2. log(2I12); 3. log(2I1); 4. log(50I12); 5. log(Itc,ARI); 6. log(evap); 7. log(rain); 8. log(sden); 9. log(S1085); 10. log(forest).
Table 32 OLSR and GLSR analysis for LOO and MCCV for Q10, optimal models shown
along with summary statistics

Regression type/   Regression equation                                       R2 / R2GLSR   SEP (%)
validation
OLSR LOO           2.50 + 1.13log(area) + 1.85log(Itc-10) + 0.07log(sden)    79%           32%
GLSR LOO           2.51 + 1.13log(area) + 1.80log(Itc-10) + 0.05log(sden)    81%           29%
OLSR MCCV          2.49 + 1.14log(area) + 1.88log(Itc-10)                    79%           33%
GLSR MCCV          2.51 + 1.13log(area) + 1.82log(Itc-10)                    81%           30%
Figure 35 Prediction error plot for Q10 results (models selected by OLSR and GLSR LOO and models
selected by OLSR and GLSR MCCV); prediction error plotted against site number for the 36 validation catchments
The MSEP values for the quantile estimate for Q100 are listed in Table 33. Initially, LOO is
carried out on the calibration data set of 60 catchments. The optimal OLSR LOO selects 4
predictor variables in the model, which are log(area), log(Itc_100), log(rain) and log(S1085)
and the optimal GLSR LOO selects 3 predictor variables, which are log(area), log(Itc_100)
and log(rain). The obtained model along with the summary statistics is provided in Table
34.
MCCV was then carried out using both the OLSR and GLSR on the validation data set of
36 catchments. Leaving out 50% of the catchments at a time (i.e. nv = 18) for validation and
performing Monte Carlo simulation 500 times, it is found that the optimal OLSR MCCV
selects 3 predictor variables (log(area) , log(Itc_100) and log(rain)) while the optimal GLSR
MCCV selects 2 predictor variables (log(area) and log(Itc_100)). The obtained models along
with the summary statistics are provided in Table 34.
Table 33 MSEP for ARI = 100 years

        MSEP on calibration set      MSEP on test set
        LOO     MCCV    CMCCV        Model by LOO   Model by MCCV
OLSR    0.069   0.074   0.070        0.12           0.096
GLSR    0.045   0.060   0.055        0.090          0.083
Table 34 OLSR and GLSR analysis for LOO and MCCV for Q100, optimal models shown
along with summary statistics

Regression type/   Regression equation                                                       R2 / R2GLSR   SEP (%)
validation
OLSR LOO           2.97 + 1.07log(area) + 2.07log(Itc_100) - 0.88log(rain) - 0.15log(S1085)  71%           30%
GLSR LOO           3.01 + 1.04log(area) + 1.84log(Itc_100) - 0.59log(rain)                   70%           26%
OLSR MCCV          2.96 + 1.09log(area) + 2.02log(Itc_100) - 0.70log(rain)                   70%           32%
GLSR MCCV          3.02 + 1.02log(area) + 1.59log(Itc_100)                                   69%           25%
From the comparison of all the regression equations in Table 34 it is evident that the
performances of these models are very similar, i.e. they all have SEP values within
similar ranges; however, the GLSR SEP values are slightly better. In terms of R2 and R2GLSR,
it can be seen that the OLSR LOO has slightly higher values. What can also be observed is
the number of predictor variables in the OLSR LOO model. Referring to Table 25, it can be
seen that log(Itc_100) and log(rain) are moderately correlated; this may therefore introduce
the problem of over fitting. This result is similar to the result found in the simulation study
where the OLSR LOO tended to include more predictor variables for the true model (see
Tables 26 and 27). Therefore, in the case of prediction ability, the conclusion that OLSR
LOO is the best model due to a higher R2 may be deceptive. In order to confirm this all the
regression equations in Table 34 were finally used to make predictions on the validation
data set of 36 catchments.
Figure 36 shows the graphical results from this validation. It is observed that the prediction
performances of the OLSR MCCV, GLSR LOO and GLSR MCCV are all slightly better
than that of the OLSR LOO; in fact, the GLSR MCCV is the best performer even though it has
only 2 predictor variables and a slightly smaller R2GLSR. The fact that the GLSR has the
smaller prediction errors, and in turn the lower MSEPs (i.e. predictive uncertainties, see
Table 33), actually reduces the need to have more predictor variables in the model. This is
in line with the simulation results, where it was found that the GLSR tended to pick the true
model more frequently than the OLSR LOO and OLSR MCCV, and more so when the MSEPs
were relatively smaller. From Table 33, the MSEP value for the OLSR LOO is 0.12, which
is notably larger than those for the OLSR MCCV (0.096), GLSR LOO (0.090) and GLSR MCCV
(0.083).
Figure 36 Prediction error plot for Q100 results (models selected by OLSR and GLSR LOO and models
selected by OLSR and GLSR MCCV); prediction error plotted against site number for the 36 validation catchments
Clearly these results indicate that there might be problems with the OLSR LOO model,
which has four predictor variables. The fitting performance appears better at first glance;
however, the extra predictor variables are unnecessary for the model and can reduce its
prediction ability; in other words, additional uncertainty is introduced into the model by
over-fitting. An
important fact is that, when estimating the prediction ability of the model, the optimal
OLSR LOO and GLSR LOO on the calibration data set seem to both underestimate the
MSEP on the validation data set. This is evident in Table 33 where the MSEP values of the
optimal OLSR LOO and GLSR LOO for the calibration data set are smaller than that of the
validation data set (0.069 < 0.12 and 0.045 < 0.090 for the OLSR LOO and GLSR LOO,
respectively). For the OLSR MCCV and GLSR MCCV, the MSEP of the optimal MCCVs
on the calibration data set are 0.074 and 0.060 respectively, which are also greater than the
MSEPs by the OLSR LOO and GLSR LOO, respectively. This supports the notion that the
MCCV most often would report a better or more accurate estimate of MSEP for the
selected model as compared to the LOO approach. The results in Tables 33 and 34 are
mostly in agreement with the simulation results.
6.3 SUMMARY
Selection of the right regression model and estimation of its predictive ability are important
steps in regional hydrologic regression analysis, which are usually undertaken by some
kind of validation. This study assesses the performances of the most commonly adopted
LOO validation with the relatively new MCCV procedure. This analysis is carried out
under the frameworks of OLSR and GLSR for the estimation of flood quantiles. This study
uses a simulated data set and observed regional flood data set from the state of New South
Wales in Australia.
It has been found that when developing regional hydrologic regression models, application
of the GLSR MCCV is likely to result in a more parsimonious model than the OLSR LOO,
OLSR MCCV and GLSR LOO cases. The GLSR MCCV has been found to show the
smallest mean squared errors and fewer instances of problems with collinearity as
compared to the OLSR LOO and OLSR MCCV cases. It has also been found that the
MCCV and corrected Monte Carlo cross validation (CMCCV) can provide a more
reasonable estimate of a model’s predictive ability than the LOO. Furthermore, the
CMCCV has the potential to offer reasonable improvement over the MCCV in estimating
the predictive ability of a regional hydrologic regression model.
The findings of this study have major implications for the usual practice in hydrologic
regression analysis of estimating regression coefficients using automated statistical
packages, which rely solely on the statistical significance of the regression coefficients
in selecting an appropriate regression model. While in some cases the selected models,
developed using statistical packages, seem to perform well, they may not perform equally
well when applied to a real ungauged catchment, as these models have not been
extensively validated using a more powerful model validation technique such as MCCV.
CHAPTER 7
177
CHAPTER 7: BACKGROUND AND DEVELOPMENT OF THE
LARGE FLOOD REGIONALISATION MODEL AND ISSUES
RELATING TO SPATIAL DEPENDENCE
7.1 GENERAL
Firstly, this chapter provides an overview of inter-site dependence in annual maximum
flood series (AMFS) data for Australia. Secondly, the determination of homogenous
regions and the identification of an appropriate probability distribution are discussed in
some detail. A brief outline of the formulation of the heterogeneity measure by Hosking
and Wallis (1993) and the bootstrap Anderson-Darling (AD) test is then given. The
development and calibration of the large flood regionalisation model (LFRM) under an
assumption of spatial independence are then presented and discussed. The issues relating to
concurrent record lengths for the establishment of meaningful networks for the analysis of
spatial dependence are also presented. The theoretical aspects of inter-site dependence and the
estimation of the number of independent sites (Ne) in regional flood frequency analysis
(RFFA) using a simple model based on the generalised extreme value methodology are also
discussed. Finally, given the limitations of the real data set to give clearly meaningful
results in relation to the derivation of Ne because of issues with sampling variability and
homogeneity, this chapter discusses how synthetic datasets were generated for each of the
regions for use in the analysis.
7.1.1 PUBLICATIONS
A journal paper (ERA, rank B) has been published (details below and full paper in
Appendix A) regarding the initial pilot study undertaken on the LFRM for the states of
New South Wales (NSW) and Victoria (VIC). The work presented in this chapter and
Chapter 8 is an extension to the work presented in the published paper which is based on
the data from all over Australia and a new spatial dependence model for the AMFS data.
Haddad, K., Rahman, A. and Weinmann, P.E. (2011b). Estimation of major floods:
applicability of a simple probabilistic model, Australian Journal of Water Resources, 14
(2), 117-126.
7.2 LFRM CONCEPT
The LFRM concept is identical to the basic concept of station-year methods: observed data
from an assumed homogenous region are pooled and a non-parametric flood frequency
curve is fitted on a probability plot. The homogeneity assumption for the LFRM concept is
very similar to that used in the index flood approaches; however, the traditional index
flood approach achieves an acceptable degree of homogeneity within the region by
standardising by the at-site mean or median values. In the same spirit, the LFRM
standardises by taking into account not only the at-site mean but also the at-site CV of
the time series data. This form of standardisation allows the
section 2.8.1 for more details). Indeed, it is well known that any station-year method
suffers from problems associated with inter-site dependence (see section 2.2.3). These
issues have been minimised in the LFRM by using an effective number of independent
stations concept, similar to CRC-FORGE (Nandakumar et al., 1997, 2000), as described in
sections 7.8 and 7.9.
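The pooling step of the LFRM can be sketched as follows. The exact standardisation is given by the thesis equations; the form z = (q/mean - 1)/CV below is one common mean-and-CV standardisation and should be read as a hypothetical stand-in for it:

```python
import numpy as np

def lfrm_standardise(q):
    """Standardise an annual maximum flood series by its at-site mean
    and coefficient of variation (CV). The transformation below,
    z = (q/mean - 1)/CV, is an illustrative mean-and-CV scaling."""
    q = np.asarray(q, dtype=float)
    mean = q.mean()
    cv = q.std(ddof=1) / mean
    return (q / mean - 1.0) / cv

def lfrm_data_series(stations, top=5):
    """Pool the `top` largest standardised maxima from each station to
    form the 'LFRM data series' (the LFRM uses up to rank 5 data)."""
    pooled = []
    for q in stations:
        z = np.sort(lfrm_standardise(q))[-top:]  # top standardised maxima
        pooled.extend(z)
    return np.sort(pooled)[::-1]                 # ranked largest first
```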
7.3 INTER-SITE DEPENDENCE IN GENERAL FOR THE LFRM
The LFRM technique presented by Majone et al. (2007) (called the Probabilistic Model) and
its further enhanced version by Haddad et al. (2011b) ignore the inter-site dependence
structure of the pooled standardised data, where the highest data point from each station’s
annual maximum flood series (after standardisation) is combined with those from the other
stations in the region to form a database referred to as ‘LFRM data series’. It was assumed
that the individual values in the LFRM data series are independent. This assumption may
be valid if the data being pooled come from stations that are spread over a very large
region. However, examination shows (Figure 37) that values in the LFRM data series used
in this study tend to cluster in some years, with very few events in other years. This appears
to violate the assumption of independent distribution of the events in time and indicates
that some of the events occurring in the same year might have resulted from the same
hydro-meteorological event. For example, because Australia is quite large, the same
meteorological system may cause floods in different parts of the country up to a few weeks
apart, and such events cannot be treated as independent. However, if the starts of two events
are separated by a sufficient period they may be treated as independent; a separation of at
least one month may safely be taken as a criterion for meteorological independence.
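The one-month separation criterion can be applied mechanically by scanning the event start dates, as in the sketch below (the 30-day threshold and the retain-first rule are illustrative choices; in practice one might retain the largest event of a cluster instead):

```python
from datetime import date, timedelta

def independent_events(starts, min_gap_days=30):
    """Keep only events whose start dates are at least `min_gap_days`
    apart; within any cluster of closer events, the earliest is kept."""
    kept = []
    for start in sorted(starts):
        if not kept or (start - kept[-1]) >= timedelta(days=min_gap_days):
            kept.append(start)
    return kept
```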
Significant inter-site dependence between events in the pooled series of annual maxima
used in RFFA will result in the effective size of the sample being over-estimated, and the
annual exceedance probabilities of given flood magnitude being underestimated. The
testing of the LFRM by Haddad et al. (2011b) has demonstrated that if the Australian
LFRM data series is assumed to be independent, the LFRM tends to underestimate the at-
site flood frequency estimates. It was shown by Haddad et al. (2011b) that 17 out of the 18
test catchments gave an underestimation by 7% to 40%. This result clearly indicates that
the issue of inter-site dependence needs to be addressed for successful application of the
LFRM in Australia. It should be mentioned here that in estimating the inter-site correlation,
the concurrent record lengths were considered i.e. the start and end years were the same for
a pair of stations.
The dependence structure among the concurrent AMFS data of all the possible pairs of
sites, irrespective of their ranks, (these data have been prepared as a part of ARR Project 5)
was examined and it was found that the cross-correlation coefficients are quite high for the
nearby pairs of sites. An example is shown in Figure 38 where two nearby VIC stations
(Stations 221201 and 221207) show a dependence structure (i.e. cross-correlation
coefficient of 0.96). The correlation vs. distance between pairs of stations in VIC is shown
in Figure 39, which indicates that the AMFS data have cross-correlation close to 1 for some
nearby stations, but cross-correlation reduces with distance sharply. Also, high correlation
is a dominant issue only for a limited number of pairs of stations.
Figure 37 Occurrences of the highest floods (standardised discharge Q/mean plotted against year, 1910 to
2010) – data from NSW, QLD, VIC and TAS are combined (only the highest value from each station’s
AMFS data is taken to form the LFRM data series)
Figure 38 Cross-correlation between two nearby Victorian Stations 221201 and 221207 (considering all
concurrent AMFS data over the period of record – only 21 data points are concurrent for this pair of
stations); cross-correlation r = 0.96, fitted line y = 0.7965x - 826.47 (R2 = 0.9234), axes Q (ML/day) at
Station 221201 vs. Q (ML/day) at Station 221207
The cross-correlation between two stations based on all the concurrent AMFS data has limited
relevance to the LFRM, as this model uses only data up to rank 5, i.e. the five highest
flood values from the annual maximum series of each station. Also, the degree of
correlation for rare flood events might not be the same as for relatively frequent events. A
viable approach would be to use average cross-correlation considering all the concurrent
AMFS data from all the possible pairs of stations in the database and develop a spatial
dependence model similar to the CRC-FORGE method (Nandakumar et al., 1997). This
model can then be used to account for the spatial dependence in the LFRM data series in
flood quantile estimation using the LFRM.
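A spatial dependence model of this kind can be built by fitting a simple decay function to the pairwise correlation-distance data (as plotted in Figure 39). The sketch below fits rho(d) = rho0*exp(-d/d0) by linear least squares on the log-correlations; the exponential form and the parameter names are illustrative assumptions, not the CRC-FORGE formulation:

```python
import numpy as np

def fit_correlation_decay(dist_km, corr):
    """Fit rho(d) = rho0 * exp(-d / d0) to pairwise (distance,
    correlation) data by ordinary least squares on log-correlation.
    Pairs with non-positive correlation are dropped before the log."""
    d = np.asarray(dist_km, dtype=float)
    r = np.asarray(corr, dtype=float)
    ok = r > 0
    A = np.column_stack([np.ones(ok.sum()), d[ok]])       # [1, d] design
    intercept, slope = np.linalg.lstsq(A, np.log(r[ok]), rcond=None)[0]
    return float(np.exp(intercept)), float(-1.0 / slope)  # rho0, d0
```

The fitted rho(d) can then be evaluated at the distance of any pair of stations to weight down the effective information content of nearby, highly correlated sites.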
Another approach might be to examine the start dates of the individual events which
contain the annual maxima for all the sites plotted against the same year (e.g. as in Figure
37); if the starts of the events are a few months apart from each other they may be treated
as independent. If they have resulted from the events which have occurred on the same day
or week, only one data point from these can be retained to establish an independent series.
Here, if the stations are far apart (e.g. one station from VIC and another from Queensland
(QLD)), they may be treated as independent even though they plot against the same year, as
they are most likely to have resulted from different hydro-meteorological events. This
approach requires the examination of the distances between pairs of stations and the start
and end dates of the individual events. While it is quite possible to do this, it would
require considerably more effort (extra programming) and time.
Any significant degree of dependence between the events in a regional sample reduces the
effective sample size drastically, so the most productive approach might be to establish
essentially independent networks of stations (perhaps by using the concept of de-
correlation distance as an indicator) and then only pool the maxima from such a network of
stations. Some form of constrained random sampling will need to be used to establish a
number of alternative networks of independent stations (see sections 7.8 and 7.9 for further
details).
Figure 39 Relationship between the cross-correlations among AMFS data and distance between pairs
of stations in Victoria
7.4 ANNUAL MAXIMUM DATA SET USED IN THE LFRM
As mentioned in Chapter 4, 682 gauging stations are available in Australia that have
reasonable record lengths (19 to 96 years) and are suitable for RFFA analysis. One does
expect that the useful information for RFFA increases with the increasing number of
stations in the region; however, the net information does not increase proportionally with
the increasing number of stations within a given region, due to spatial dependence between
data at gauging stations. While the shorter record lengths in this study (< 25 years) would
introduce notable uncertainty in parameter estimation, they were included as they still
contain useful additional information for the pooled data set. However, uncertainty will be
introduced through errors in the standardisation of the parameters. From the 682 stations
shown in Figure 16, two datasets for the LFRM were established: (a) From the 682 stations,
626 stations were selected that had a reasonable concurrent record length. (b) From the
remaining 56 stations, 28 stations were randomly sampled and put aside for testing and
validation with the LFRM. The selected 28 sites are shown in Figure 40.
(Figure 39 plot details, VIC: inter-site correlation and estimated correlation plotted against distance
between stations i and j in km; fitted line: correlation = 0.98 (dij/(0.009*dij + 1)))
Figure 40 Geographical distribution of the 28 validation catchments for the LFRM
7.4.1 QUALITY CHECK OF THE LARGEST ANNUAL MAXIMA DATA
Any RFFA involves processing a large amount of data; hence there is a greater chance of
data errors going unnoticed. It must also be remembered that AMFS data carry large errors
in the highest recorded flows because of rating curve extrapolation. As discussed in
section 7.3, the LFRM uses the largest 1 to 5
observed maxima values from each station in the region. Therefore, any errors in these
observations can introduce significant error into the LFRM final quantile estimates. As
discussed in sections 4.4.3, 4.4.4 and 4.5.3, a rating ratio concept was introduced and used
to cull stations with significant rating curve error. It should be noted here that the adopted
number of data points (the five largest) to be selected from each station in the LFRM has
no solid theoretical justification; however, it is evident that the number should be large
enough to make use of the information from the highest flood events in the region, and
hence the choice of ‘the five largest’ seems to be acceptable. Detailed sensitivity analysis
would be required to allow the selection of an optimum number of data points.
7.5 IDENTIFICATION OF AN APPROPRIATE PROBABILITY
DISTRIBUTION AND TESTING FOR HOMOGENEITY OF ANNUAL
MAXIMA FLOOD DATA
In this section, the most appropriate flood frequency probability distribution and the
homogeneity for the Australian data set are examined in the context of the application of
the LFRM technique.
7.5.1 SEARCHING FOR AN APPROPRIATE PROBABILITY DISTRIBUTION
As shown by Majone et al. (2007) and Haddad et al. (2011b), the LFRM concept is
primarily non-parametric and therefore an assumption regarding a particular distribution is
not required. However, it will be shown in section 7.9 that a probability distribution is
fitted to the annual maxima in order to derive a generic relationship for the effective
number of stations (Ne, which is used to adjust the plotting position of the LFRM points).
It can be clearly seen in the literature that the generalised extreme value (GEV) has been
widely used and recommended to describe RFFA extreme data (e.g. see section 2.8.1). The
GEV distribution fitted using the regional L-moment approach has been shown to be
computationally simple. The L-moments are analogous to the conventional moments;
however, they have several theoretical advantages, e.g. being able to model a wider range
of distributions and when estimated from a sample they tend to be more robust to the
presence of outliers in the dataset (Hosking and Wallis, 1997).
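Fitting a GEV by L-moments follows the standard probability-weighted moment recipe of Hosking and Wallis (1997); a minimal single-site sketch is:

```python
import numpy as np
from math import gamma, log

def sample_lmoments(x):
    """First three sample L-moments via unbiased probability-weighted
    moments; returns (l1, l2, tau3) = (mean, L-scale, L-skewness)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    l1, l2, l3 = b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0
    return l1, l2, l3 / l2

def gev_from_lmoments(l1, l2, t3):
    """Hosking's approximation for GEV parameters (location xi, scale
    alpha, shape k) from L-moments, with distribution function
    F(x) = exp(-(1 - k*(x - xi)/alpha)**(1/k))."""
    c = 2.0 / (3.0 + t3) - log(2) / log(3)
    k = 7.8590 * c + 2.9554 * c * c
    alpha = l2 * k / ((1 - 2.0 ** (-k)) * gamma(1 + k))
    xi = l1 - alpha * (1 - gamma(1 + k)) / k
    return xi, alpha, k
```

For a regional fit, the at-site L-moment ratios are replaced by record-length-weighted regional averages, as in Hosking and Wallis (1997).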
In the literature there are many techniques to evaluate distributional assumptions (e.g.
Hosking, 1990; Chowdury et al., 1991; Laio et al., 2009; Haddad and Rahman, 2011; see
also section 2.2.4 for more references). By using a range of methods from this literature,
as follows, it was found that the GEV distribution is quite appropriate to approximate the
annual maximum floods in Australia on a state-by-state basis: (i) the L-moment diagram
and L-moment goodness-of-fit test (i.e. ZDIST), (ii) the AD goodness-of-fit Monte Carlo test
(the details relating to these goodness-of-fit tests are provided in Appendix D) and (iii)
frequency plots of the fitted and observed data based on L-moments.
With the ZDIST test a fit is declared adequate if ZDIST is sufficiently close to zero, a
reasonable criterion being |ZDIST| ≤ 1.64. The AD test results are reported as P-values at a
significance level of 5%; hence a value of P > 0.95 suggests that the hypothesis of the
particular distribution being the parent is not supported.
7.5.2 GOODNESS-OF-FIT TEST RESULTS
Figure 41 shows the L-skewness (LSK) vs. L-kurtosis (LKT) plots for the annual maximum
flood data for the states of NSW and QLD (how the average values for the states were
obtained is described in Appendix D). The plots also show the theoretical curves for some
common 2- and 3-parameter distributions: normal (NORM), 2-parameter log-normal (LN),
gamma (GAM), extreme value type 1 (EV1), uniform (UNIF), GEV, Pearson type 3 (P3),
generalised logistic (GLO), generalised Pareto (GPA) and 3-parameter log-normal (LN3).
The L-skewness vs. L-kurtosis plots for the other states can be seen in Appendix C.
From Figure 41 it is evident that the distributions of annual maximum flood series data for
NSW and QLD come from different parent distributions (the regional average LSK vs. LKT
points, shown in red, fall on different theoretical curves, i.e. GPA and P3, respectively). This
difference was seen for all the states where the annual maxima cannot be fully described by
one single distribution. The summarised results are shown in Table 35, which presents a
mixed picture. It can be seen that the L-moment diagrams sometimes provide different
outcomes from those of the ZDIST and AD tests, making it harder to determine a single
outstanding distribution. For example, for TAS, both the L-moment diagram and the ZDIST
test select the P3 distribution; in contrast, the AD test selects the GLO distribution. The
difference in results may be attributed to sampling
variability and the fact that different tests examine different aspects of the goodness-of-fit
of a candidate distribution. However, what does stand out from Table 35 is that the
distributions selected most often across all the tests are the GPA, GEV and P3. To reach a
more informed conclusion, these three distributions were fitted and superimposed on the
standardised data (data from individual sites standardised by the mean, as in the index
method) for each state and then visually inspected. Figure 42 illustrates these plots for WA
and TAS. Sample plots are also shown for NSW and VIC in Appendix C.
Figure 41 L-moment ratio diagrams of annual maximum flood data for NSW and QLD
Table 35 Summary of goodness-of-fit tests for determining parent distribution

        DISTZ statistic                  L-moment   AD test
State   GLO   GEV   LN3   P3    GPA      diagram    GLO   GEV   LN3   P3    GPA
NSW     9.4   7.2   3.9   -1.8  0.21     GPA        1.0   1.0   1.0   0.98  0.78
QLD     12.6  9.8   6.0   -0.4  0.94     P3         1.0   1.0   1.0   0.32  0.92
VIC     10.3  6.3   3.4   -1.5  4.3      P3         1.0   1.0   1.0   0.97  1.0
TAS     6.6   3.74  2.7   0.8   -3.0     P3         0.88  1.0   1.0   1.0   1.0
WA      2.4   0.22  -2.9  -8.2  -6.5     GEV        1.0   1.0   1.0   1.0   1.0
NT      3.0   0.6   -0.9  -3.6  5.5      GEV        1.0   1.0   1.0   1.0   1.0
SA      7.0   4.8   3.4   0.9   -0.9     GPA        1.0   1.0   1.0   0.90  0.89
Based on the visual inspection of Figure 42 and the figures in Appendix C, the GEV and P3 appear to be good candidates to describe the AMFS data for the different Australian states. While all the distributions fit the lower end quite well, the GEV and P3 distributions seem to capture the higher flows much better than the GPA distribution. Given that the LFRM uses the top 5 maxima, it is far better to adopt a distribution that can extrapolate relatively well into the higher flow range without showing too much bias in the extrapolation. Finally, based on the GEV and P3 distributions, the median relative error (MRE = (fitted - observed)/observed, expressed as a percentage) was calculated for each of the states to determine whether the fitted distribution under- or over-estimated the observed values. Table 36 summarises the MRE values for the different states and suggests that the GEV distribution provides the minimum bias compared to the P3 for most of the states. While the differences are not large (e.g. for NSW), the MRE provides some guidance, along with the other results, on choosing a distribution for use with the LFRM. Hence, based on all these results, it can be argued that the GEV distribution can be taken as the best-fit distribution in this application.
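The MRE comparison follows directly from its definition; a small sketch is given below (the quantile values are hypothetical):

```python
import statistics


def median_relative_error(fitted, observed):
    """MRE in percent: median of (fitted - observed) / observed * 100."""
    return statistics.median(
        (f - o) / o * 100.0 for f, o in zip(fitted, observed)
    )
```

A negative MRE indicates that the fitted distribution under-estimates the observed quantiles on balance, as for the P3 fit in TAS and WA in Table 36.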
Table 36 Summary of MRE associated with the GEV and P3 distributions

                      Median relative error (%)
State / Distribution  GEV     P3
NSW                   0       -1
QLD                   -1      0
VIC                   -1      -2
TAS                   0       -10
WA                    0       -8
NT                    -0.3    15
SA                    -1      0
[Figure 42 consists of two panels (WA and TAS) plotting the standardised data against ARI (years, log scale from 1 to 10000), with the observed data and the fitted GEV, GPA and P3 distributions superimposed.]
Figure 42 Visual inspection of distributional fit for GEV, GPA and P3 distributions for WA and TAS
7.6 HOMOGENEITY
7.6.1 HOMOGENEITY TEST OF HOSKING AND WALLIS
In identifying homogenous groups from a large number of sites, a balance needs to be maintained between selecting a reasonably sized group with more information but a lower degree of homogeneity, and a smaller group with less information but a greater degree of homogeneity. The aim of this balancing act is to make the best use of the information available given the trade-off between group size and the degree of homogeneity achieved. It remains the case, however, that a small group showing good homogeneity may not be appropriate for use in a RFFA study such as the LFRM, as small groups may not be able to provide statistically meaningful results.
Also, the heterogeneity measure (H statistic) of Hosking and Wallis (1993) used here (as explained later) to measure homogeneity has a tendency to give a false impression of homogeneity for small regions (further discussion can be found in Hosking and Wallis, 1993). Nonetheless, there do not appear to be any strict rules or guidelines on the minimum number of sites required to define a homogenous group or region. It is worth mentioning that the homogeneity assumption is often used explicitly with the index flood and similar methods. A number of sites forming a homogenous group means that the underlying probability distribution of the standardised flood variable is the same for all the sites, allowing for sampling variability, which implies that the standardised annual maximum flood series for the sites are samples from the same population. Given that this LFRM study is largely based on the station-year method and that the LFRM makes use of the top five maxima from each site in the region, homogeneity may not be a strict prerequisite here. However, having a homogenous region is advantageous, as this would certainly reduce the model error inherent in the regional model and would give more accurate flood estimates applicable to the region of interest. In this section, two homogeneity tests are applied: (i) the heterogeneity measure of Hosking and Wallis (1993) and (ii) the bootstrap AD test. A brief explanation of each of these tests is given below, followed by the results of each method applied to each state of Australia. The details relating to the homogeneity test of Hosking and Wallis (1993) are provided in Appendix D.
7.6.2 THE BOOTSTRAP ANDERSON-DARLING HOMOGENEITY TEST
A test that does not make any assumptions on the parent distribution is the AD rank test
(Scholz and Stephens, 1987). The AD test is the generalisation of the classical Anderson-
Darling goodness-of-fit test (e.g. D’Agostino and Stephens, 1986), and it is used to test the
hypothesis that k independent samples belong to the same population without specifying
their common distribution function. The details relating to the homogeneity test based on
the AD statistic are provided in Appendix D.
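The test can be sketched as follows: compute the k-sample AD statistic on the observed samples, then approximate its null distribution by repeatedly re-partitioning the pooled record (a permutation-style bootstrap). This is a minimal illustration assuming continuous data (no tie correction), not the nsRFA implementation used later in this chapter:

```python
import random


def ad_k_sample(samples):
    """k-sample Anderson-Darling statistic (after Scholz and Stephens, 1987),
    continuous version without the tie correction."""
    pooled = sorted(v for s in samples for v in s)
    n_tot = len(pooled)
    a2 = 0.0
    for s in samples:
        nj = len(s)
        ss = sorted(s)
        idx, inner = 0, 0.0
        for i in range(1, n_tot):  # i = 1 .. N-1
            z = pooled[i - 1]
            while idx < nj and ss[idx] <= z:
                idx += 1  # idx = number of obs in this sample <= Z_(i)
            inner += (n_tot * idx - i * nj) ** 2 / (i * (n_tot - i))
        a2 += inner / nj
    return a2 / n_tot


def bootstrap_ad_probability(samples, n_boot=200, seed=1):
    """Non-exceedance probability of the observed statistic under the
    bootstrap null (random re-partitions of the pooled record).
    Values close to 1.0, as reported in Table 37, indicate heterogeneity."""
    rng = random.Random(seed)
    obs = ad_k_sample(samples)
    pooled = [v for s in samples for v in s]
    sizes = [len(s) for s in samples]
    below = 0
    for _ in range(n_boot):
        rng.shuffle(pooled)
        parts, start = [], 0
        for n in sizes:
            parts.append(pooled[start:start + n])
            start += n
        if ad_k_sample(parts) < obs:
            below += 1
    return below / n_boot
```

Two completely separated samples give a probability near 1.0 (heterogeneous), while samples drawn from one population give a much smaller value.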
7.6.3 TESTING FOR HOMOGENEITY – RESULTS
The method proposed by Hosking and Wallis (1993) and the approach based on the bootstrap AD test (D'Agostino and Stephens, 1986; Laio, 2004), as discussed above, were used to measure the degree of heterogeneity in each Australian state. In applying the procedure, 1000 homogenous regions were simulated (i.e. Nsim = 1000), and the heterogeneity measures H(1), H(2) and H(3) were computed using a FORTRAN program developed by Hosking (1991a). The heterogeneity measure AD was calculated using the nsRFA package in the R statistical software environment.

It was hypothesised that the different states of Australia are separate regions, and the testing was carried out on this particular assumption. The obtained H and AD values are given in Table 37.
Table 37 Summary of heterogeneity measures for the Australia states
Heterogeneity measures
State H(1) H(2) H(3) AD
NSW 14 10 5.7 1.0
QLD 17 13 8.2 1.0
VIC 22 14 7.6 1.0
TAS 26 10.5 4.6 1.0
WA 21 12 5 1.0
NT 9 6.1 4.9 1.0
SA 11 8 3.5 1.0
From Table 37 it can be clearly seen that all the states are "definitely heterogeneous", as all the H statistics are much greater than 2. The AD test supports this result, with all the P-values being 1.0, indicating that homogeneity is not supported at a test significance level of 5%. While there were discordant sites in the analysis, they were not removed: the successful development of the LFRM depends on a large number of sites, and these sites may contain useful information capturing significant regional variability required for the LFRM. One important aspect to keep in mind with the results in Table 37 is that Australian hydrology is quite variable even from state to state, and catchments even in close proximity to each other can have quite different physical, topographical and meteorological features; hence, obtaining homogenous regions is quite difficult. Similar results have been found in previous studies of Australian flood data (e.g. Bates et al., 1998; Rahman et al., 1999; Haddad, 2008; Ishak et al., 2011).
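The H(1) measure underlying Table 37 compares the observed between-site dispersion of the L-CV values with that expected for a homogenous region. A condensed sketch of the final step is given below; in the full Hosking and Wallis (1993) procedure the simulated dispersions come from Nsim = 1000 regions generated from a fitted kappa distribution, which is replaced here by a user-supplied list:

```python
import math
import statistics


def lcv_dispersion(lcvs, record_lengths):
    """V statistic: record-length-weighted standard deviation of the
    at-site L-CV values in the region."""
    w_total = sum(record_lengths)
    w_mean = sum(n * t for n, t in zip(record_lengths, lcvs)) / w_total
    return math.sqrt(
        sum(n * (t - w_mean) ** 2 for n, t in zip(record_lengths, lcvs))
        / w_total
    )


def h_statistic(v_observed, v_simulated):
    """H = (V - mean(Vsim)) / stdev(Vsim); Hosking and Wallis suggest
    H >= 2 indicates a 'definitely heterogeneous' region."""
    return (v_observed - statistics.mean(v_simulated)) / statistics.stdev(v_simulated)
```

With widely scattered at-site L-CVs, the observed V sits far above the simulated values and H exceeds 2, as found for every state in Table 37.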
Unfortunately, the two homogeneity tests referred to above may be of limited relevance to the estimation of large to rare floods. A point should be highlighted to clarify this statement. The tests referred to above are based on the overall fit of a distribution at different sites (sample statistics tests). In contrast to other index flood methods, which use all the available AMFS data, the LFRM concept uses the 5 highest standardised values to derive the regional growth curve. Therefore, the tests used here give little direct information on the homogeneity of the data points used to fit the upper right-hand tail of a regional distribution. For the purpose of this analysis, it was found that there is insufficient evidence to reject the assumption of homogeneity of the largest values in the regional sample. The Lu and Stedinger (1992) test, which compares the 1-in-10 annual exceedance probability quantiles determined from a GEV distribution, may be used to assess homogeneity in the upper tail of the distribution; however, it was not applied in this study.
7.7 DEVELOPMENT OF THE LFRM MODEL FOR AUSTRALIAN FLOOD
DATA
The LFRM model allows for the estimation of large to rare flood quantiles for any site in a
region by exploiting flood data from other gauged sites in the region. The LFRM is based
on the assumption that the standardised maximum values of the annual maximum flood
series from a large number of individual sites in a region can be pooled (after standardising
to allow for the across-sites variations in the mean and CV values of the annual maximum
floods) (Majone et al., 2007). The particular advantage of the LFRM is that, in contrast to
the commonly applied “index flood method”, it does not assume a constant CV across the
sites. This feature, in particular, allows the LFRM to pool data more effectively over a very
large region to allow estimation of large floods. An advantage of the LFRM proposed here
is that it offers an alternative to traditional approaches of large flood estimation methods
based on rainfall runoff models, where time and resource constraints may not permit the development of detailed rainfall-based methods. Moreover, there is no guarantee that rainfall-based methods provide the best possible estimates.
The main focus of the next few sections is to further develop the LFRM by (i) coupling it
with a spatial dependence model that reflects the reduction in the net information available
in regional analysis using spatially dependent data (Nandakumar et al., 2000); (ii) pooling
more data by taking the top 3-5 maximum values in a region; and (iii) combining it with
BGLSR and the region of influence (ROI) approach to develop regional prediction
equations so that the LFRM can be applied to ungauged catchments. Points i, ii and iii are
in essence the main innovations of the LFRM model being presented in this chapter and
Chapter 8 of the thesis.
7.7.1 DEVELOPMENT AND CALIBRATION OF THE LFRM MODEL
The selected Qmax(1, 3 and 5) (i.e. the top 1, 3 and 5 maximum data points from each station's AMFS data, referred to as Qmax) are first standardised by the at-site average (mean) of the AMFS data, and then plotted in the (CV, Qmax/mean) plane. Figure 43 shows such a plot for the study data set, consisting of 626 data points (1 max), 1878 data points (3 max) and 3130 data points (5 max) from 626 sites, which suggests the following relationship:

Qmax/mean = c + a(CV)^b    (7.1)

The coefficients (c, a and b) of Equation 7.1 were estimated by the maximum likelihood approach for each of the plots in Figure 43. The estimated coefficients, along with their R2 values, are provided in Table 38.
Table 38 Coefficients of non-linear interpolation from Figure 43

Max (number of highest data
points from the AMFS)      c    a     b     R2 (%)
1                          1    3.25  1.37  87
3                          1    2.34  1.18  75
5                          1    1.85  1.03  71
The R2 values in Table 38 suggest that the estimated coefficients provide a reasonably good fit to the experimental data; this is most evident when pooling the top 1 AMFS values. When pooling the top 3 and 5 maxima, greater scatter is noticed, as can be seen in Figure 43; this is also reflected in the drop in R2 values. An important question is whether the weaker relationship with CV is compensated for later on by the additional data points available to define the lower end of the distribution. What can be observed from Table 38 is that the exponent b is appreciably greater than unity for 1 max and 3 max (as would be the case for a Gumbel distribution) and decreases notably with the pooling of more data.
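The coefficients in Table 38 were estimated by maximum likelihood; a simpler least-squares stand-in, with c fixed at 1 as in Table 38, linearises Equation 7.1 as log(Qmax/mean - 1) = log a + b log CV and fits an ordinary regression. This is a sketch for illustration, not the estimation code actually used:

```python
import math


def fit_power_law(cv_values, y_values, c=1.0):
    """Fit Q_max/mean = c + a * CV**b by OLS on the log-linearised form.

    Returns (a, b). Assumes every y exceeds c so the logarithm is defined.
    """
    xs = [math.log(v) for v in cv_values]
    ys = [math.log(y - c) for y in y_values]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = math.exp(y_bar - b * x_bar)
    return a, b
```

Data generated exactly from the 1-max coefficients of Table 38 (a = 3.25, b = 1.37) are recovered by this fit.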
[Figure 43 consists of three panels (max of 1, max of 3 and max of 5) plotting Qmax/mean against CV(Q) for CV(Q) ranging from 0 to 2.5, with the fitted non-linear interpolation function superimposed on the scatter.]
Figure 43 Scatter of Qmax/mean data in the (CV(Q), Qmax/mean) plane and non linear interpolation
function
Based on Figure 43, and assuming that a large part of the scatter can be explained by variations in the average recurrence interval (ARI) of the AMFS data, the best way to model the scatter is to search for a LFRM function of the form:

Qmax/mean = c + f(ARI)(CV)^b    (7.2)

where it is assumed that f(ARI) is a function of the ARI only and is substituted for the coefficient a. From Equation 7.2, the calibration procedure is based on the introduction of a new standardised variable, defined by:
Ymax = (Qmax/mean - c) / (CV)^b    (7.3)

where c and b are the coefficients corresponding to the number of annual maxima pooled (e.g. 1, 3 or 5).
This form of standardisation (Equation 7.3) takes into account not only differences in the mean values but also differences in the CV, raised to the power appropriate for the specific regional data set. As expected, as a result of this new standardisation, Ymax is practically uncorrelated with the coefficient of variation, as confirmed by the very small R2 values in the plots of Figure 44, which refer to the same set of data points using the top 1 and 5 annual maxima.
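In code form, the standardisation of Equation 7.3 is a one-liner; the default coefficients below are the 5-max values from Table 38 and are an assumption for illustration:

```python
def standardise_ymax(q_max, site_mean, site_cv, c=1.0, b=1.03):
    """Equation 7.3: Y_max = (Q_max/mean - c) / CV**b.

    Defaults c = 1 and b = 1.03 correspond to the 5-max row of Table 38.
    """
    return (q_max / site_mean - c) / site_cv ** b
```

Applying this to every pooled maximum removes the dependence on CV seen in Figure 43, leaving the near-zero slopes of Figure 44.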
[Figure 44 consists of two panels plotting Ymax against CV(Q). The fitted linear interpolation functions are: 1 max: Ymax = -0.037CV(Q) + 3.28 (R2 = 0.0003); 5 max: Ymax = 0.161CV(Q) + 1.67 (R2 = 0.0037).]
Figure 44 Scattering of Ymax data in the (CV(Q), Ymax) plane and linear interpolation function for the
pooling of 1 (1 max) and 5 (5 max) top maxima
The following plotting position formula (Equations 7.4, 7.5 and 7.6), proposed by Majone and Tomirotti (2004), was applied to estimate the ARI, or the empirical non-exceedance frequency, of each of the Ymax values in the pooled data sets (i.e. max of 1, 3 and 5) from the N = 626 sites.

In order to define the form of the distribution of the variable Ymax, the top 1, 3 and 5 annual maxima values of each site's data were used. The major assumption made here is that the i-th value of the series is independent of the others and that the normalised values obtained by applying Equation 7.3 (after standardising by the mean and coefficient of variation) belong to the same population. Hence the plotting position of Ymax can be provided by the following empirical equation (Majone and Tomirotti, 2004):

P(Ymax ≤ y) = [P(Y ≤ y)]^na    (7.4)

where Equation 7.4 gives the probability that the maximum Y in a set is at most y as the probability of an individual observation being at most y, raised to the power of the number of observations in the set, and na denotes the average sample size of all the at-site AMFS data utilised in the analysis (na ≈ 34 for this study).
Now, sorting the pooled normalised Ymax values (N = 626, 1878 or 3130 data points, based on the number of annual maxima pooled) in decreasing order, the value y corresponding to ARI (return period, T years) has the following position (or rank) m in the ordered sample:

m = N[1 - P(Ymax ≤ y)] = N[1 - P(Y ≤ y)^na] = N[1 - (1 - 1/T)^na]    (7.5)

For easier interpretation in terms of ARI, this can be rewritten as:

ARI = 1 / (1 - (1 - m/N)^(1/na))    (7.6)

where m is the rank of the observation in the pooled N, 3N or 5N Ymax data (i.e. 626, 1878 or 3130 data points), na is the average sample size and N the number of sites (assumed to be independent in terms of maximum observed floods). From this definition, the estimated ARI values may ideally be assumed to be representative of actual return periods. However, this may not be the case for the Australian flood data set, as many of the gauging sites are very close together spatially and temporally (see Figures 16 and 37), and hence there would be significant inter-site dependence within the observed AMFS data. Sections 7.8 and 7.9 look at this issue in more detail with the development of a spatial dependence model to correct for the effective number of sites (Ne), which is currently assumed to equal N in Equations 7.4, 7.5 and 7.6.
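Equations 7.5 and 7.6 are exact inverses of each other, which can be checked numerically; the sketch below uses this study's N = 626 and na ≈ 34:

```python
def rank_from_ari(ari, n_sites, n_avg):
    """Equation 7.5: m = N * (1 - (1 - 1/ARI)**na)."""
    return n_sites * (1.0 - (1.0 - 1.0 / ari) ** n_avg)


def ari_from_rank(m, n_sites, n_avg):
    """Equation 7.6: ARI = 1 / (1 - (1 - m/N)**(1/na))."""
    return 1.0 / (1.0 - (1.0 - m / n_sites) ** (1.0 / n_avg))
```

With N = 626 and na = 34, the largest pooled value (m = 1) plots at an ARI of roughly 21,000 years, which is why the spatial dependence correction of Sections 7.8 and 7.9 matters: treating dependent stations as independent overstates these ARIs.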
The plot of Ymax vs. YT, where YT is the Gumbel reduced variate used as a surrogate for ARI (YT = -ln[-ln(1 - 1/T)]; a table of Gumbel variate values corresponding to ARIs is given in Appendix D), is shown in Figure 45 for the N, 3N and 5N data sets. The plots for the 3N and 5N data sets in Figure 45 are in line with what would be expected from using the additional data points. Clearly, using a greater number of maxima, e.g. 5 maxima, provides a very smooth empirical distribution that is fitted closely by the distribution function. These plots also reveal that the experimental data can be approximated by a second degree polynomial function of YT, as given by Equation 7.7, whose model coefficients and R2 values can be seen in Table 39 for the different poolings of the annual maxima (i.e. top 1, 3 and 5 maxima):
Ymax = C1(YT)^2 + C2(YT) + C3    (7.7)

which in terms of Qmax/mean takes the following form:

Qmax/mean = c + (C1(YT)^2 + C2(YT) + C3)(CV)^b    (7.8)
Equations 7.7 and 7.8 yield the analytical expression of the LFRM model for the study data
set using the top 1, 3 and 5 annual maxima, where the appropriate values of the coefficients
in Table 39 are substituted into Equations 7.7 and 7.8. However, this formulation does not
allow for the effect of the inter-site dependence which in essence reduces the net
information available in any regional analysis (Nandakumar et al. 1997 and 2000). This can
be accounted for through the use of a spatial dependence model. The basic theory of inter-
site dependence and determining inter-site dependence are provided in this chapter (the
next few sections) while the development of a general spatial dependence model is
discussed and presented in Chapter 8.
Table 39 Coefficients and R2 values of the Ymax polynomial interpolation from Figure 45

Number of maxima
pooled (Ymax)    C1      C2    C3     R2
1                -0.027  0.80  0.49   0.997
3                -0.041  0.98  -0.18  0.998
5                -0.044  1.07  -0.59  0.999
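Equations 7.7 and 7.8 with Table 39 give a closed-form quantile estimator. The sketch below assumes the 5-max coefficients (C1, C2, C3 from Table 39; c = 1 and b = 1.03 from Table 38) and does not yet include the spatial dependence correction:

```python
import math


def gumbel_reduced_variate(ari):
    """Y_T = -ln(-ln(1 - 1/T))."""
    return -math.log(-math.log(1.0 - 1.0 / ari))


def lfrm_quantile(ari, site_mean, site_cv,
                  coeffs=(-0.044, 1.07, -0.59), c=1.0, b=1.03):
    """Equation 7.8: Q_max = mean * (c + (C1*YT^2 + C2*YT + C3) * CV**b)."""
    c1, c2, c3 = coeffs
    yt = gumbel_reduced_variate(ari)
    return site_mean * (c + (c1 * yt ** 2 + c2 * yt + c3) * site_cv ** b)
```

For example, a hypothetical site with mean 100 m3/s and CV = 1 gives a 100-year estimate of about 440 m3/s under these coefficients.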
Figure 45 Frequency distribution of the standardised Ymax values
7.8 EFFECTS OF INTER-SITE DEPENDENCE ON THE LFRM MODEL
This section presents the effects of inter-site dependence on RFFA in general; however, the
major aim is to develop a spatial dependence model to be used in application with the
LFRM concept being applied in Chapter 8 of the thesis.
As stated in sections 7.3 and 7.7.1, spatial dependence in AMFS data reduces the net
information available in any RFFA data set. Accordingly, the presence of spatial
dependence results in biased quantile estimates, because of the reduced number of
independent stations when the effects of inter-site dependence are considered.
This section begins with a brief introduction of the effective number of independent
stations concept (Ne). The estimation methods of Ne are then described. Finally, models for
the estimation of Ne are developed. The application of Ne with the LFRM using a
comprehensive Australian AMFS dataset is provided in Chapter 8.
7.8.1 EFFECTIVE NUMBER OF INDEPENDENT STATIONS
The effective number of independent stations concept was introduced to quantify the effects of inter-site dependence (also called spatial correlation) on regional estimates of flood frequency distribution parameters. The value of Ne depends on which specific distributional parameter is being estimated. The estimation of Ne is usually based on two broad approaches: (i) methods that use some form of regional average parameters; and (ii) methods that pool annual maxima data. In the following sections, approach (ii) is discussed in more detail; further information on approach (i) can be found in Alexander (1954), Stedinger (1983), Hosking and Wallis (1988) and Nandakumar et al. (1997).
In the RFFA approaches which consider pooling the standardised AMFS data from several sites, time sampling is assumed to be substituted by space sampling. If the spatial data were independent, each maximum value in the pooled data set could be assigned a plotting position computed from the aggregated period of record (the total record length L = N·na). This is often referred to as the "station-year method". However, the effective record length (Le) is invariably smaller than the total number of AMFS data points in the pooled database because of the presence of spatial correlation.
The effective record length of the pooled data set determines the position of the observed
annual maxima on a probability plot i.e. the associated frequency/ARI. Thus, the effective
number of stations for this approach can be defined such that Ne independent stations
should provide the same record length as N spatially dependent stations. Thus, Ne is defined as the ratio of Le and the average record length over all the stations (na):

Ne = Le / na    (7.9)
As Ne determines the position of a data point (in the pooled annual maxima data set) on a
probability plot, any error in this measure of spatial dependence in the AMFS data would
introduce a bias in the final flood quantile estimates.
7.8.2 REGIONAL MAXIMUM FLOOD AT A NETWORK OF SITES - REGIONAL
MAXIMUM AND TYPICAL CURVES
This section begins with the analysis of the AMFS data observed at one or more networks
of sites. A network corresponds to gauged sites; however, in application, the network can
be any group of sites for which a large flood estimate is sought.
Let us visualise a hypothetical network consisting of four sites, all with the same period of
records (i.e. satisfying the concurrent record length criterion). The maximum flood data
points from each of these four sites can be pooled to form a series of the “maximum of 4”.
The data points should be standardised (e.g. by Equation 7.3) before the largest values are
picked so as to give each of the sites an equal chance of providing the maximum values to
the new maxima series.
Once the annual maxima series is constructed, statistical techniques such as L-moments are
used to fit a “regional maximum of 4” flood growth curve. Dales and Reed (1989) state “it
is difficult to devise a terminology that encapsulates the general meaning of these growth
curves without being clumsy”. The typical curve is an average standardised point flood
growth curve for a particular geographical region, which is produced by averaging the
parameters of the distributions fitted to individual sites. The regional maximum curve is a
standardised flood growth curve associated with the maximum flood experienced at a
network of N sites located within a geographical region. The term “regional maximum” is
used to highlight that the ‘maximum data series used here’ is over space rather than time
(see section 7.8.1 for more details) and it can also be thought of as “network maximum”.
However, as will be shown in the later sections it is of interest to consider generalised
networks of sites within a given geographical region and it was for this reason that the
terminology “regional maximum” was finally adopted by Dales and Reed (1989). Further
information can be read in Dales and Reed (1989).
7.8.3 FACTORS INFLUENCING THE REGIONAL MAXIMUM
The regional maximum growth curve as defined above for N sites (N > 1) is expected to lie above the typical regional growth curve. An exception arises when sites are closely grouped together, as is the case when there is perfect correlation between the annual maxima of the individual sites. The position of the regional maximum growth curve in relation to the typical growth curve is influenced by the number of sites in the region in question, the scattering of the sites, the system inputs (e.g. rainfall, baseflow, evaporation and other meteorological factors) and outputs, and the physical nature of the catchments in the region. Many classifications can be used to gain an understanding of the regional maximum curve in relation to the typical curve; in this study the major influences are indexed by the number of sites N, the region being analysed and the average correlation coefficient between the sites in a region or network.
7.8.4 NUMBER OF SITES, N
For a given network within a region, the magnitude of the regional maximum growth curve would clearly depend on the number of sites, N, from which it is drawn. For example, 8 sites in a network of the Australian AMFS dataset capture a reasonable average concurrent record length (i.e. 18 years), as shown in Figure 46. As the network size N increases, e.g. to N = 32, the average concurrent record length decreases, so the required site-to-site variations in the network are not picked up, which makes such a network unsuitable for deriving a regional maximum growth curve. The "regional maximum of 8" growth curve would therefore lie above the "regional maximum of 4" growth curve for a given network within a specific region. Hence, the maximum network sizes used in this study are taken to be N = 2, 4 and 8 to define the regional maximum and typical growth curves. It should be noted that the above comments are relevant to the proposed methodology for deriving the regional maximum and typical curves. For application of the LFRM there
is no need for the series at different sites to be concurrent, as long as the assumption of
stationarity is satisfied.
7.8.5 CROSS CORRELATION
The position of the regional maximum growth curve in relation to the typical growth curve
is governed by the degree of cross correlation between the individual site’s AMFS data.
This cross correlation may be highly variable for different paired sites/gauges, and hence
dependence between sites can be seen in terms of an inter-site correlation-distance
relationship. Figure 39 shows this sort of relationship for 131 gauging sites in the state of
VIC. It can be seen that there are significant correlations even at greater distances for the
VIC data, which implies that there is indeed notable spatial dependence present. This was
observed for all the states of Australia.
While correlation is a useful index for measuring dependence between the AMFS data at two sites, it is also a relatively useful measure of dependence when looking at a group of N sites. It seems logical that correlation as a measure of dependence needs to be developed for a particular region, or a network of sites within a region. In this study, the mean value is selected as the representative correlation for a region.
definition of a typical region for analysis in this context is given below in section 7.8.7.
[Figure 46 is a bar chart of average concurrent record length (years) against network size: 29 years for N = 2, 23 for N = 4, 18 for N = 8, 16 for N = 16 and 11 for N = 32.]
Figure 46 Average concurrent record lengths for different network sizes
7.8.6 DEFINITION OF A REGION FOR ANALYSIS
Consider a region in which N = 2, 4 or 8 site/gauge networks could be picked. Obviously
there are many ways in which these networks could be selected. For this analysis, an
extensive experiment is required to establish a measure of the typical degree of dependence
in networks of size of N = 2, 4 and 8. In the experiment, each state was considered as a
single region, except NSW, VIC and QLD which were combined into one region as the
stations in these states form a contiguous region in geographical space.
7.8.7 METHODS OF SAMPLING REGIONAL MAXIMA
In this analysis it was necessary to adopt a flexible approach to sampling regional flood
maxima for different network sizes within a specified region. Here, three distinct methods
or experiments were adopted: (i) ROI network method, (ii) random ROI network method
and (iii) a totally random network method. It should be noted here that the main aim is to
establish a “regional maximum of N” growth curve which can be associated with a given
network size and region, and which can be considered representative of the flood region
under study. It is assumed in the following explanations that the floods for each gauged site
have been standardised according to Equation 7.3. A brief explanation of the experiments
undertaken is given in the following sections.
The real data (i.e. the Australian AMFS dataset used here) has issues relating to sampling variability and homogeneity; with this in mind, simulated data was also generated and used in the experiments, which provides control over sampling variability and homogeneity in the investigation. More detail about the generated dataset is given in section 7.10.
7.8.8 ROI AND RANDOM ROI NETWORK METHODS
In this case, a focal point (i.e. a streamflow gauging site) is established in a region. Once
this is selected, a network of N gauges is chosen based on the closest N sites (the distance
criteria used for the ROI is based on geographical distance) to the focal point (more detail
about the ROI approach can be seen in section 3.7). Once selected, the regional maxima are
formed for those years for which N gauges have valid annual maxima. The GEV
distribution is fitted to the regional maximum series. This procedure is repeated for every site in the region, yielding a different regional maximum curve each time. A regional average curve was determined for each network in the same way. For the random ROI network method, a focal point is established in the region; once selected, the closest 20 stations to the focal point are pooled and a network of N gauges is selected randomly from the 20 sites. The rest of the steps are as presented for the ROI network approach.
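The ROI selection step can be sketched as follows (the coordinates and site identifiers are hypothetical; as noted above, the study's ROI distance criterion is geographical distance):

```python
import math


def roi_network(focal_xy, gauges, n):
    """Return the IDs of the n gauges nearest the focal point.

    gauges maps a site ID to its (x, y) coordinates.
    """
    def distance(site_id):
        x, y = gauges[site_id]
        return math.hypot(x - focal_xy[0], y - focal_xy[1])

    return sorted(gauges, key=distance)[:n]
```

For the random ROI variant, the 20 nearest sites would first be taken with roi_network(focal, gauges, 20), and a random subset of size N drawn from them.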
7.8.9 THE TOTAL RANDOM NETWORK METHOD
The random method can be considered to be more flexible in that a different set of N sites
can be selected at each iteration for the region under consideration. If not all the sites in an
iteration have valid annual maximum flood data, a further random set of N sites is selected.
Because of the random nature of the method, it is desirable to carry out a number of
repetitions and to average the results, which is what was done in this study.
7.8.10 COMPARING SAMPLING METHODS
The main differences between the sampling approaches are that:

(i) The ROI and random ROI network methods give more information about the variability within a region and are more useful when investigating small networks which are highly correlated. It is also noted that the ROI networks would tend to bias the networks towards high correlation values.

(ii) The 'total random network' method is more likely to make use of longer record lengths; if one of the N sites does not have an annual maximum flood value for the years in question, another set of sites is selected instead. Importantly, the total random network approach averages the results over the region in a more statistically meaningful manner than the ROI network method and is likely to sample over a broader range of correlation values.
In all, 8,292 experiments were carried out on real and simulated datasets. The results
associated with the experiments above are discussed in detail in Chapter 8.
It should be remembered that the above approach is adopted with the aim of providing a reasonable inference on spatial dependence. Spatial dependence between annual maximum floods can be complicated by differing response characteristics of catchments. However, it
is generally accepted that physical differences between catchments become less influential
at higher return periods.
7.9 MEASURES OF Ne – EFFECTIVE NUMBER OF INDEPENDENT
STATIONS
The main objective of this study is to assess the degree of spatial dependence in annual
maximum floods, so that this can be taken into account with the LFRM model. In most
cases, some generalisation must be achieved so that these assessments can be made for
networks and ungauged sites. Generalising spatial dependence using a spatial dependence
model is discussed in Chapter 8. As a precursor to defining a spatial dependence model one
must explore ways in which the “regional maximum and typical growth curves” can be
compared for the flood data used (both real and simulated). Given the high number of
experiments carried out, the use of a summary index by which the regional maximum
curves can be related to their typical curve counterparts is also explained.
Three such indices that may be considered are the epicentrage coefficient (Galea et al., 1983), Buishand's dependence function method (Buishand, 1984) and the effective number of independent stations (Dales and Reed, 1989; Nandakumar et al., 1997, 2001). This study concentrates on the 'effective number of independent stations' concept.
7.9.1 EFFECTIVE NUMBER OF INDEPENDENT STATIONS, Ne
An alternative approach to indexing the position of the regional maximum curve relative to
the typical curve is to examine their horizontal separation on a Gumbel probability plot,
indexing this by an effective number of independent stations (Dales and Reed, 1989), Ne.
Consider the AMFS for N gauges (stations/sites) from a homogeneous region, so that these
are identically distributed as Ft(x). Ft(x) is the distribution function of the typical growth
curve. Thus:
F_t(x) = prob(X_1 ≤ x) = prob(X_2 ≤ x) = … = prob(X_N ≤ x) (7.10)
If there is spatial independence, i.e. if the AMFS data at the N gauges are entirely
independent, the distribution of the regional maximum floods of the N gauges is given
simply by:
F_r(x) = prob(max(X_1, X_2, …, X_N) ≤ x) = [F_t(x)]^N (7.11)
If, however, there is complete dependence (i.e. perfect correlation between the stations' AMFS), the distribution function for the regional maxima would be:

F_r(x) = F_t(x) (7.12)
In real-world problems there will always be partial dependence, and the degree of dependence will vary at different quantiles, x. This is recognised by defining an effective number of independent stations, Ne(x), such that:

F_r(x) = [F_t(x)]^Ne(x) (7.13)
Thus:
Ne(x) = ln F_r(x) / ln F_t(x) (7.14)
and
ln Ne(x) = ln(-ln F_r(x)) - ln(-ln F_t(x)) (7.15)
It is readily seen that ln Ne(x) is the horizontal separation of the regional maximum and typical growth curves on the Gumbel probability scale, as shown in the example plot in Figure 47 and expressed by Equation 7.16, i.e.:

ln Ne(x) = X_t - X_r (7.16)
If the assumption is made that the degree of spatial independence can be no less than total dependence (Ne = 1) and no greater than complete independence (Ne = N), the following is expected:

1 ≤ Ne(x) ≤ N for all x (7.17)
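The relationships in Equations 7.11 to 7.14 can be illustrated numerically. The sketch below is not from the thesis; the Gumbel typical curve and the quantile value are illustrative assumptions. It shows that complete independence recovers Ne = N and complete dependence recovers Ne = 1:

```python
import math

def ne_from_curves(F_r, F_t):
    """Effective number of independent stations (Equation 7.14):
    Ne(x) = ln F_r(x) / ln F_t(x)."""
    return math.log(F_r) / math.log(F_t)

def gumbel_cdf(x, loc=0.0, scale=1.0):
    # An illustrative typical growth curve F_t (Gumbel distribution)
    return math.exp(-math.exp(-(x - loc) / scale))

x = 2.0                       # an arbitrary quantile
N = 4                         # network size
F_t = gumbel_cdf(x)
F_r_indep = F_t ** N          # complete independence (Equation 7.11)
F_r_dep = F_t                 # complete dependence (Equation 7.12)

print(round(ne_from_curves(F_r_indep, F_t), 6))  # 4.0, i.e. Ne = N
print(ne_from_curves(F_r_dep, F_t))              # 1.0, i.e. Ne = 1
```

Partial dependence yields intermediate Fr values and hence 1 < Ne(x) < N, consistent with Equation 7.17.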
Figure 47 Example plot of regional maximum and typical growth curves and the effective number of
independent stations on a Gumbel plot for a random network of 2 and 4 gauging sites in Tasmania
7.9.2 A SIMPLE MODEL FOR Ne
In this study a relatively simple model of spatial dependence was obtained by ignoring the
possible variation of Ne with ARI. Hence the representation of spatial dependence reduces
to fitting a one-parameter model to relate the position of the regional maximum to the
typical growth curve; this single parameter is Ne.
As reported by Dales and Reed (1989), the maximum of Ne independent GEV distributions – where Ne is some constant – is a GEV with the following parameters:

ξ_r = ξ_t + α_t(1 - Ne^(-κ_t))/κ_t (7.18)

α_r = α_t Ne^(-κ_t) (7.19)

κ_r = κ_t (7.20)
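The parameter transformation in Equations 7.18 to 7.20 can be checked numerically: raising the typical-curve GEV distribution function to the power Ne should reproduce the GEV with the transformed parameters. The sketch below uses illustrative parameter values only (assumptions, not the thesis estimates):

```python
import math

def gev_cdf(x, xi, alpha, kappa):
    # GEV in the thesis convention: F(x) = exp(-[1 - kappa*(x - xi)/alpha]**(1/kappa))
    t = 1.0 - kappa * (x - xi) / alpha
    return math.exp(-t ** (1.0 / kappa))

# Illustrative typical-curve parameters (assumed values) and a network of Ne sites
xi_t, alpha_t, kappa_t = -0.5, 0.7, -0.15
Ne = 5.0

# Transformed parameters for the maximum of Ne independent GEV variates (Eqs 7.18-7.20)
xi_r = xi_t + alpha_t * (1.0 - Ne ** -kappa_t) / kappa_t
alpha_r = alpha_t * Ne ** -kappa_t
kappa_r = kappa_t

x = 2.0
lhs = gev_cdf(x, xi_t, alpha_t, kappa_t) ** Ne  # [F_t(x)]^Ne
rhs = gev_cdf(x, xi_r, alpha_r, kappa_r)        # GEV with transformed parameters
print(abs(lhs - rhs) < 1e-12)                   # True: the two agree
```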
Eliminating Ne, and setting κ_r = κ_t, we have:

ξ_r + α_r/κ_r = ξ_t + α_t/κ_t (7.21)

This condition implies that the lower or upper bound of the regional maximum of Ne independent sites coincides with that of the typical growth curve, i.e.

ξ_r + α_r/κ_r = ξ_t + α_t/κ_t = x_bound (7.22)
7.9.3 FITTING Ne BY THE MEAN
Since only one parameter is to be fitted, only the first probability-weighted moment, β_o, is required. This is simply the arithmetic mean of the AMFS data. For a GEV distribution with parameters ξ, α and κ, the theoretical (i.e. population) mean is:

β_o = ξ + α[1 - Γ(1 + κ)]/κ (7.23)
If we apply estimates derived from the regional maximum and typical data, we have:

β_o^r = ξ_r + α_r[1 - Γ(1 + κ)]/κ (7.24)

β_o^t = ξ_t + α_t[1 - Γ(1 + κ)]/κ (7.25)
Hence, applying Equation 7.22 and eliminating the Γ(1 + κ) term, we obtain:

α_r/α_t = (β_o^r - x_bound)/(β_o^t - x_bound) (7.26)
Finally, from Equation 7.26 the following expression is obtained:

Ne = [(β_o^r - x_bound)/(β_o^t - x_bound)]^(-1/κ) (7.27)

By standardisation we have β_o^t = 0, and β_o^r is simply the arithmetic mean of the regional maximum values.
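A minimal sketch of this fitting procedure follows (illustrative GEV parameters assumed, not the thesis values). The regional maximum of Ne independent sites is constructed analytically via Equations 7.18 and 7.19, and Ne is then recovered from the two means and the common bound using Equation 7.27:

```python
import math

def gev_mean(xi, alpha, kappa):
    # Population mean of the thesis-convention GEV (Equation 7.23)
    return xi + alpha * (1.0 - math.gamma(1.0 + kappa)) / kappa

def ne_by_mean(beta_r, beta_t, x_bound, kappa):
    # Equation 7.27
    return ((beta_r - x_bound) / (beta_t - x_bound)) ** (-1.0 / kappa)

# Illustrative typical-curve parameters; regional maximum of Ne = 5 independent sites
xi_t, alpha_t, kappa = -0.5, 0.7, -0.15
Ne_true = 5.0
xi_r = xi_t + alpha_t * (1.0 - Ne_true ** -kappa) / kappa   # Equation 7.18
alpha_r = alpha_t * Ne_true ** -kappa                       # Equation 7.19

x_bound = xi_t + alpha_t / kappa        # common bound (Equation 7.22)
beta_t = gev_mean(xi_t, alpha_t, kappa)
beta_r = gev_mean(xi_r, alpha_r, kappa)
print(round(ne_by_mean(beta_r, beta_t, x_bound, kappa), 6))  # 5.0
```

In practice β_o^t and β_o^r would be sample means of the standardised typical and regional-maximum data rather than population values.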
7.10 SIMULATED DATASETS
As discussed earlier, given the limitations of the real data set to give clearly meaningful
results because of issues with sampling variability and homogeneity, it was decided to
generate synthetic datasets for each of the regions with known population correlation
coefficients. There are two important aspects of this simulation exercise: (i) to compare the
effective number of stations Ne, with those of the real dataset, and to identify any major
differences by having some control over the issues of homogeneity and sampling
variability, and (ii) when deriving a spatial dependence model for practical use the
simulated data will provide insight into identifying a suitable model function (this is
discussed in more detail in Chapter 8).
7.10.1 SYNTHETIC DATA GENERATION
For the generation of AMFS data, it was assumed that:
(i) generated data come from the same population,
(ii) data from different years are independent (i.e. a particular year's data for a given site is not correlated with another year's data at any site) and
(iii) data from the same year at different sites are dependent with a given degree of cross-correlation.
To represent the region’s data, the regional average standardised GEV distribution
parameters (for each state/region of Australia) of the AMFS were used in data generation.
The multi-site maxima were generated according to the following steps.
(i) For a given correlation coefficient, a vector of random multivariate normal deviates, with zero mean and a covariance matrix formed from the constant cross-correlations and the standard deviation of the standardised regional data, is generated using the Matalas (1967) method.
(ii) The normal variates vector is transformed to a GEV distribution with the
regional average standardised GEV distribution parameters of the particular
state or region.
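These two generation steps can be sketched as below. This is a simplified stand-in for the Matalas (1967) procedure, using a constant-correlation covariance matrix and a probability-integral transform; the function names are illustrative, and the TAS parameters are taken from Table 40:

```python
import numpy as np
from math import erf, sqrt

def gev_quantile(p, xi, alpha, kappa):
    # Inverse of F(x) = exp(-[1 - kappa*(x - xi)/alpha]**(1/kappa))
    return xi + alpha * (1.0 - (-np.log(p)) ** kappa) / kappa

def generate_amfs(n_sites, n_years, rho, xi, alpha, kappa, seed=1):
    """Correlated standard normal deviates (constant cross-correlation rho)
    transformed to the regional average standardised GEV distribution."""
    rng = np.random.default_rng(seed)
    cov = np.full((n_sites, n_sites), rho)   # constant cross-correlation
    np.fill_diagonal(cov, 1.0)               # unit variance at each site
    z = rng.multivariate_normal(np.zeros(n_sites), cov, size=n_years)
    u = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))  # standard normal CDF
    return gev_quantile(u, xi, alpha, kappa)

# TAS regional parameters from Table 40, constant correlation 0.5
data = generate_amfs(51, 1000, rho=0.5, xi=-0.574, alpha=0.982, kappa=-0.0073)
print(data.shape)  # (1000, 51): 1000 years at 51 stations
```

As noted below, the transform does not strictly preserve ρ in the GEV domain; the realised average correlation of the generated data is what matters.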
In an effort to counteract sampling variability due to limited record lengths, sequences of annual maximum flood data for a region with 51 stations, each having a record length of 1000 years, were generated. The (constant) correlation coefficient between the AMFS data from different stations was varied from 0.0 to 0.5 in steps of 0.1. Figure 48 gives an example of a generated data set with constant correlation coefficients of 0.0 and 0.5 for the state of Tasmania (TAS). In all, 500 replicates of regional data (each replicate consisting of data for 51 stations) were generated for each constant correlation coefficient.
Figure 48 Example plot of generated data with different constant correlation coefficients for the state
of Tasmania
Table 40 gives the GEV distribution parameters for the parent distributions used in the data
generation and the mean parameters for the generated data for each of the regions. The
parent distribution parameters used to generate the data appear to be reasonably well preserved in the generated data. The correlation coefficients (ρ) were not as well preserved as the parameters, as ρ was not directly introduced in the GEV data generation (correlated standard normal deviates were generated and then transformed to a GEV distribution). In any case, the strict preservation of a particular ρ is not that important for this analysis; the essential requirement is to know the average ρ value, particularly when generalising the spatial dependence model (see Chapter 8). This ρ is then assumed to represent the population correlation coefficient.
Table 40 Comparison of the parameters of the parent distribution and the distribution for the generated data (distribution: F(x) = exp{-[1 - κ(x - ξ)/α]^(1/κ)}) and correlation coefficient, ρ.

Region         ρ Parent  ρ Gen.    ξ Parent  ξ Gen.    α Parent  α Gen.   κ Parent  κ Gen.
NSW+QLD+VIC    0.00      -0.0022   -0.488    -0.493    0.652     0.656    -0.149    -0.156
               0.10      0.086     -0.488    -0.491    0.652     0.655    -0.149    -0.152
               0.20      0.172     -0.488    -0.495    0.652     0.654    -0.149    -0.158
               0.30      0.267     -0.488    -0.492    0.652     0.665    -0.149    -0.151
               0.40      0.357     -0.488    -0.491    0.652     0.663    -0.149    -0.149
               0.50      0.451     -0.488    -0.492    0.652     0.663    -0.149    -0.153
TAS            0.00      0.00023   -0.574    -0.576    0.982     0.988    -0.0073   -0.0067
               0.10      0.094     -0.574    -0.5749   0.982     0.978    -0.0073   -0.0062
               0.20      0.175     -0.574    -0.579    0.982     0.982    -0.0073   -0.004
               0.30      0.287     -0.574    -0.571    0.982     0.979    -0.0073   -0.007
               0.40      0.385     -0.574    -0.562    0.982     0.978    -0.0073   -0.008
               0.50      0.481     -0.574    -0.575    0.982     0.986    -0.0073   -0.006
WA             0.00      0.0003    -0.500    -0.500    0.685     0.682    -0.158    -0.162
               0.10      0.082     -0.500    -0.495    0.685     0.683    -0.158    -0.151
               0.20      0.173     -0.500    -0.494    0.685     0.693    -0.158    -0.159
               0.30      0.264     -0.500    -0.508    0.685     0.689    -0.158    -0.160
               0.40      0.356     -0.500    -0.496    0.685     0.687    -0.158    -0.155
               0.50      0.461     -0.500    -0.510    0.685     0.685    -0.158    -0.160
NT             0.00      0.0089    -0.503    -0.505    0.755     0.748    -0.0831   -0.0836
               0.10      0.089     -0.503    -0.502    0.755     0.751    -0.0831   -0.0833
               0.20      0.169     -0.503    -0.499    0.755     0.759    -0.0831   -0.0841
               0.30      0.283     -0.503    -0.503    0.755     0.748    -0.0831   -0.0827
               0.40      0.375     -0.503    -0.502    0.755     0.752    -0.0831   -0.0867
               0.50      0.454     -0.503    -0.507    0.755     0.763    -0.0831   -0.0819
SA             0.00      0.0009    -0.496    -0.496    0.753     0.750    -0.0762   -0.0761
               0.10      0.083     -0.496    -0.493    0.753     0.755    -0.0762   -0.0762
               0.20      0.189     -0.496    -0.493    0.753     0.751    -0.0762   -0.0751
               0.30      0.253     -0.496    -0.491    0.753     0.753    -0.0762   -0.0776
               0.40      0.355     -0.496    -0.489    0.753     0.753    -0.0762   -0.0762
               0.50      0.484     -0.496    -0.497    0.753     0.753    -0.0762   -0.0752

*Gen. = generated data
7.11 SUMMARY
The main steps in this chapter can be summarised as follows. At the outset of this chapter the LFRM concept was discussed briefly, and the issue of inter-site dependence was introduced and discussed in the light of the application of the LFRM. The chapter also described the comprehensive Australian AMFS dataset and the quality checks undertaken to make the data suitable for use with such an application.
Identifying an appropriate probability distribution is an important step in deriving a general
spatial dependence model. In this chapter, different goodness-of-fit tests were used to
establish a suitable distribution to describe the AMFS, which included the L-moment ratio
diagram, the DISTZ statistic of Hosking and Wallis (1991), the Anderson-Darling (AD)
Monte Carlo simulation test and visual inspections. It was found that the GEV distribution
was the most appropriate to approximate the AMFS data. Testing for homogeneity was also
undertaken using the homogeneity test of Hosking and Wallis (1993) and the Bootstrap AD
test. Both tests showed that strict homogeneity could not be established for any of the
Australian states or for Australia as a whole. In relation to homogeneity, for the purpose of this analysis it was found that there is insufficient evidence to reject the assumption of homogeneity of the largest values in the regional sample.
This chapter then described the development of the LFRM for the Australian dataset
allowing for spatial dependence. The LFRM as outlined in this chapter has successfully
enhanced the method introduced by Majone et al. (2007) by using up to 5 maximum flood
values from each site (rather than just the largest value). The results and derived formulae
were given and discussed in some detail. Given that spatial dependence reduces the net
information for RFFA, the effects of inter-site dependence on the LFRM were discussed in
detail based on the “effective number of stations (Ne) concept”. Methods for pooling
recorded flood data, the issues regarding regional maximum floods at a network of sites,
influencing factors on the regional maxima and cross-correlation were introduced and
discussed in detail. This then provided the motivation to present the theory for the methods
for defining and determining a region for analysis which also included sampling regional
maxima. Furthermore, this chapter also discussed three network sampling methods which
were used in this study; they were the ROI, random ROI and total random networks. The
methodology for estimating Ne based on the GEV distribution was then described as
outlined in Dales and Reed (1989). Finally, given the limitations of the real data set to give
clearly meaningful results in relation to the derivation of Ne because of issues with
sampling variability and homogeneity, it was decided to generate synthetic datasets for
each of the regions for use in the analysis.
CHAPTER 8: APPLICATION OF LFRM IN THE LIGHT OF
SPATIAL DEPENDENCE – RESULTS AND DISCUSSION
8.1 GENERAL
This chapter begins by looking at the detailed results and the typical behaviour of the
number of independent sites (Ne) for both the real and simulated datasets. The chapter then
goes on to describe how a general model for spatial dependence was achieved. A detailed
discussion is also provided on the generalised spatial dependence model.
Finally, the large flood regionalisation model (LFRM) is revisited for the Australian
continent in the light of spatial dependence (i.e. LFRM combined with the developed
spatial dependence model). The LFRM is then coupled with Bayesian generalised least
squares regression (BGLSR - to estimate the mean and coefficient of variation (CV) of the
AMFS data) to estimate large to rare floods for gauged and ungauged catchments. A split-
sample validation is undertaken to compare the results of the LFRM with established
methods such as the parameter regression technique (see Chapters 3 and 5) and
international methods on large floods (i.e. World Model).
8.2 RESULTS FOR Ne
Following the procedures described in sections 7.8.7 to 7.8.10 the different network
methods were used to establish an indication of the typical degree of dependence in
network sizes of N = 2, 4 and 8. This was carried out on the real and simulated datasets
with the main purpose of describing the typical spatial dependence in each region/state
separately.
The Ne values were obtained for these different network sizes by fitting the mean as described in sections 7.9.2 and 7.9.3, and are detailed in Tables 41 and 42 for the real and
simulated datasets for the different networks and regions. It can be seen that the total
random network exhibits less spatial dependence than both the ROI and random ROI
networks. This finding is not surprising as sites that are closer together are more likely to
show more spatial dependence. This can be seen in all the regions when comparing the Ne
values for the different N-sized networks.
Table 41 Experimental values of Ne for different networks and regions using the real data
(average Ne over the experiment reported)
Real data set

ROI & RANDOM ROI networks     Number of gauges (sites), N
Region                        2       4       8
NSW+QLD+VIC                   1.74    3.03    5.36
TAS                           1.60    2.55    4.04
WA                            1.62    2.72    4.43
NT                            1.72    2.89    5.09
SA                            1.50    2.20    3.21

TOTAL RANDOM network          Number of gauges (sites), N
Region                        2       4       8
NSW+QLD+VIC                   1.90    3.66    6.80
TAS                           1.83    3.30    5.87
WA                            1.88    3.59    7.00
NT                            1.81    3.40    6.61
SA                            1.66    2.59    3.93
Importantly, the same features as above can be seen in the simulated data; however, the simulated data show less spatial dependence in the 'total random network' than the real dataset. What is worth noting here, in the case of the simulated data, is that for most of the regions the networks of size 8 show more of a tendency towards independence than the smaller network sizes; this is most evident in the total random network. From Tables 41 and 42 it can be seen that, across all the regions and the different N-sized networks, the spatial dependence in SA and TAS is more severe. This result coincides with these regions being much smaller than the other regions examined here, such that the sites are located in closer proximity to each other. Overall, the results of the simulated datasets are in agreement with the real data, which is pleasing.
Table 42 Experimental values of Ne for different networks and regions using the simulated
data (average Ne over the experiment reported)
Simulated data set

ROI & RANDOM ROI networks     Number of gauges (sites), N
Region                        2       4       8
NSW+QLD+VIC                   1.75    2.89    4.81
TAS                           1.71    2.88    4.75
WA                            1.73    2.94    4.91
NT                            1.73    2.93    4.88
SA                            1.60    2.55    4.18

TOTAL RANDOM network          Number of gauges (sites), N
Region                        2       4       8
NSW+QLD+VIC                   1.93    3.66    6.96
TAS                           1.93    3.71    7.08
WA                            1.94    3.74    7.20
NT                            1.94    3.73    7.18
SA                            1.74    3.01    4.72
8.3 A CLOSER LOOK AT THE BEHAVIOUR OF Ne
Continuing the discussion above, the Ne values were analysed more closely. It was noted throughout the experiments that violations of the constraint Ne ≤ N were a recurring feature, especially for the 2 and 4 gauge networks and less frequently for the 8 gauge networks, for all the regions. This was more noticeable with the real dataset than the simulated data. For the real dataset the worst of the violations occurred with the total random network. Figure 49 provides an example illustration of these violations for the NSW+QLD+VIC region (the results for the other states can be seen in Appendix C). The top three plots show the results associated with the real data, while the bottom three plots illustrate the simulated data. The first 400 experiments depict the results of the ROI and random ROI networks, while the last 400 experiments represent the total random sampling experiments. With the real dataset, it can be clearly seen that there is a distinct change in the pattern of Ne with experiment number, where the ROI and random ROI clearly show
that there is more spatial dependence between sites. Similar results were obtained for the other regions except SA. These results are also supported by the simulation results, where it can be seen that stations with a low average correlation coefficient are usually spatially independent and in some cases violate the Ne ≤ N condition as well. With the simulated datasets this usually occurred when the average correlation was negative. However, it can be seen from Figures 49 and 50 that the violations were less frequent as the network size was increased. Figure 50 provides the histogram of the frequency with which Ne falls in a particular class interval; the results for the real and simulated datasets are provided for the NSW+QLD+VIC region (the results for some of the other states can be seen in Appendix C). Indeed, it can be observed that there are many places where the Ne ≤ N condition is not satisfied. This is more noticeable for the real dataset (top three plots), as a wider range of cross-correlation is experienced compared to a controlled simulation. It was noticed that there was a reasonable number of very low and negative average correlation coefficients in the analysis; this was observed for all the regions analysed. This raises the question of possible negative dependence in the real dataset; while this has not been examined closely here, it would be worthwhile investigating this issue at a later stage. In any case, given the low concurrent record length between sites and the inherent assumptions in the modelling, some of these violations may be attributed to symptomatic limitations in this GEV-based method. It therefore seems possible that the violations are simply due to sampling effects and the fact that the data have been standardised by the mean and CV (Equation 7.3, Chapter 7) when estimating Ne. Dales and Reed (1989) arrived at similar conclusions; however, they standardised the data by the mean only, as per the index flood approach.
Figure 49 Variation of Ne with different network methods and experiment number for
NSW+QLD+VIC region (top panel for real data and bottom panel for simulated data)
Figure 50 Frequency of Ne with different network methods for NSW+QLD+VIC region (top panel for
real data and bottom panel for simulated data)
While a constant Ne model was assumed for use with the LFRM, further investigations
were carried out that looked at the possible variation of Ne with respect to ARI for the same
set of experiments but only focussing on the real dataset. Table 43 summarises the results
for the different sized networks and regions. It can be noticed that for the larger regions
(NSW+QLD+VIC and WA) the degree of spatial dependence is broadly similar and that
spatial independence is reached at relatively low ARIs. The smaller regions, or regions where stations are closely clustered (TAS, SA and, to a lesser extent, NT), show more dependency, with slightly higher ARIs required before independence is reached; this is the case for TAS and NT (see Table 43). However, it can be seen that SA never reaches independence at any ARI, which suggests that these stations are highly cross-correlated. If one looks at the location of the stations in SA (see Figure 16), they are found to be in very close proximity to each other.
Table 43 Experimental results in which Ne exceeds N at a particular ARI for different
regions using the real data set
Real data set

Networks / Region   ARI (years) at which Ne = N
                    N = 2   N = 4   N = 8
NSW+QLD+VIC         5.9     9.9     13.9
TAS                 7       23.2    37.6
WA                  5.7     8.2     9.3
NT                  11.9    15.5    28.4
SA                  *       *       *
*SA never reaches independence
Overall, in analysing the real and simulated data experiments, the available evidence suggests that spatial dependence in the Australian AMFS data reduces with larger regions, networks and ARIs, whereas for the smaller regions spatial dependence is more evident. Hence, it is noted that the overall modelled Ne values may be inherently uncertain (i.e. when applied to estimate large ARIs), which would in turn overestimate the ARI of interest. However, when put into perspective, at present a hydrologist is only able to estimate a large design flood with an associated ARI on the assumption that all the sites in a region are totally independent, which would indeed lead to underestimation of the ARI of interest. Therefore, the analysis undertaken in this study should only be seen as
providing a new framework of risk assessment for large to rare flood estimation rather than
a perfect answer. As such, this approach can be expected to provide reasonably accurate
risk assessments at the higher ARIs, which are of interest in the application of the LFRM
model.
8.4 GENERALISING THE Ne MODEL
Deriving a general model of spatial dependence is not straightforward. Placing too much emphasis on a particular aspect of the experimental results may produce many regional sub-models, which would introduce significant regional variations when the spatial dependence models are applied. In this case a regional approach is still warranted, where a suitable model is used to describe the spatial dependence in each region/state separately and the results are then combined to frame one relationship for use across all of Australia. In this study, a relatively simple model of spatial dependence was obtained by ignoring the possible variation of Ne with ARI.
Regression analysis using unweighted ordinary least squares is used to relate Ne to the
average correlation coefficient (ρ) of concurrent AMFS at pairs of stations for the different
networks and regions for each of the adopted 8,292 experiments (this includes the real data
and simulated data). To derive the regression equation it was determined to be more appropriate to build a general model that relates the ratio lnNe/lnN to the average correlation coefficient (ρ). Dales and Reed (1989) showed that the ratio lnNe/lnN provides a neat index of the degree of spatial independence in annual maximum data, the index ranging between 0 (total dependence) and 1 (total independence). The derived spatial dependence models and the regression analysis are provided below.
8.4.1 CONSTANT Ne MODEL – AN EMPIRICAL RELATIONSHIP FOR Ne
BASED ON AVERAGE CORRELATION COEFFICIENT (ρ)
The form of the constant Ne model is given by Equation 8.1, which was calibrated by combining the models for each of the Australian states into one generic equation. The final form of Equation 8.1 was identified by investigating the real and simulated data sets:

ln(Ne)/ln(N) = a + bρ (8.1)
In all the regions, the one variable model (see Equation 8.1) provided a relatively good fit
to the experimental data. The fitted parameters of the constant Ne model for all the states
individually and Australia (overall) are given in Table 44 for the real and simulated
datasets. The final values shown in Table 44 are the average coefficient values over the networks and experiments for each study region. The final parameter values for the general Australian spatial dependence model were found by combining the different network values of the ratio lnNe/lnN, developing a regression equation of the form of Equation 8.1, and then averaging the coefficient values of the resulting regression equations. Figures 51 to 53 show the typical results for each network in determining the final Australian spatial dependence model with the real dataset. A similar procedure was carried out for the simulated dataset.
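The fitting step can be sketched with unweighted ordinary least squares; the ratio values below are illustrative placeholders, not the thesis experiment results:

```python
import numpy as np

# Hypothetical experiment summaries: average correlation and lnNe/lnN ratios
# (illustrative values only, not the thesis experiment results)
rho = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
ratio = np.array([1.00, 0.94, 0.87, 0.81, 0.75, 0.68])

# Unweighted OLS fit of Equation 8.1: ln(Ne)/ln(N) = a + b*rho
b, a = np.polyfit(rho, ratio, 1)   # polyfit returns slope first for degree 1
print(round(a, 2), round(b, 2))    # intercept near 1, slope near -0.64

def ne_constant_model(N, rho, a=a, b=b):
    # Predicted effective number of independent stations for an N-station network
    return N ** (a + b * rho)
```

With the real experiments, the corresponding all-Australia coefficients were a = 1 and b = -0.66 (Table 44).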
[Figure 51 panels: ln(Ne)/ln(N) versus average correlation coefficient for N = 2, with fitted regression line and 95% CI (s = 0.073, R-Sq = 88.7%), together with residual plots: normal probability plot, residuals versus fits, histogram, residuals versus order]
Figure 51 Regression results of the N = 2 network combining the lnNe/lnN ratio values for all the
Australian states/regions and experiments
[Figure 52 panels: ln(Ne)/ln(N) versus average correlation coefficient for N = 4, with fitted regression line and 95% CI (s = 0.060, R-Sq = 88.7%), together with residual plots: normal probability plot, residuals versus fits, histogram, residuals versus order]
Figure 52 Regression results of the N = 4 network combining the lnNe/lnN ratio values for all the
Australian states/regions and experiments
[Figure 53 panels: ln(Ne)/ln(N) versus average correlation coefficient for N = 8, with fitted regression line and 95% CI (s = 0.068, R-Sq = 83.6%), together with residual plots: normal probability plot, residuals versus fits, histogram, residuals versus order]
Figure 53 Regression results of the N = 8 network combining the lnNe/lnN ratio values for all the
Australian states/regions and experiments
Figures 51 to 53 show that the standard error (s on the graphs) associated with the regression equations is quite modest, which suggests that most of the variability in the lnNe/lnN ratio can be well explained by the average correlation coefficient in a network of sites; this can also be seen in the narrow 95% confidence interval on the prediction. It is also observed that there are some outliers in the analysis, with standardised residuals approaching the ±5 limit. These values were further examined by removing them from the analysis; however, removing them did not provide any further benefit. Hence, the outliers were finally retained in the regression analysis.
The coefficient of determination (R2) values for the final models (see Table 44) fitted to the real and simulated data sets are quite high, suggesting that the use of the constant Ne model should result in improved Ne estimates compared to the values calculated directly from the AMFS data in each station network (this is more apparent, however, for the simulated spatial dependence model). The comparison of the fitted Ne values for the real and simulated data computed using Equation 8.1 with those from the spatial dependence Equation 7.27 (see Chapter 7) is shown in Figure 54. The figure (real data) illustrates that the scatter in the spatial dependence model estimates increases with increasing N. The scatter in the real dataset results may also be attributed to natural and sampling variability from site to site, given that the concurrent record length for analysis was very modest and that strict homogeneity was not established. Further scatter could be attributed to the overall limitation of the GEV methodology used here in estimating Ne. This introduces higher uncertainties in Ne estimates for larger N values, which really just reflects the larger number of data points in the larger networks; this would certainly have a detrimental effect on large flood quantile estimation. Figure 54 and Table 44 show the overall satisfactory performance of Equation 8.1, as the simulated and real dataset results are mostly quite similar.
Table 44 Properties of the constant Ne spatial dependence model

                 Real data              Simulated data
Region/State     a      b      R2 (%)   a      b      R2 (%)
NSW+QLD+VIC      0.99   -0.66  89       1      -0.63  99
TAS              0.98   -0.59  79       1      -0.63  99
WA               0.99   -0.61  83       1      -0.63  99
SA               1.02   -0.75  84       1      -0.63  99
NT               0.99   -0.59  64       1      -0.62  99
All AUSTRALIA    1      -0.66  88       1      -0.63  99
Figure 54 Comparison of directly computed Ne from the AMFS data and Ne by the constant Ne model
8.4.2 FURTHER DISCUSSION
The coefficients of Equation 8.1 given in Table 44 suggest that the estimated constant Ne ≈ N for independent stations (ρ = 0), as should be the case. However, for totally dependent flood data (ρ = 1), the estimated constant Ne ≠ 1, in contrast to the theoretical expectation. This could be a manifestation of the simple linear form adopted for the relationship between the estimated constant Ne, N and ρ. It is noted that a quadratic equation might improve the fit in regions with high correlation. Indeed, for ρ → 1, the errors in the estimated constant Ne are high for large N values. However, it should be kept in mind that this issue would have little effect on estimates from methods using the LFRM approach, as the average correlation coefficient between sites is normally much smaller than one.
The use of the constant Ne model in applications with the LFRM approach is quite general,
and it can be applied anywhere in Australia. The main difficulty that could arise is
calculating the correlation coefficient for pairs of stations where only a limited
concurrent flood record is available. In such a situation, the alternative is first to
compute the correlation coefficient from a regional relationship with distance (see
Figure 39, Chapter 7) and then to apply Equation 8.1.
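As a concrete illustration, the constant Ne model can be applied in a few lines of code. This is a sketch only: it assumes Equation 8.1 takes the form ln(Ne) = (a + b ρ) ln(N), which is consistent with the Table 44 coefficients and reproduces the all-Australia Ne of 207 reported later in the chapter (Table 46); the exact form should be checked against Equation 8.1 itself.

```python
import math

def constant_ne(n_stations, rho_avg, a, b):
    """Effective number of independent stations, assuming Equation 8.1
    has the form ln(Ne) = (a + b * rho_avg) * ln(N)."""
    return n_stations ** (a + b * rho_avg)

# All-Australia coefficients from Table 44 (real data) and the average
# inter-site correlation of 0.26 from Table 45
ne = constant_ne(626, 0.26, a=1.0, b=-0.66)
print(round(ne))  # reproduces the Ne of about 207 reported in Table 46
```

For a pair of stations with little concurrent record, rho_avg would first be estimated from the regional correlation-distance relationship (Figure 39, Chapter 7) before calling the function.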
8.5 COMPARISON OF THE EFFECTIVE RECORD LENGTH ESTIMATES
USING THE CONSTANT Ne MODEL FOR THE REAL AND SIMULATED
DATASETS
The effective record lengths were estimated using the real and simulated constant Ne
models and Equation 7.9 (see Chapter 7). Figure 55 shows the typical variation of the total
record lengths and the effective record lengths from the real and simulated constant Ne
models. As expected, the differences are modest, with the effective record length
estimates from the simulated Ne model slightly higher than those of the real Ne model. It
can also be observed that the average correlation coefficient consistently decreases with
an increasing number of stations. This may be attributed to the fact that the extreme
observations from a network tend to be more independent, regardless of the high degree of
correlation which more frequent flows may exhibit. Similar results were found by
Nandakumar et al. (1997) with rainfall data.
[Figure 55: average correlation coefficient (0 to 1, left axis) and total/effective record length in years (0 to 25,000, right axis) versus number of stations (1 to 1000, log scale), with series for Australia Ne, L and Le from the real and simulated models]
Figure 55 Variation with number of sites: effective record lengths estimated using real and simulated
Ne models as a function of average correlation coefficient
8.6 REVISITING THE LFRM IN THE LIGHT OF SPATIAL DEPENDENCE
The LFRM for the study data in its current form (see Equations 7.7 and 7.8 – Chapter 7)
does not allow for the effect of inter-site dependence which reduces the net information
available for regional analysis. In this section spatial dependence is accounted for through
the use of the spatial dependence model derived in the previous sections (see Equation 8.1),
which defines the effective number of independent stations in a region (Ne) as a function of
the average correlation coefficient in the region. For this study, the use and calculation of
Ne for application with the LFRM is illustrated. Firstly, the average correlation for each
pair of sites was calculated for each state/region. The average correlation coefficients are
shown in Table 45.
Table 45 Average correlation coefficient (ρ) for each pair of sites for the different states/regions
Secondly, using Equation 8.1 along with the coefficients for the Australian spatial
dependence model given in Table 44 (for the real and simulated data) and the average ρ of
0.26, Ne was estimated. The calculated Ne value along with the effective record length is
given in Table 46. One can see from Table 46 that the results from the real data match
reasonably well with the simulated data, which represent the result if the region were
truly homogeneous. Another way of estimating the number of effective sites would be to
use the individual coefficient results for each state/region from Table 44 with Equation
8.1 along with the average correlation coefficient from Table 45. However, as discussed
in section 8.4, significant regional variability from state to state may exist, and the
use of a general model (i.e. the model using the Australian model coefficients) is
preferred.
Table 46 Total record length (L) and effective record length (Le) for the all-Australian
dataset

Region        | N   | L     | Constant Ne model, real coefficients: Ne* / Le | Constant Ne model, simulated coefficients: Ne* / Le
All Australia | 626 | 21049 | 207 (33%) / 6969                               | 228 (36%) / 7654

* Ne values in parentheses are percentages of N
Region/State  | Average ρ
NSW+QLD+VIC   | 0.22
TAS           | 0.20
WA            | 0.21
SA            | 0.42
NT            | 0.25
Average of ρ  | 0.26

Using the calculated Ne value of 207 (from the real dataset) in Equation 7.6 (Chapter 7)
instead of the total number of stations (626) to estimate the new plotting position of the
pooled data points (1 max, 3 max and 5 max), the new interpolated curve for Equation 7.7
(Chapter 7) becomes:

Ymax = C1^Ne (YT)^2 + C2^Ne (YT) + C3^Ne    (8.2)
Equation 8.2 is then substituted into Equation 7.8, which yields the new definition of the
LFRM (Equation 8.3) that corrects for the spatial dependence in the dataset. Equations
8.2 and 8.3 give the analytical expression of the LFRM model for the study dataset using
the 1, 3 and 5 maxima. The appropriate values of the coefficients of Equations 8.2 and
8.3 are given in Tables 38 (Chapter 7) and 47. One can clearly see the difference in the
coefficients of the LFRM when comparing the results of the dataset using N and Ne sites;
this is due to the reduction in the total useful information (i.e. the effective number of
stations). The new interpolated frequency curves can be seen in Figure 56 (top curve).
Qmax/mean = 1 + (C1^Ne (YT)^2 + C2^Ne (YT) + C3^Ne) CV    (8.3)
Table 47 Coefficients and R2 values of Ymax polynomial interpolation from Figure 56 for N
and Ne sites

Ne sites - Ymax | C1^Ne  | C2^Ne | C3^Ne | R2
1               | -0.025 | 0.71  | 1.42  | 0.996
3               | -0.045 | 0.95  | 0.78  | 0.997
5               | -0.054 | 1.06  | 0.44  | 0.999

N sites - Ymax  | C1     | C2    | C3    | R2
1               | -0.027 | 0.80  | 0.49  | 0.997
3               | -0.041 | 0.98  | -0.18 | 0.998
5               | -0.044 | 1.07  | -0.59 | 0.999
Figure 56 Frequency distribution of standardised Ymax values using N and Ne stations
What is striking in Figure 56 is the upward shift in the frequency curve of the pooled
data. Taking the 5 max plot as an example, at a Ymax value of approximately 3 it can be
seen that ignoring spatial dependence may notably underestimate the flood magnitude risk
(for N sites Ymax = 3 corresponds to an ARI of 55 years; for Ne sites Ymax = 3 corresponds
to an ARI of 20 years). For the pooling of the 5 max and correcting for spatial
dependence (see the 5 max plot in Figure 56), the range of Ymax values for which the
fitted model (referred to as LFRM_Ne henceforth) might be considered reliable is
approximately 1.5 to 5, which corresponds to ARIs of 10 to approximately 3000 years.
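The ARI figures quoted above can be reproduced by inverting the fitted quadratic of Equation 8.2. This is a sketch only: it assumes YT is the Gumbel reduced variate, YT = -ln(-ln(1 - 1/T)), which is consistent with the plotting-position construction in Chapter 7 but is an assumption here; the coefficients are the 5 max rows of Table 47.

```python
import math

def ari_for_ymax(y_max, c1, c2, c3):
    """Solve c1*y^2 + c2*y + c3 = y_max for the reduced variate y (taking
    the root on the rising limb of the curve), then convert y to an ARI
    via the Gumbel relation y = -ln(-ln(1 - 1/T))."""
    disc = c2 ** 2 - 4.0 * c1 * (c3 - y_max)
    y = (-c2 + math.sqrt(disc)) / (2.0 * c1)  # physical (smaller) root for c1 < 0
    return 1.0 / (1.0 - math.exp(-math.exp(-y)))

# Table 47, 5 max coefficients
ari_n = ari_for_ymax(3.0, -0.044, 1.07, -0.59)   # N sites, dependence ignored
ari_ne = ari_for_ymax(3.0, -0.054, 1.06, 0.44)   # Ne sites, dependence corrected
print(round(ari_n), round(ari_ne))  # the N-site ARI is roughly triple the Ne-site ARI
```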
Figure 57 shows the behaviour of the dimensionless quantiles derived from Equations 7.8
(Chapter 7) and 8.3 for ARIs of 50, 200 and 1000 years for all the pooled data (i.e. 1
max, 3 max and 5 max), for the quantiles estimated using N and Ne. The dimensionless
quantiles for the World model (referred to as the PM (world) model, based on 8500 gauging
stations around the world) developed by Majone et al. (2007) are also superimposed for
comparison. The comparison with the PM (world) curves in Figure 57 indicates that the
LFRM_Ne (1 max (626 data points), 3 max (1878 data points) and 5 max (3130 data points))
can explain most of the scatter in these plots, as the set of curves (the 50- and 200-year
ARI curves) for this extended ARI range (including the 1000-year ARI) captures most of the
upper part of the points in the pooled dataset of Q/mean values. However, the PM (world)
model seems to overestimate the Q/mean values for the Australian dataset, as its growth
curve for Q1000/mean lies above the scatter. The flatter slopes for 3 max and 5 max
(bottom two panels of Figure 57) are consistent with what was shown in Figure 43 (Chapter
7) and seem to reflect a weaker relationship of Q/mean with CV. Comparison of the 1 max
curves for Ne and N indicates that the allowance for spatial dependence has a smaller
influence on slope. Figure 57 also indicates that the extra data with 3 to 5 max provide
a better definition of the left hand tail of the distribution (the top few points in the
right hand tail are largely common to all three datasets (1 max, 3 max and 5 max)).
Figure 57 Various Qmax/mean quantiles derived from the LFRM_Ne model and PM (World) model
Table 48 lists the CV values for the different states of Australia along with catchment
areas and the largest Ymax values of the pooled data. Figure 58 shows how the LFRM_N
(i.e. 1 max) without correction for spatial dependence and the LFRM_Ne (i.e. 1 max) fit
the at-site data for the different ranges of CV values. As can be seen from this figure,
the LFRM_N and LFRM_Ne can provide reasonably accurate growth curve estimation for the
ARI range of 10 to 1000 years for CV values in the ranges 0.50-0.59, 0.60-0.69,
0.70-0.79, 0.80-0.89, 0.90-0.99, 1.00-1.10, 1.11-1.20, 1.21-1.40 and 1.41-1.60, and they
perform best in the CV range of 0.60-1.60 (approximately 81% (505 out of 626) of the
study catchments fall in this range). However, the LFRM_N and LFRM_Ne perform quite
poorly for CV values ranging from 0.18 to 0.49 and from 1.62 to 2.52 over a range of
ARIs, as seen in the plots of Figure 58. One can also see that the average CV values
(i.e. CVav) in Table 48 all fall in the best performance range of 0.60-1.60.
Table 48 CV values for study catchments in Australia

State | No. of stations | Avg record length (years) | CVmin | CVav | CVmax | Amin (km2) | Aav (km2) | Amax (km2) | Ymax (1 max)
VIC   | 131 | 33 | 0.32 | 0.86 | 1.69 | 3   | 320 | 997  | 5.26
NSW   | 96  | 34 | 0.58 | 1.08 | 1.83 | 8   | 352 | 1010 | 5.37
QLD   | 172 | 35 | 0.51 | 1.06 | 2.08 | 7   | 325 | 963  | 4.84
TAS   | 53  | 30 | 0.23 | 0.64 | 2.02 | 1.3 | 323 | 1900 | 5.74
WA    | 146 | 30 | 0.28 | 0.96 | 2.52 | 0.2 | 156 | 7406 | 5.47
SA    | 29  | 35 | 0.42 | 0.91 | 1.71 | 0.6 | 170 | 708  | 4.33
NT    | 55  | 35 | 0.18 | 0.84 | 1.49 | 1.4 | 581 | 4325 | 5.26
[Figure 58, first six panels (1 max): Q/mean versus ARI (years, log scale) for the LFRM_N and LFRM_Ne models in the CV ranges 0.18-0.49, 0.50-0.59, 0.60-0.69, 0.70-0.79, 0.80-0.89 and 0.90-0.99]
Figure 58 Empirical frequency distributions of Q/mean quantiles derived from the LFRM_N and
LFRM_Ne for different ranges of CV
[Figure 58, remaining panels (1 max): CV ranges 1.00-1.10, 1.11-1.20, 1.21-1.40, 1.41-1.60 and 1.62-2.52]
8.7 APPLICATION OF THE LFRM MODEL TO UNGAUGED
CATCHMENTS
The main interest here is the application of Equation 8.3 to ungauged catchments, which
requires the estimation of the mean flood and CV for the ungauged catchment in question.
The BGLSR and ROI approaches, as discussed in Chapter 3 and applied in Chapter 5, were
used to develop the prediction equations for the mean flood and CV of the AMFS data as
functions of catchment and climatic characteristics (predictor variables). The prediction
equation for the mean flood used a ROI of 30-40 stations, while 65-80 stations were used
for the CV, based on the findings of past studies (e.g. Haddad and Rahman, 2012; Rahman
et al., 2012) and on which state was being analysed.
8.7.1 DERIVATION OF PRIORS FOR THE MEAN FLOOD AND CV
As discussed previously and in more detail in Chapter 3, in order to apply the Bayesian
approach to the regional regression problem, one needs to formulate and define prior
distributions for the β coefficients and for the model error variance. Following Reis et
al. (2005), as no previous information on the β coefficients is available (this is the
case for the mean flood and CV), an almost non-informative prior is used. It consists of a
multivariate normal distribution with mean zero and a large variance such that the prior
distribution is relatively flat in the region of interest.
The prior information for the model error variance σ² (for the mean flood and CV) is
represented by an informative one-parameter (λ) exponential distribution, where λ is the
reciprocal of the prior expected mean value of the model error variance:

π(σ²) = λ exp(−λσ²), where σ² > 0    (8.4)

For the regionalisation of the mean flood, λ was set to the reciprocal of the residual
error variance estimate from ordinary least squares regression, so that this estimate is
taken as the expected prior mean of the model error variance.
Previous studies show that the model error variance of a GLS regional regression model of
scale and/or shape parameters may be zero if the method of moments (MOM) estimator is
employed (Madsen and Rosbjerg, 1997; Madsen et al., 2002; Reis et al., 2005; Haddad et
al., 2011b). This actually implies that the regional regression model is perfect, which is
considered to be unrealistic. Here, the Bayesian approach is developed further for the
analysis of a GLSR regional model that is employed to estimate the CV of AMFS. The
BGLSR model should provide a more reasonable estimator of the regional CV and its
uncertainty than the alternative MOM approach. One may also regionalise the standard
deviation of floods as done in Chapter 5; however regionalising CV allows its use more
directly with the LFRM concept.
For the regionalisation of CV, λ was set equal to 10. The rationale is as follows.
Inspection of Figure 58 shows that the LFRM performs best in the range of CV values from
0.60 to 1.60. Hence, if the true CV values were uniformly distributed between 0.5 and 2,
the variance would be approximately 1/5, which means the model error variance should be
less than 1/5. However, in order to be more realistic, λ was set equal to 10; in this
case there is still a probability of about 14% that σ² is greater than 1/5.
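The 14% figure follows directly from the exponential prior: with rate λ, the probability of exceeding a threshold t is exp(−λt). A quick check:

```python
import math

# Exponential prior pi(sigma2) = lam * exp(-lam * sigma2): the prior mean is
# 1/lam and the probability of exceeding a threshold t is exp(-lam * t).
lam = 10.0
p_exceed = math.exp(-lam * 0.2)  # P(model error variance > 1/5)
print(round(p_exceed, 3))  # 0.135, i.e. the ~14% quoted above
```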
8.7.2 ESTIMATION OF THE ERROR COVARIANCE MATRIX – ESTIMATION
OF THE SAMPLING ERROR VARIANCE
In BGLSR modelling, one requires an estimate of the sampling error covariance matrix.
However, it is difficult to obtain an exact expression for the error covariance matrix,
and its estimate is generally based solely on the data, as adopted by Stedinger and Tasker
(1985) and Madsen et al. (2002). In general, approximate expressions of the sampling
error variances for the mean flood and CV of floods can be formulated in terms of
population parameters. It must be noted, though, that to solve the BGLSR equations the
error covariance estimator should be independent, or nearly so, of the AMFS parameter
estimate ŷi (Stedinger and Tasker, 1985). Following a similar approach to that outlined
by Madsen and Rosbjerg (1997) and Madsen et al. (2002), an estimation procedure for the
sampling error variance that is nearly independent of the two AMFS parameters is
described below.
For the mean flood estimation (the mean flood was derived as the average of the AMFS at a
site), the sampling error variance is given by σi²/ni, where σi² is the population
variance. A reasonable estimate of σi² can be obtained from:

σ̂i² = (1/(ni − 1)) Σ_{j=1..ni} (qij − q̄i)²,  q̄i = (1/ni) Σ_{j=1..ni} qij    (8.5)
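In code, the estimator of Equation 8.5 is simply the unbiased sample variance of the AMFS divided by the record length. The flood series below is invented for illustration.

```python
def mean_sampling_variance(amfs):
    """Sampling error variance of the at-site mean flood (Equation 8.5):
    the unbiased sample variance of the AMFS divided by the record length."""
    n = len(amfs)
    q_bar = sum(amfs) / n
    s2 = sum((q - q_bar) ** 2 for q in amfs) / (n - 1)
    return s2 / n

# hypothetical annual maximum flood series (m3/s)
series = [120.0, 95.0, 210.0, 160.0, 80.0, 300.0, 140.0]
print(mean_sampling_variance(series))
```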
For estimation of the sampling error variance that is nearly independent of the at-site CV
estimate, the approximation suggested by Madsen and Rosbjerg (1997) and Reis et al.
(2005) is used, which is given by:
Var(ŷi) = (na/ni) Var(ȳ | na),  ȳ = (1/n) Σ_{i=1..n} ŷi,  na = int[(1/n) Σ_{i=1..n} ni]    (8.6)

where Var(ȳ | na) is the sampling variance computed as a function of the mean ȳ of the
statistic of interest (CV in this case) in the region, and na is the average number of
observations in the region.
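The scaling in Equation 8.6 is straightforward to code: the regional variance function is evaluated once at the average record length na and rescaled to each site. The record lengths and the regional variance value below are invented for illustration.

```python
def cv_sampling_variance(n_i, n_a, var_at_na):
    """Equation 8.6: scale the regional sampling variance evaluated at the
    average record length n_a back to a site with record length n_i."""
    return (n_a / n_i) * var_at_na

record_lengths = [28, 41, 33, 25, 37, 30]
n_a = int(sum(record_lengths) / len(record_lengths))  # integer average record length
var_na = 0.004  # Var(y | n_a) from the regional relationship (invented value)
print(n_a, [round(cv_sampling_variance(n, n_a, var_na), 5) for n in record_lengths])
```

Sites with shorter records than the regional average receive a proportionally larger sampling error variance, as Equation 8.6 intends.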
8.7.3 ESTIMATION OF THE SAMPLING ERROR – INTER-SITE
CORRELATION
For the estimation of cross correlation of parameter estimates between sites, all
corresponding AMFS data with concurrent record lengths were considered. The cross
correlation between the sample mean values, ρmean,ij, is equal to the correlation
coefficient ρij between the concurrent AMFS themselves. However, the correlation between
higher order sample moments depends on the order of the moment (Stedinger, 1983). For
example, for the CV estimates, the cross correlation coefficient is given by
ρcv,ij = ρij². Therefore, the effect of cross correlation dependence becomes less severe
for the higher order moments. In reality, the estimated cross correlation coefficients
have reasonably large sampling uncertainties associated with them. Therefore, direct use
of the sample estimates may result in an error covariance matrix (see Chapter 3) that
cannot be inverted. To overcome this problem, the cross correlation coefficients are
smoothed by relating the sample estimates to the distance between stations. In this study
the following exponential
correlation function is used:

ρij = θ^( dij / (1 + α dij) )    (8.7)

where dij is the distance between stations i and j, and θ and α are parameters to be
estimated from the data.
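A sketch of the smoothing step, assuming Equation 8.7 takes the common form ρij = θ^(dij/(1 + α dij)) (an assumption about the exact functional form); the parameter values below are invented, and in practice θ and α would be fitted to the sample correlations.

```python
def smoothed_rho(d_ij, theta, alpha):
    """Smoothed inter-site correlation as a function of distance:
    rho_ij = theta ** (d_ij / (1 + alpha * d_ij))."""
    return theta ** (d_ij / (1.0 + alpha * d_ij))

theta, alpha = 0.97, 0.006  # invented parameter values
for d in (0.0, 50.0, 200.0, 800.0):  # distances in km
    print(d, round(smoothed_rho(d, theta, alpha), 3))
# correlation is 1 at zero distance and decays smoothly with separation
```

Using the smoothed values in place of the raw sample correlations keeps the error covariance matrix well conditioned so it can be inverted.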
8.7.4 SOME ISSUES ASSOCIATED WITH REGIONAL ESTIMATION OF CV
In the plot (Figure 59), the sample values of CV calculated for the considered sites were
initially plotted against the corresponding catchment areas (an initial assumption was made
that CV might show some relationship with catchment area). It can be seen that there is a
high scatter of the data and that the high CV values correspond to a range of catchment
areas. Due to the high scatter of the data, Figure 59 cannot be used directly for the
estimation of CV in practical cases. As such the use of regression equations or more
formally the BGLSR in terms of catchment and climatic characteristics is most appealing.
Sections 8.7.5 and 8.7.6 provide further details on this.
[Figure 59: scatter of CV(Q) (0 to 3) against catchment area (0.01 to 10,000 km2, log scale)]
Figure 59 Relationship between CV and catchment area
8.7.5 SELECTION OF PREDICTOR VARIABLES
All the predictor variables as outlined in Table 3 (Chapter 4) were used as potential
predictors. Predictor variables were selected according to the approach outlined in Chapter
3 and section 3.6. To identify the model form, a fixed region approach was used where all
the catchments were considered to have formed one region (each state separately) and the
final choice for the preferred regional BGLSR model for the mean flood and CV was the
combination that best satisfied all the statistical criteria as discussed in section 3.6.
8.7.6 BGLSR RESULTS FOR MEAN AND CV
The stepwise regression procedure for selecting the best set of catchment/climatic
characteristics resulted in the following equation forms (Equations 8.8 and 8.9) for the
mean flood (mean) and CV for each Australian state. The regression equations are
presented in general form below, while the coefficients, each expressed by its posterior
mean value (i.e. β), for the final selected equations are tabulated in Table 49 along
with the model error variance (MEV), pseudo coefficient of determination (R2 GLSR) and
standard error of prediction (SEP) in %.

mean = β0 + β1(area) + β2(2I12)    (8.8)

CV = β0    (8.9)
Table 49 Summary of the finally selected BGLSR models for all the Australian states used
in the validation of the LFRM

State | Mean flood β0 / β1 / β2 | CV β0 | Mean flood MEV | CV MEV | Mean flood R2 GLSR | CV R2 GLSR | Mean flood SEP (%) | CV SEP (%)
VIC   | 3.72 / 0.61 / 1.14      | 0.88  | 0.29           | 0.0047 | 0.62               | -          | 60                 | 15
NSW   | 4.62 / 0.69 / 2.05      | 1.14  | 0.29           | 0.0078 | 0.76               | -          | 60                 | 15
QLD   | 5.20 / 0.65 / 1.70      | 1.06  | 0.16           | 0.0041 | 0.81               | -          | 42                 | 12
TAS   | 4.77 / 0.79 / 2.11      | 0.56  | 0.39           | 0.016  | 0.80               | -          | 72                 | 22
WA    | 0.32 / 0.82 / 1.19      | 0.97  | 0.88           | 0.010  | 0.81               | -          | 122                | 19
Figures 60 and 61 show example plots of the statistics used in selecting the best set of
predictor variables for the CV model for the state of NSW. Sample figures for the other
states can be seen in Appendix C. Figure 60 shows the MEV, the standard error of the MEV
and the R2 GLSR values for the CV model. Combination 6, with a constant and the two
predictor variables area and 2I12, showed the lowest MEV, one of the highest R2 GLSR
values, and some of the lowest Akaike information criterion (AIC) and Bayesian
information criterion (BIC) values. However, the lowest average variance of prediction
old (AVPO) and average variance of prediction new (AVPN) were found for combination 1 (a
constant value; see Figure 61). The adopted combinations of the predictor variables are
as noted in Chapter 5, Table 4, column 2.
The Bayesian plausibility value (BPV) was used to carry out a hypothesis test (at the 5%
significance level) on the predictors of combination 6. The BPVs were found to be 79% and
11% for area and 2I12 respectively, which shows the predictors not to be significant for
the estimation of CV at ungauged sites. Both the posterior coefficients β1 and β2 were
less than two posterior standard deviations away from zero, supporting the BPV result
that these variables are not significant.
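The two posterior standard deviation screen described above is easy to express in code; the coefficient and standard deviation values below are invented for illustration.

```python
def well_defined(beta_posterior_mean, beta_posterior_sd):
    """A coefficient is treated as well defined when its posterior mean lies
    more than two posterior standard deviations away from zero."""
    return abs(beta_posterior_mean) > 2.0 * beta_posterior_sd

print(well_defined(0.08, 0.06))  # False: within two sd of zero, not significant
print(well_defined(0.61, 0.05))  # True: well defined in the prediction equation
```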
The result above suggests that it may be possible to adopt a regional average CV value for
NSW without using any prediction equation/predictor variable. This finding is consistent
with Chapter 5 where it was found that a constant model for a regional skewness was the
best model for NSW and other Australian states. The finding above is also supported by the
fact that there was only a modest difference in the MEV values where combination 6
showed an MEV of 0.0076 compared to an MEV of 0.0078 for combination 1.
A similar outcome was observed for the estimation of CV for all the Australian states
(see figures in Appendix C). While there were cases where the prediction equations showed
reasonably high R2 GLSR and low MEV and AVP values, the BPV results consistently showed
these variables to be not significant. For this study, the simplest model was always
preferred.
[Figure 60: R2 GLSR (left axis) and MEV with its standard error (right axis) for predictor variable combinations 1 to 16]
Figure 60 Selection of predictor variables for the BGLSR model for CV
[Figure 61: AVPO, AVPN, AIC and BIC values for predictor variable combinations 1 to 16]
Figure 61 Selection of predictor variables for the BGLSR model for CV using AVPO, AVPN, AIC and
BIC
Figures 62 and 63 show example plots of the statistics used in selecting the best set of
predictor variables for the mean flood model for NSW. According to the MEV, combinations
3, 4, 5, 6, 10 and 11 were potential sets of predictor variables for the mean flood
model. Combinations 5, 6 and 11 contained two predictor variables with similar MEV and
R2 GLSR values.

The AVPO, AVPN, AIC and BIC values favoured combination 6, and hence this was finally
selected as the best set of predictor variables for the mean flood model, which includes
area and the design rainfall intensity 2I12. Both posterior coefficients β1 and β2 were
found to be seven posterior standard deviations away from zero, suggesting these two
variables are well defined in the prediction equation. The BPVs for the regression
coefficients associated with area and 2I12 were smaller than 0.001%, confirming their
significance. Combination 6 was selected for all the mean flood models for all the
Australian states in the validation.
[Figure 62: MEV, standard error of MEV and R2 GLSR for predictor variable combinations 1 to 16]
Figure 62 Selection of predictor variables for the BGLSR model for the mean flood
[Figure 63: AVPO, AVPN, AIC and BIC values for predictor variable combinations 1 to 16]
Figure 63 Selection of predictor variables for the BGLSR model for the mean flood using AVPO,
AVPN, AIC and BIC
8.7.7 BGLSR RESULTS FOR MEAN AND CV MODELS USING ROI
The regression equations based on the sets of predictor variables selected were used in the
ROI approach. The results obtained in the ROI approach were then used in the validation of
the LFRM_Ne model (i.e. Equation 8.3 in section 8.6). In the ROI approach, an optimum
region was formed for each of the 28 test catchments (see Figure 40, Chapter 7). As stated
earlier, the prediction equation for the mean flood used a ROI of 30-40 stations, while 65-
80 stations were used for the CV model, based on the findings from past studies (e.g.
Haddad and Rahman, 2012 and Rahman et al. 2012) and the state in question. The
summary of the various regression diagnostics (as described in section 3.8 and Equation
3.41, Chapter 3) for each test catchment is provided in Table 50 for the different Australian
states.
Table 50 shows that for the mean flood model (for all the states), the MEV and average
SEP values are much higher than those of the CV models. This indicates that the mean
flood model exhibits a higher degree of uncertainty than the CV models (i.e. the mean
flood would introduce more uncertainty into the LFRM model than the CV). An important
point here is that sampling error dominates the total error in the estimation of CV,
whereas for the mean flood the total error is dominated by model error; therefore, in the
case of CV the spatial variation is a second-order effect that is not really detectable.
This is apparent in both the fixed region and ROI approaches.
Table 50 Regression diagnostics for the ROI approach for the various Australian states and
test catchments

State / Station No. | Mean flood MEV | CV MEV | Mean flood R2 GLSR | CV R2 GLSR | Mean flood SEP (%) | CV SEP (%)
VIC
221210 | 0.21 | 0.007 | 0.61 | - | 50 | 17
225218 | 0.13 | 0.007 | 0.63 | - | 38 | 18
227211 | 0.22 | 0.008 | 0.66 | - | 50 | 18
401210 | 0.23 | 0.007 | 0.50 | - | 53 | 18
403213 | 0.22 | 0.008 | 0.59 | - | 51 | 18
404206 | 0.23 | 0.008 | 0.60 | - | 52 | 18
NSW
203012 | 0.23 | 0.012 | 0.80 | - | 54 | 18
210014 | 0.29 | 0.012 | 0.80 | - | 60 | 17
215004 | 0.20 | 0.011 | 0.80 | - | 49 | 17
410057 | 0.30 | 0.011 | 0.81 | - | 61 | 16
412050 | 0.33 | 0.011 | 0.81 | - | 65 | 16
419029 | 0.22 | 0.012 | 0.79 | - | 51 | 17
QLD
108002 | 0.12 | 0.012 | 0.78 | - | 35 | 17
116015 | 0.13 | 0.011 | 0.77 | - | 37 | 17
140002 | 0.19 | 0.009 | 0.64 | - | 44 | 19
416410 | 0.15 | 0.011 | 0.77 | - | 40 | 20
422394 | 0.13 | 0.011 | 0.80 | - | 37 | 20
919013 | 0.12 | 0.012 | 0.78 | - | 35 | 17
WA
607012 | 0.52 | 0.017 | 0.85 | - | 88 | 22
608004 | 0.49 | 0.016 | 0.83 | - | 83 | 22
610001 | 0.41 | 0.014 | 0.82 | - | 74 | 22
610007 | 0.43 | 0.014 | 0.82 | - | 76 | 23
612008 | 0.59 | 0.016 | 0.79 | - | 96 | 23
612010 | 0.53 | 0.015 | 0.80 | - | 89 | 23
TAS
2204   | 0.42 | 0.024 | 0.75 | - | 68 | 20
4201   | 0.41 | 0.023 | 0.78 | - | 66 | 20
304040 | 0.34 | 0.023 | 0.80 | - | 60 | 20
308799 | 0.38 | 0.023 | 0.85 | - | 63 | 20
For the mean flood model (for all the states), the ROI approach generally gives a smaller
MEV than the fixed region approach (compare Tables 50 and 49), which in turn gives lower
SEP values. Also, the R2 GLSR values for the mean flood model (all the states) with the
ROI approach are in most cases higher than for the fixed region approach. These results
indicate that the ROI approach should be preferred over the fixed region approach for
developing the mean flood model for use with the LFRM_Ne model. The MEV and SEP values
for the CV model are very similar for the fixed region and ROI approaches for all the
states (see Tables 49 and 50), indicating that either approach is suitable for developing
the CV model for use with the LFRM_Ne model. For the validation of the LFRM_Ne model in
this study, the CV model based on ROI is used.
From the above analysis it is clear that if a MOM estimator were used to estimate the MEV
(σ̂²) for the CV model, the uncertainty would have been grossly underestimated, as the
sampling error has heavily dominated the regional analysis. This would lead to an
over-reliance on the regional model. A more reasonable estimate of the MEV has been
achieved in this study with the Bayesian MEV estimator, as it represents the value of σ̂²
by computing expectations over the entire posterior distribution. One can see that the
exponential prior used in the Bayesian analysis has some influence on the posterior
distribution for the CV model. In the case of the CV and the ROI approach for NSW (as
shown in Figure 64), the posterior density function for the MEV is non-zero at the
origin, as will always be the case when λ > 0.
Figure 64 Prior and posterior pdf's for the model error variance for CV (right) and the mean flood
(left) models for NSW state
8.8 VALIDATION
The prediction equations developed above using the ROI approach, and Equation 8.3 (the
LFRM_Ne model), were applied to the 28 test catchments, which were not used in developing
the prediction equations. To make the comparison more useful and to benchmark the
LFRM_Ne model, the developed prediction equations were also used to estimate the mean
flood and CV for the PM (world) model developed by Majone et al. (2007). It should be
pointed out, however, that the PM (world) model does not contain any of the data used to
develop the Australian LFRM. The validation analysis was undertaken for ARIs up to 1000
years. ARIs in the range of 50 to 100 years were compared with at-site flood frequency
analysis (FFA) estimates (obtained from the fitted LP3 distribution; see Chapter 3 for
more details). Validating beyond the 100-year ARI against at-site FFA estimates was not
viewed as reliable given the very large extrapolation errors involved; indeed, any
validation results beyond the 100-year ARI would be of little significance for most of
the stations.
For the larger ARIs (200, 500 and 1000 years), comparison was made against the results
obtained from another regional method where the parameters of the LP3 distribution (i.e.
mean, standard deviation and skew) were regressed against catchment characteristics
(known as the PRT - see Chapters 3 and 5 for more details) and flood quantiles were then
derived for the 200-, 500- and 1000-year ARIs. The extrapolation of these distributions to
the large ARIs also involves a large degree of uncertainty.
To assess how well the developed prediction equations approximate the observed flood
quantiles, two numerical measures were applied. Relative bias (BIASr, defined by Equation
8.10) was used to assess whether the flood quantiles predicted by the LFRM_Ne or PM
(world) models systematically under- or overestimated the at-site FFA or PRT estimates on
average over all 28 test catchments.
BIASr = (1/ntest) Σ_{i=1..ntest} [ (LFRM_Ne_i (PM(world)_i) − FFA_i (PRT_i)) / LFRM_Ne_i (PM(world)_i) ] × 100    (8.10)

where ntest is the number of test catchments (28) used in the validation.
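Equation 8.10 can be sketched as follows; the quantile values are invented, and a positive result indicates that the regional model overestimates the benchmark on average.

```python
def relative_bias(predicted, benchmark):
    """Relative bias (Equation 8.10), in percent, of the regional (LFRM_Ne or
    PM world) quantiles against the FFA/PRT benchmark quantiles."""
    n = len(predicted)
    return 100.0 / n * sum((p - b) / p for p, b in zip(predicted, benchmark))

# invented 1000-year quantiles (m3/s) for three test catchments
lfrm_ne = [420.0, 150.0, 980.0]
ffa_prt = [400.0, 160.0, 900.0]
print(round(relative_bias(lfrm_ne, ffa_prt), 1))
```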
The relative error values (REr, defined by Equation 3.44, Chapter 3) with respect to the
at-site FFA or regional PRT estimates were also obtained. These are by no means the true
errors of the LFRM_Ne or PM (world) models; the errors represented here by BIASr and REr
may be taken as a reasonable indication of the consistency of the LFRM_Ne or PM (world)
models relative to the FFA and PRT estimates, both of which are themselves associated
with a higher degree of uncertainty due to the considerable extrapolation involved. It is
worth noting that in calculating the median relative error (REr), the sign of the
relative errors was ignored.
Table 51 summarises the various error statistics for the LFRM_N (i.e. no spatial
dependence) and LFRM_Ne models (considering the pooling of 1 max, 3 max and 5 max) and
the PM (world) model based on the 28 test catchments. If the issue of spatial dependence
in the Australian dataset is ignored, estimation for the 1000-year ARI using the LFRM_N
model suffers from minor underestimation on average (e.g. BIASr of 1%) for the ungauged
catchment case. Moreover, Table 51 shows that for 1 max, and when the pooling of more
data is undertaken (i.e. 3 max and 5 max) and spatial dependence is compensated for
(LFRM_Ne), the BIASr is well corrected. For example, for the 1000-year ARI, the BIASr
values for 1 max, 3 max and 5 max with the LFRM_Ne are 5, 8 and 9% overestimation on
average, respectively.
Focusing on the discussion for the 5 max results, for the ARIs of 50 to 1000-years, the
BIASr values are positive for both the LFRM_Ne and PM (world) models suggesting an
overestimation (on average) by both models. When compared to the results of the
preliminary LFRM models (i.e. Haddad et al., 2011b), the results obtained in this chapter
present a significant improvement: in Haddad et al. (2011b) the underestimation was up to
40% on average. By pooling more data and also accounting for inter-site dependence in the
LFRM model, the underestimation problem has been rectified. The results, as benchmarked
against the PM (world) model, are reassuring; this places a higher degree of confidence in
the estimates given by the LFRM_Ne model developed here.
The REr values in Table 51 show acceptable results, which are comparable to similar
regional models for the smaller ARI ranges (see Chapter 5 and also Rahman et al., 2012).
Focusing on the 5 max results, the REr values range from 31% to 61% (very comparable to
the PM (world) model), which suggests that the LFRM_Ne model performs very well given
the higher uncertainty associated with the larger ARI estimation using FFA
and PRT. It should be noted that in the PM (world) dataset most of the stations were
sufficiently well separated to be effectively independent of each other, which is why
Majone et al. (2007) did not need to work out an effective number of sites. This may also
be why the PM (world) model performs quite well in the validation here. The LFRM_Ne
model in this study refines the approach of the PM (world) model, since significant inter-site
dependence exists between stations in the Australian dataset.
A confidence interval plot of the BIASr values is given in Figure 65, which displays the
central tendency and variability of the sample BIASr values: the mean value (circle
symbol) with a 95% confidence interval bar for the 100- to 1000-year ARI flood quantiles.
While the mean values appear to differ between the two methods (i.e. the LFRM_Ne and
PM (world) models), the difference is not significant because the interval bars overlap,
suggesting that the LFRM_Ne model is comparable to the PM (world) model. Moreover, it
shows that consistency is achieved for the 3 and 5 max pooling LFRM_Ne models, as the
mean values and the spread of the BIASr values are very similar to those of the PM (world)
model.
Overall, the results show good agreement between the estimates of the LFRM_Ne/PM
(world) models and the at-site FFA/PRT results. For the 1000-year ARI (5 max), the results
can be regarded as 'good' for 20 of the 28 test catchments, 'acceptable' for 2, and 'poor'
for the remaining 6. Such results are typical of Australian RFFA studies even in the range
of ordinary ARIs (e.g. 2 to 100 years).
It was also found that the catchments showing underestimation were common to both
methods. It is worth noting that the LFRM_Ne model, on average, shows overestimation
relative to the PRT quantile estimates for some of the test catchments at the 500- and
1000-year ARIs. This is a vast improvement compared to the preliminary LFRM model
presented by Haddad et al. (2011b), where 17 out of the 18 test catchments showed
underestimation. The improvement in the results for the LFRM_Ne model developed here
may be attributed to the fact that the model pools more data and corrects for the spatial
dependence of the pooled standardised data. Indeed, taking into account the degree of
inter-station correlation has clearly reduced the negative bias of the flood quantile
estimates. It is envisaged that, as part of the future assessment of the LFRM_Ne model,
comparisons will be made against design flood estimates obtained by alternative methods
(e.g. spillway design and dam safety studies based on design-rainfall-based approaches).
Table 51 Summary of error statistics obtained from independent testing associated with the LFRM model

1 max LFRM_N
                     BIASr (%)                  REr (%)
ARI (years)    LFRM_N    World Model    LFRM_N    World Model
50             30        39             53        60
100            12        23             54        61
200            12        26             29        33
500            6         24             34        30
1000           -1        19             38        32

1 max LFRM_Ne
                     BIASr (%)                  REr (%)
ARI (years)    LFRM_Ne   World Model    LFRM_Ne   World Model
50             47        39             57        60
100            25        23             62        61
200            23        26             29        33
500            14        24             31        30
1000           5         19             34        32

3 max LFRM_Ne
                     BIASr (%)                  REr (%)
ARI (years)    LFRM_Ne   World Model    LFRM_Ne   World Model
50             50        39             57        60
100            29        23             61        61
200            26        26             31        33
500            18        24             31        30
1000           8         19             35        32

5 max LFRM_Ne
                     BIASr (%)                  REr (%)
ARI (years)    LFRM_Ne   World Model    LFRM_Ne   World Model
50             51        39             57        60
100            30        23             61        61
200            28        26             31        33
500            19        24             32        29
1000           9         19             35        32
Figure 65 Confidence interval plot of BIASr values with the LFRM_Ne and PM (world) models for the 28 test catchments
8.9 SUMMARY
This chapter has developed and tested the performance of a new LFRM that also accounts
for spatial dependence in the AMFS data. The model uses a comprehensive Australian
AMFS dataset consisting of 654 stations.
To estimate the equivalent number of independent sites (Ne), a simple model was derived
that ignored possible variation with ARI. To be able to establish meaningful results
regarding spatial dependence, the analysis was also carried out on simulated datasets to
check the sampling and homogeneity issues. Overall, the experimental results showed that
spatial dependence decreased with larger network sizes generally and that some Australian
states exhibited a greater degree of spatial dependence than others. While there were
limitations with this analysis, a reasonable indication of the behaviour of Ne was established.
The spatial dependence model was then generalised by developing an empirical
relationship between Ne and the average correlation coefficient in a network of the AMFS
data. To avoid inter-regional variation between the states, a general Australian spatial
dependence model was established. To be able to determine the functional form of the
spatial dependence model the analysis was carried out for the real and simulated datasets. It
was shown that both the real and simulated model coefficients were quite similar. It was
also illustrated that the scatter in the generalised spatial dependence model estimates
increased with increasing number of stations (N).
The LFRM was then revisited in light of spatial dependence, as established with the
derived generalised spatial dependence model. By pooling the top 5 maxima and
correcting the plotting position points, the regional growth curves showed a shift upwards,
and the new LFRM (termed the LFRM_Ne model henceforth) was considered reliable up
to the 3000-year ARI.
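The fitted Ne model itself is not reproduced in this summary. As a hedged illustration, the sketch below uses the standard equicorrelation closed form Ne = N / (1 + (N − 1)ρ̄) as a stand-in for the thesis's empirical spatial dependence model, and shows how replacing raw station-years with Ne-based effective station-years changes the plotting position (here via Cunnane's formula, which may differ from the formula actually used) and so shifts the growth curve upwards.

```python
def effective_sites(n_sites, rho_bar):
    # Effective number of independent sites for a network of n_sites
    # stations sharing a constant average inter-site correlation rho_bar.
    # This equicorrelation closed form is an illustrative stand-in for
    # the empirical Ne model fitted in the thesis.
    return n_sites / (1.0 + (n_sites - 1.0) * rho_bar)

n_sites, years, rho_bar = 100, 30, 0.1
ne = effective_sites(n_sites, rho_bar)

# Station-years of pooled record, uncorrected and corrected for dependence.
m_raw = n_sites * years   # 3000 station-years if all sites were independent
m_eff = ne * years        # far fewer effective station-years

# ARI assigned to the largest pooled value (rank i = 1) by the Cunnane
# plotting position, ARI = (M + 0.2) / (i - 0.4).
ari_raw = (m_raw + 0.2) / (1 - 0.4)
ari_eff = (m_eff + 0.2) / (1 - 0.4)
# ari_eff < ari_raw: the corrected curve assigns a smaller ARI to the same
# flood magnitude, i.e. the regional growth curve shifts upwards.
```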
In the last few sections of this chapter the LFRM_Ne model was applied to the ungauged
catchment case, where 28 test catchments not used in the development of the LFRM model
were used in the validation. This was achieved by developing regional regression equations
for the mean flood and CV of the AMFS data as functions of catchment/climatic
characteristics; BGLSR (see Chapters 3 and 5 for details) and the ROI framework were
used to achieve this. It was found that the mean flood can be described by two predictors,
catchment area and a representative design rainfall intensity. The CV showed no real
dependence on any predictor, and as such a regional average value was adopted for all the
states.
Finally, this chapter presented a validation which was undertaken to compare the flood
estimates from the LFRM_Ne model to those from established methods. For the estimation
up to the 100-year ARI the LFRM_Ne model results were compared to at-site flood
frequency analysis (FFA) results. For the larger ARIs (i.e. greater than 100-year ARI) they
were compared to estimates from the parameter regression technique. The LFRM_Ne model
was also benchmarked against the world model (i.e. PM (world)) as established by Majone
et al. (2007). It was found that the LFRM_Ne models that pool 3 and 5 maxima were able
to estimate the 1000-year ARI flood quantile with only small positive bias on average, with
very acceptable median relative errors. When compared with the PM (world) model, the
LFRM_Ne model produces consistent results. A note is made here that the dataset used to
establish the LFRM is totally independent of the PM (world) dataset. Overall the results
from the LFRM_Ne model are considered to be an improvement over the results of the
preliminary LFRM model by Haddad et al. (2011b). This indeed presents a notable
improvement to the way large floods can be estimated by regional methods for ungauged
catchments in Australia and the world.
Slight underestimation still exists with the developed LFRM_Ne model for some of the test
catchments. This is to be expected, as any RFFA model generally cannot explain all the
variability found in the data, given the simplicity of the RFFA approaches and the
associated data errors. It is envisaged that further improvements and refinements, outlined
in more detail in Chapter 9, can be made in the future.
CHAPTER 9: CONCLUSIONS
9.1 INTRODUCTION
This thesis focuses on the design flood estimation problem in ungauged catchments using
regional flood frequency analysis (RFFA). In particular, it investigates the research
question of how flood quantile estimation in ungauged catchments can be improved by
adopting an ensemble of advanced statistical techniques. These techniques include
Bayesian generalised least squares regression (BGLSR), the region of influence (ROI)
approach, and leave-one-out (LOO) and Monte Carlo cross validation (MCCV) procedures.
A large flood regionalisation model (LFRM), which explicitly accounts for the spatial
dependence in the annual maximum flood series (AMFS) data in the regional flood
modelling, is also proposed and investigated. The thesis also emphasises the importance of
collating a quality-controlled flood database and of uncertainty estimation in RFFA
methods.
Design flood estimation in the range of frequent to medium (2 – 100 years) and large to
rare (>100 to 2000 years) average recurrence intervals (ARI) is frequently required in the
design of many engineering works such as design of canals, spillways, dams, bridges, water
intakes, land use planning and flood insurance studies. These sorts of infrastructure works
and investigations are of notable economic significance, as highlighted in Chapter 1.
Traditionally, there have been several methods that are frequently adopted for these tasks.
For the frequent to medium floods, the most commonly adopted RFFA methods for small
to medium sized ungauged catchments include the probabilistic rational method (PRM), the
index flood method (IFM) and the quantile regression technique (QRT). In south–east
Australia, the PRM was recommended for general use in Australian Rainfall and Runoff
(ARR), mainly due to its simplicity and ease of application (I.E. Aust., 1987).
This thesis advocates the use of regression-based RFFA methods under the BGLSR
framework rather than PRM. The BGLSR has been developed and tested with the QRT and
the parameter regression technique (PRT). In forming the regions, both the fixed region and
ROI approaches have been examined in the range of frequent to medium ARI floods. The
detailed validation of the regional hydrological regression models has also been undertaken
using the popular LOO validation and the relatively new MCCV procedures.
In addition, a simple LFRM that accounts for spatial dependence in the AMFS data for
estimating large to rare floods at both gauged and ungauged sites has been developed. The
new LFRM is easy to use and offers an alternative to the traditional rainfall-based methods.
While summaries of the various modelling, development and testing tasks have been
provided at the end of each chapter of the thesis, an overview and the major findings of the
thesis are presented below.
9.2 OVERVIEW OF THE STUDY
9.2.1 DATA SELECTION (CHAPTER 4)
Initially, over 1000 stations across the Australian continent were selected for the study
based on a number of criteria, such as catchment size, streamflow record length,
streamflow data quality, degree of regulation, urbanisation and land use change. Further
examination indicated that many of these stations did not satisfy the criteria of
homogeneity and representativeness for the purpose of RFFA. Moreover, to reduce the
potential effects of inter-decadal variability, the minimum length of records (after infilling
of missing records) was increased up to 25 years where possible. This was necessary due to
the presence of a long drought that affected many stations after the late 1980s. The stations
that suffered from excessive error, due to rating curve extrapolation, were excluded.
Finally, a total of 682 catchments were selected for the study. These catchments are mainly
rural with no known major land use changes over the periods of streamflow records.
An outlier test was conducted for each of the selected stations. The influence of errors on
flood frequency curves from the extrapolation of rating curves was minimised by placing
limits on the degree of extrapolation involved in estimating the largest observed flood
events using the in-built tool in the FLIKE software (which implements the principles
outlined in Kuczera, 1999a, b). A total of 8 catchment characteristics that are perceived to
mainly govern the flood generation process and are relatively easy to obtain were selected
for this study. These catchment characteristics data were extracted for each of the selected
catchment (refer to Chapter 4 for more details).
9.2.2 RFFA IN THE FREQUENT TO MEDIUM ARI RANGE (CHAPTER 5)
Flood prediction equations were developed and compared for the states of New South
Wales (NSW), Victoria, Queensland and Tasmania (for ARIs of 2, 5, 10, 20, 50 and 100
years). Both the fixed region and ROI approaches in the QRT and PRT frameworks were
adopted, where the quantiles and parameters (i.e. mean, standard deviation and skew) of the
log Pearson Type 3 (LP3) distribution were regressed against the selected set of climatic
and catchment characteristics variables. The BGLSR procedure was adopted for the
estimation of the regression coefficients. The developed prediction equations (i.e. the
regression coefficients) were assessed for the ungauged catchment case by adopting a LOO
validation procedure.
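Prediction equations of this kind are log-linear in the catchment characteristics, i.e. of the form Q_T = a · Area^b · I^c. The sketch below fits such an equation to synthetic data by ordinary least squares in log space, as a simplified stand-in for the BGLSR actually used in the thesis; the catchment values and "true" coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50

# Synthetic catchments: area (km^2) and design rainfall intensity (mm/h).
area = rng.uniform(10, 1000, n)
intensity = rng.uniform(20, 80, n)

# Hypothetical underlying relation log10(Q_T) = 0.5 + 0.7*log10(A) +
# 1.2*log10(I) plus noise, standing in for real regional flood data.
log_q = 0.5 + 0.7 * np.log10(area) + 1.2 * np.log10(intensity) \
        + rng.normal(0, 0.1, n)

# Ordinary least squares in log space (the thesis uses BGLSR, which
# additionally weights sites by sampling and model error covariance).
X = np.column_stack([np.ones(n), np.log10(area), np.log10(intensity)])
coef, *_ = np.linalg.lstsq(X, log_q, rcond=None)

# Predicted quantile for a hypothetical ungauged catchment
# (area 250 km^2, intensity 45 mm/h).
q_hat = 10 ** (coef[0] + coef[1] * np.log10(250.0) + coef[2] * np.log10(45.0))
```

The recovered exponents approximate the assumed 0.7 and 1.2, illustrating why catchment area and design rainfall intensity alone can carry most of the predictive power in the QRT.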
9.2.3 MCCV VS LOO (CHAPTER 6)
Selecting the right regression model and ascertaining its predictive power are important
steps in any regional hydrologic regression analysis, which are usually undertaken by some
kind of validation e.g. split sample validation. This thesis assessed the performances of the
most commonly adopted LOO validation against the relatively new MCCV procedure.
Both validation procedures (i.e. LOO and MCCV) were carried out under the ordinary
least squares regression (OLSR) and GLSR frameworks for the estimation of flood
quantiles using simulated and regional flood data from the state of NSW in Australia.
9.2.4 LARGE TO RARE FLOOD ESTIMATION (CHAPTERS 7 and 8)
An overview of inter-site dependence in the Australian AMFS data was discussed.
Determination of homogenous regions and the identification of an appropriate probability
distribution were investigated and discussed in the context of the LFRM. The issues
relating to concurrent record lengths for the establishment of meaningful networks to carry
out the analysis of spatial dependence were presented. The theory of inter-site dependence
was outlined, and a simple model for estimating the effective number of independent sites
was derived.
Finally, the methodology underpinning the LFRM was developed for the Australian
continent and was applied with the developed spatial dependence model coupled with the
BGLSR. Here, the BGLSR was used to develop the prediction equations for the mean and
coefficient of variation (CV) of the annual maximum flood series data. The LFRM was
developed and tested to estimate large to rare floods for both the gauged and ungauged
catchment case. A split-sample validation was also carried out to compare the results of the
LFRM with the established methods such as the PRT (refer to Chapters 3 and 5 for more
details) and the international method (i.e. World Model).
9.3 CONCLUSIONS
9.3.1 DESIGN FLOOD ESTIMATION IN THE FREQUENT TO MEDIUM ARI
RANGE
It has been found that the ROI performs better than the fixed region approach in
RFFA. Hence, the ROI approach should be used where there are enough
geographically contiguous gauged catchments in a state/region.
It has been found that the Bayesian GLSR is preferable to OLSR in developing the
prediction equations for flood quantiles and flood statistics.
It has been found that the QRT-ROI and PRT-ROI perform very similarly. Hence,
the PRT is a viable alternative to QRT for design flood estimation in ungauged
catchments. The developed RFFA methods based on the QRT-ROI and PRT-ROI
allow design flood estimation along with its associated uncertainty (in the form of
confidence limits) given the relevant catchment characteristics data for the gauged
or ungauged catchment of interest.
It has been found that catchment area and design rainfall intensity are adequate for
the estimation of the flood quantiles with the QRT. Furthermore, catchment area,
design rainfall intensity, mean annual evaporation, mean annual rainfall, main
stream slope and forest cover are needed in the PRT for the estimation of the second
and third parameters of the LP3 distribution.
LOO validation indicates that the ROI based on the minimisation of the predictive
uncertainty leads to more efficient and accurate flood quantiles estimates by both
the QRT and PRT. The regression diagnostics reveal that the catchment
characteristics variables alone may not pick up all the heterogeneity in the regional
model. Both the BGLSR based QRT-ROI and PRT-ROI methods show
improvements in regional heterogeneity with an increase in the average pseudo R2_GLS
and a decrease in the model error variance, average variance of prediction and
the average standard error of prediction.
Both the standardised residual and quantile-quantile plots of the ROI analysis
satisfied the underlying model assumptions better than the fixed region regression.
It has been found that both BGLSR QRT-ROI and PRT-ROI produce smaller
average relative root mean square errors and median relative errors when compared
to the fixed region regression approach. Based on the evaluation statistics overall it
has been found that there are only modest differences between the BGLSR QRT-
ROI and PRT-ROI.
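To illustrate how the PRT turns regressed LP3 parameters into a flood quantile, the sketch below applies the Wilson-Hilferty approximation to the Pearson Type 3 frequency factor K_T, so that log10(Q_T) = M + K_T·S for regressed mean M, standard deviation S and skew g of log10 flows. The parameter values are hypothetical, not taken from the thesis.

```python
from statistics import NormalDist

def lp3_quantile(mean_log, std_log, skew, aep):
    # Flood quantile from LP3 parameters of log10(Q), for a given annual
    # exceedance probability `aep` (e.g. 0.01 for the 100-year event).
    z = NormalDist().inv_cdf(1.0 - aep)  # standard normal deviate
    if abs(skew) < 1e-6:
        k = z  # zero skew: LP3 reduces to a log-normal quantile
    else:
        g = skew
        # Wilson-Hilferty approximation to the Pearson III frequency
        # factor, adequate for moderate skew values.
        k = (2.0 / g) * ((1.0 + g * z / 6.0 - g * g / 36.0) ** 3 - 1.0)
    return 10 ** (mean_log + k * std_log)

# Illustrative regressed parameters for a hypothetical ungauged catchment:
# mean 2.5, standard deviation 0.3 and skew -0.2 of log10 flows.
q100 = lp3_quantile(mean_log=2.5, std_log=0.3, skew=-0.2, aep=0.01)
```

Note that a negative regressed skew pulls the upper-tail quantile below the log-normal value, which is why the PRT needs the second and third LP3 parameters and not just the mean.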
9.3.2 VALIDATION OF REGIONAL HYDROLOGICAL REGRESSION MODELS
From the simulation and real data examples, it has been found that when developing
regional hydrologic regression models, application of GLSR based MCCV
validation procedure is likely to result in the most parsimonious model as opposed
to the OLSR based LOO, OLSR based MCCV and GLSR based LOO validation
procedures.
The GLSR based MCCV has been found to exhibit the smallest mean squared
errors of prediction and also has fewer instances of problems with collinearity of
predictor variables as compared to the OLSR LOO and OLSR MCCV validation
procedures.
It has also been found that the MCCV and corrected MCCV (CMCCV) can
provide a more reasonable estimate of a model's predictive ability than LOO, and that
the CMCCV has the potential to offer a reasonable improvement over the MCCV in
estimating the predictive ability of a regional hydrologic regression model.
9.3.3 LARGE TO RARE FLOOD ESTIMATION
The development and application of a simplified LFRM that pools the top 3 and 5
annual maximum flood values from member sites in a region, coupled with the
BGLSR and a newly developed spatial dependence model have been established for
Australia.
A simple model for the effective number of independent stations (Ne) has been
developed that ignores possible variation with ARI. Meaningful results regarding
spatial dependence are established by undertaking the analysis on simulated
datasets to counteract sampling and homogeneity issues.
Overall, the experimental results of the analysis show that, in general, spatial
dependence decreases with larger network size and that some Australian states have
more spatial dependence than others. While there are some limitations with this
analysis, a reasonable indication of the behaviour of Ne has been established.
Using the derived generalised spatial dependence model, the LFRM has been
corrected for spatial dependence by correcting the plotting position points of the
LFRM frequency distribution curve and as such the regional growth curves all show
a shift upwards.
Finally, the LFRM has been applied to the ungauged catchment case. An
independent validation shows that the developed LFRM is able to estimate design
floods for 100 to 1000 years ARIs with reasonable confidence as compared to the
world model.
Overall, the newly developed LFRM coupled with BGLSR and a
spatial dependence model offers a powerful yet simple method of regional flood
estimation for floods of large to rare ARIs.
9.4 LIMITATIONS AND SUGGESTIONS FOR FUTURE RESEARCH
The RFFA methods for the frequent ARIs developed in this study were based on the flood
database available in eastern Australia up to the years 2004/2005. It is expected that
availability of a more comprehensive database (in terms of both quality and quantity) will
further improve the predictive performance of both the fixed and ROI based RFFA
methods presented in this study; this, however, needs to be investigated in the future when
such a database is available. Furthermore, with the availability of a more comprehensive
database, further research should be directed at incorporating the effects of climate change
into the developed RFFA models.
In the case of the BGLSR–QRT or PRT approaches, most of the uncertainty can be
accounted for on the left-hand side of the equation, i.e. in the dependent variable. In most
cases the predictor variables (e.g. design rainfall) are also subject to various errors
(sampling, measurement and model errors). There has been no study on the effects of these
errors on regional flood estimates. Therefore, the design flood estimates obtained in this
thesis may be biased, in the sense of overestimating the model error variance, leading to
uncertain regression coefficients and to uncertainty in the statistical diagnostics that rely
on the model error variance, such as the standard error of prediction and the average
variance of prediction.
In the conventional approach of RFFA using regression based procedures such as the
BGLSR-QRT or PRT approaches the predictor variables (for example design rainfall
intensity) that are statistically significant are chosen according to some goodness-of-fit
measure. The resulting regression relationship, along with the chosen predictor variables is
believed to be the "true" form of the model. In principle, this assumption is imperfect and
not satisfied in two respects:
(i) the predictor variables in the analysis are treated as fixed (i.e. non-random,
assumed not to follow a probability distribution); and
(ii) the predictor variables (e.g. design rainfall intensity) have underlying errors
(sampling, model and measurement errors), which are often ignored in the
analysis.
Firstly, the assumption of fixed predictor variables may not be satisfied in a hydrological
context. For example, to estimate the 10-year ARI flood quantile (Q10), we first estimate
values for the predictor variables (e.g. area, rainfall intensity and slope) and then estimate
Q10. In this case, the analysis treats the values of the predictor variables as fixed, which is
not considered a random outcome, as outlined in Koop (2008).
Secondly, it is assumed that the predictor variables are error free. However, the predictor
variables used in our RFFA study, such as the design rainfall intensity values published in
Australian Rainfall and Runoff (ARR) (I.E. Aust., 1987), are likely to suffer from a great
deal of uncertainty/error; for example, they were estimated from a limited rainfall dataset,
with many stations having very short records. The rainfall intensity estimates were fitted
with the LP3 distribution using the method of moments estimator. Thus, the estimates were
subject to a variety of errors (e.g. sampling variability, model error), which may contribute
to the overall errors in the final flood quantile estimates.
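A classical consequence of ignoring such predictor error is attenuation of the regression slope toward zero. The simulation below is purely illustrative (synthetic data, not the thesis dataset): it compares the OLS slope obtained with an error-free predictor against the slope obtained when the same predictor is contaminated with measurement error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# "Error-free" predictor (think of it as the true design rainfall
# intensity in log space) and the response it generates.
x_true = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x_true + rng.normal(0.0, 0.3, n)

# Observed predictor contaminated with measurement/sampling error.
x_obs = x_true + rng.normal(0.0, 0.5, n)

def slope(x, y):
    # OLS slope = cov(x, y) / var(x), both with ddof = 0 for consistency.
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

b_true = slope(x_true, y)  # close to the true value of 2.0
b_obs = slope(x_obs, y)    # attenuated by roughly
                           # var(x) / (var(x) + var(err)) = 1 / 1.25 = 0.8
```

In a RFFA setting this bias would propagate into the regression coefficients and, in turn, into the model error variance and the prediction diagnostics discussed above.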
To this end, some specific questions arise: What improvements in the final flood quantiles
can be gained from including all the possible errors (in both the dependent and predictor
variables) in the analyses? How can this uncertainty in RFFA be quantified and used to
develop confidence limits for flood quantile estimates? This is a very involved area of
research and, as such, was beyond the scope of this thesis. However, it is recommended
that this research be undertaken, as it will provide a new dimension to our understanding
of the uncertainty and errors in design flood estimation using RFFA.
The above future research suggestions may also be implemented in a hierarchical ROI
framework that includes dependence on exogenous covariates.
The following recommendations are made to further improve the LFRM method for large
to rare flood estimation:
A sensitivity analysis of the LFRM estimates to the number of selected highest
floods in each region or network should be investigated further based on a more
theoretical basis.
As a precursor to any analysis such as the LFRM, further data simulations should be
undertaken to determine the effects on estimated large flood values of any
violations of the basic assumptions on homogeneity and distribution (very useful in
this case as strict homogeneity was not satisfied and only the GEV distribution was
used (i.e. for derivation of the spatial dependence model)).
The search for a more appropriate form of the constant Ne model, or the
introduction of a variable Ne model (e.g. varying with ARI) for estimating the
effective number of independent stations, should be pursued on a more theoretical
basis.
The influence of the constant inter-site correlation assumption in the simulated data
which was also used to identify the functional form of the generalised spatial
dependence ‘constant Ne model’ should be examined more closely by using a wider
range of constant correlations. A ‘variable Ne model’ should also be examined in
this framework. As such, the use of Multivariate Copulas is recommended (e.g.
Favre et al., 2004).
The uncertainties in the LFRM should also be investigated using a Monte Carlo
simulation method. The main sources of error in the LFRM estimation are
introduced through parameter estimation errors in the constant Ne model and in the
fitting of the LFRM distribution by the mean flood and CV of annual maximum
flood series.
Further validation, analysis and testing should include deriving uncertainty
limits and comparing the LFRM estimates to those obtained from rainfall-runoff
modelling.
The steps outlined for future research on the uncertainty in regional flood estimates in the
range of 2 – 100 years ARI and the LFRM involve considerable time and effort and were
considered to be beyond the scope of this thesis.
REFERENCES
Acreman, M.C., Sinclair, C.D., 1986. Classification of drainage basins according to their physical characteristics: an application for flood frequency analysis in Scotland. J. Hydrol. 84, 365-380. Acreman, M.C., 1987. Regional flood frequency analysis in the UK: Recent research-new ideas. Rep. Inst. of Hydro. Wallingford, UK. Acreman, M.C., Wiltshire, S.E., 1987. Identification of regions for regional flood frequency analysis. EOS. 68 (44), 1262 (Abstract). Ahmad, M.I., Sinclair, C.D., Werrity, A., 1988. Log – logistic flood frequency analysis. J. Hydrol. 98, 205-224. Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. Autom. Cont. 19 (6), 716-722. Alexander, G.N., 1954. Some aspects of time series in hydrology. J. Inst. Eng. Aust. 26, 188-198. Alila, Y.P., Adamowski, K., Pilon, J., 1992. Regional homogeneity testing of low-flows using L moments. In: Proceedings of 12th conference on probability and statistics in the Atmospheric sciences, 5th International Meeting on Statistical Climatology, Toronto, Ont., 22-26 June, 1992. Anderson, H.W., 1957. Relating sediment yield to watershed variables. Trans. Am. Geophys. Union. 38, 921-924. Ashkanasy, N.M., 1985. To Bayes or not to Bayes – The future direction of statistical approaches in hydrology. Hydrology and Water Resources Symposium, 1985, Sydney, 14-16 May. Baratti, E., Montanari, A., Castellarin, A., Salinas, J.L., Viglione, A., Bezzi, A., 2012. Estimating the flood frequency distribution at seasonal and annual time scale. Hydrol. Earth Syst. Sci. 9, 7947-7967. Bates, B.C., 1994. Regionalisation of hydrological data: A review. Report 94/5. CRC for Catchment Hydrology, Monash University, Australia, pp 61. Bates, B.C., Rahman, A., Mein, R.G., Weinmann, P.E., 1998. Climatic and physical factors that influence the homogeneity of regional floods in south-eastern Australia. Water Resour. Res. 34 (12), 3369-3382. Benson, M.A., 1959. Channel slope factor in flood frequency analysis. J. Hydraul. 
Div. ASCE, 85, (HY4), 1-19. Benson, M.A., 1962. Evolution of methods for evaluating the occurrence of floods. U.S. Geol Surv. Water Supply Paper, 1580-A, 30pp.
REFERENCES
269
Benson, M.A., 1968. Uniform flood frequency estimating methods for federal agencies. Water Resour. Res. 4 (5), 981-908. Benson, M. A., Matalas, N.C., 1967. Synthetic hydrology based on regional statistical parameters. Water Resour. Res. 3 (4), 931-935. Bernier, J., 1967. Sur la thėorie du renouvellement et son application en hydrologie. Electricitė de France, Hyd 67 (10), 32. (in French) Bobėe, B., Cavidas, G., Ashkar, F., Bernier, J., Rasmussen, P., 1993. Towards a systematic approach to comparing distributions used in flood frequency analysis. J. Hydrol. 142, 121-136. Bocchiola, D., De Michele, C., Rosso, R., 2003. Review of recent advances in index flood estimation. Hydrol. Earth Syst. Sci. 7(3), 283-296. Brath, A., Castellarin, A., Montanari., A., 2003. Assessing the reliability of regional depth-duration-frequency equations for gauged and ungauged sites. Water Resour. Res. 39 (12), 1367, doi:10.1029/2003WR002399. Breiman, L., Freidman, J.H., Olsen, R.A., Stone, C., 1984. Classification and regression trees. Wadsworth: Belmont, CA. Buishand, T.A., 1984. Bivariate extreme-value data and the station-year method. J. Hydrol. 69, 77-95. Bunke, O., Droge. B., 1984. Bootstrap and cross-validation estimates of the prediction error for linear regression models. Annal. Statist. 12 (4), 1400-1424. Bureau of Meteorology, 2012. State of the Climate 2012. Australian Bureau of Meteorology and CSIRO. Burman, P.A., 1989. A comparative study of ordinary cross validation, v-fold cross-validation and repeated learning-tested methods. Biometrika. 76, 503-514. Burn, D.H., 1990a. An appraisal of the “region of influence” approach to flood frequency analysis. Hydrol. Sci. J. 35 (2), 149-165. Burn, D.H., 1990b. Evaluation of regional flood frequency analysis with a region of influence approach. Water Resour. Res. 26 (10), 2257-2265. Calenda G., Mancini C.P., Volpi, E., 2009. Selection of the probabilistic model of extreme floods: The case of the River Tiber in Rome. J Hydrol. 27, 1-11. 
Casella, G., George, E.I., 1992. Explaining the Gibbs sampler. Amer. Statist. Assoc. 46 (3), 167-174. Castellarin, A., 2007. Probabilistic envelope curves for design flood estimation at ungauged sites, Water Resour. Res. 43, W04406, doi:10.1029/2005WR004384. Castellarin, A., Vogel, R.M., Matalas, N. C., 2005. Probabilistic behaviour of a regional envelope curve. Water Resour. Res. 41, W06018, doi:10.1029/2004WR003042.
REFERENCES
Castellarin, A., Vogel, R.M., Matalas, N.C., 2007. Multivariate probabilistic regional envelopes of extreme floods. J. Hydrol. 336, 376-390.
Castellarin, A., Merz, R., Blöschl, G., 2009. Probabilistic envelope curves for extreme rainfall events. J. Hydrol. 378, 263-271.
Castiglioni, S., Castellarin, A., Montanari, A., 2009. Prediction of low-flow indices in ungauged basins through physiographical space-based interpolation. J. Hydrol. 378, 272-280, doi:10.1016/j.jhydrol.2009.09.032.
Chebana, F., Ouarda, T.B.M.J., 2008. Depth and homogeneity in regional flood frequency analysis. Water Resour. Res. 44 (11), W11422, doi:10.1029/2007WR006771.
Chow, V.T., Maidment, D.R., Mays, L.W., 1988. Applied Hydrology. McGraw-Hill, USA.
Chowdhury, J.U., Stedinger, J.R., Lu, L.-H., 1991. Goodness of fit tests for regional flood distributions. Water Resour. Res. 27 (7), 1765-1776.
Chowdhury, S., Sharma, A., 2009. Multi-site seasonal forecast of arid river flows using a dynamic model combination approach. Water Resour. Res. 45, W10428, doi:10.1029/2008WR007510.
Cohn, T.A., Lane, W.L., Baier, W.G., 1997. An algorithm for computing moments based flood quantile estimates when historical flood information is available. Water Resour. Res. 33 (9), 2089-2096.
Coles, S., 2001. An introduction to statistical modelling of extreme values. Springer, London.
Congdon, P., 2001. Bayesian statistical modelling. John Wiley & Sons, West Sussex.
Cooley, D., Naveau, P., Poncet, P., 2006. Variograms for spatial max-stable random fields. In: Statistics for Dependent Data, Lecture Notes in Statistics, Springer, doi:10.1007/0-387-36062-X_17.
Cooley, D., Davis, R., Naveau, P., 2010. The pairwise beta distribution: A flexible parametric multivariate model for extremes. J. Multivar. Anal. 101, 2103-2117.
Cunderlik, J.M., Burn, D.H., 2003. Non-stationary pooled flood frequency analysis. J. Hydrol. 276, 210-223.
Cunnane, C., 1988. Methods and merits of regional flood frequency analysis. J. Hydrol. 100, 269-290.
Cunnane, C., 1989. Statistical Distributions for Flood Frequency Analysis. World Meteorological Organisation, Operational Hydrology Report No. 33.
D'Agostino, R.B., Stephens, M.A., 1986. Goodness-of-Fit Techniques. Marcel Dekker, New York.
Dales, M.Y., Reed, D.W., 1989. Regional flood and storm hazard assessment. Rep. No. 2, Institute of Hydrology, Wallingford, Oxon, UK.
Dalrymple, T., 1960. Flood frequency analysis. Water Supply Paper 1543-A, U.S. Geological Survey, Reston, VA.
Dawdy, D.R., 1961. Variation of flood ratios with size of drainage area. U.S. Geol. Surv. Prof. Pap. 424-C, Paper C36.
Dawdy, D.R., Griffis, V.W., Gupta, V.K., 2012. Regional flood-frequency analysis: How we got here and where we are going. J. Hydrol. Eng. 17, 953-959.
Di Baldassarre, G., Castellarin, A., Brath, A., 2006. Relationships between statistics of rainfall extremes and mean annual precipitation: an application for design-storm estimation in northern Italy. Hydrol. Earth Syst. Sci. 10, 589-601.
Douglas, E.M., Vogel, R.M., 2006. The probabilistic behaviour of the flood of record in the United States. J. Hydrol. Eng. 11 (5), 482-488.
Draper, N.R., Smith, H., 1981. Applied regression analysis, 2nd ed. John Wiley, New York.
Dymond, J.R., Christian, R., 1982. Accuracy of discharge determined from a rating curve. Hydrol. Sci. J. 27, 493-504.
Efron, B., 1983. Estimating the error rate of a prediction rule: Some improvements on cross-validation. J. Amer. Stat. Assoc. 78, 316-331.
Efron, B., 1986. How biased is the apparent error rate of the prediction rule? J. Amer. Stat. Assoc. 81, 461-470.
El Adlouni, S., Bobée, B., Ouarda, T.B.M.J., 2008. On the tails of extreme event distributions in hydrology. J. Hydrol. 355, 16-33.
Eng, K., Tasker, G.D., Milly, P.C.D., 2005. An analysis of Region-of-Influence methods for flood frequency regionalisation in the Gulf-Atlantic rolling plains. J. Am. Water Resour. Assoc. 41 (1), 135-143.
Eng, K., Milly, P.C.D., Tasker, G.D., 2007a. Flood regionalization: A hybrid geographic and predictor-variable Region-of-Influence regression method. J. Hydrol. Eng. 12 (6), 585-591.
Eng, K., Stedinger, J.R., Gruber, A.M., 2007b. Regionalisation of streamflow characteristics for the Gulf-Atlantic Rolling Plains using Leverage-Guided Region-of-Influence regression. In: EWRI World Water & Environmental Resources Congress, American Society of Civil Engineers, 2007.
Faber, K., Kowalski, B.R., 1997. Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares. J. Chemom. 11, 181-238.
Favre, A.C., El Adlouni, S., Perreault, L., Thiémonge, N., Bobée, B., 2004. Multivariate hydrological frequency analysis using copulas. Water Resour. Res. 40 (1), W01101.
Feaster, T.D., Tasker, G.D., 2002. Techniques for estimating the magnitude and frequency of floods in rural basins of South Carolina, 1999. Water Resources Investigations Report 02-4140, U.S. Geological Survey, Columbia, South Carolina.
Ferrari, E., Gabriele, S., Villani, P., 1993. Combined regional frequency analysis of extreme rainfalls and floods. In: Extreme Hydrological Events: Precipitation, Floods and Droughts (Proc. Yokohama Symposium, July 1993), IAHS Publ. No. 213.
Fill, H.D., Stedinger, J.R., 1995a. L moment and PPCC goodness-of-fit tests for the Gumbel distribution and effect of autocorrelation. Water Resour. Res. 31 (1), 225-229.
Fill, H.D., Stedinger, J.R., 1995b. Homogeneity tests based upon Gumbel distribution and a critical appraisal of Dalrymple's test. J. Hydrol. 166, 81-105.
Fill, H.D., Stedinger, J.R., 1998. Using regional regression within index flood procedures and an empirical Bayesian estimator. J. Hydrol. 210, 128-145.
Flavell, D.J., 1982. The rational method applied to small rural catchments in the south west of Western Australia. Hydrology and Water Resour. Symp., 49-53.
Flavell, D.J., 1985. Australian Rainfall and Runoff revision. Civil College Tech. Report, Engineers Australia, 6 Sep 1985, pp. 1-4.
Flavell, D.J., Belstead, B.S., 1986. Losses for design flood estimation in Western Australia. Hydrology and Water Resour. Symp.
Fortin, V., Bernier, J., Bobée, B., 1997. Simulation, Bayes, and bootstrap in statistical hydrology. Water Resour. Res. 33 (3), 439-448, doi:10.1029/96WR03355.
Franks, S.W., Kuczera, G., 2002. Flood frequency analysis: evidence and implications of secular climate variability, New South Wales. Water Resour. Res. 38 (5), 1062, doi:10.1029/2001WR000232.
French, R., 2002. Flaws in the rational method. 27th National Hydrology and Water Resources Symp., 20-23 May, Melbourne.
Gaál, L., Kyselý, J., Szolgay, J., 2008. Region-of-influence approach to a frequency analysis of heavy precipitation in Slovakia. Hydrol. Earth Syst. Sci. 12, 825-839.
Galea, G., Michel, C., Oberlin, G., 1983. Maximal rainfall on a surface – the epicentre coefficient of 1 to 48-hour rainfall. J. Hydrol. 66, 159-167.
Gamble, S.K., Turner, K., Smythe, C., 1998. Application of the focussed rainfall growth estimation technique in Tasmania. Hydro Electric Corporation, Tasmania, Internal Report.
Gaume, E., Gaál, L., Viglione, A., Szolgay, J., Kohnová, S., Blöschl, G., 2010. Bayesian MCMC approach to regional flood frequency analyses involving extraordinary flood events at ungauged sites. J. Hydrol. 394 (1-2), 101-117, doi:10.1016/j.jhydrol.2010.01.008.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721-741.
Greene, W.H., 2003. Econometric Analysis. Prentice Hall, NJ.
Griffis, V.W., Stedinger, J.R., 2004. LP3 flood quantile estimators using at-site and regional information. In: Critical Transitions in Water and Environmental Resources Management, Proc. World Water & Environmental Resources Congress, Salt Lake City, Utah, June 27 - July 1, 2004, edited by G. Sehlke, D.F. Hayes and D.K. Stevens, ASCE, Reston, Virginia.
Griffis, V.W., Stedinger, J.R., 2007. The use of GLS regression in regional hydrologic analyses. J. Hydrol. 344, 82-95.
Grubbs, F.E., Beck, G., 1972. Extension of sample sizes and percentage points for significance tests of outlying observations. Technometrics 14 (4), 847-853.
Gruber, A.M., Stedinger, J.R., 2007. Models of regional skew based on Bayesian GLS regression. In: World Environmental & Water Resources Congress, Tampa, Florida, May 15-18, 2007.
Gruber, A.M., Stedinger, J.R., 2008. Models of LP3 regional skew, data selection, and Bayesian GLS regression. In: EWRI World Water & Environmental Resources Congress, American Society of Civil Engineers, Honolulu, HI, May 13-16, 2008.
Guse, B., Castellarin, A., Thieken, A.H., Merz, B., 2009. Effects of intersite dependence of nested catchment structures on probabilistic regional envelope curves. Hydrol. Earth Syst. Sci. 6, 2845-2892.
Guttman, N.B., 1993. The use of L-moments in the determination of regional precipitation climates. J. Clim. 6, 2309-2325.
Hackelbusch, A., Micevski, T., Kuczera, G., Rahman, A., Haddad, K., 2009. Regional flood frequency analysis for eastern NSW: A region of influence approach using generalised least squares log-Pearson 3 parameter regression. In: 32nd Hydrology and Water Resources Symp., Newcastle, 30 Nov - 3 Dec 2009.
Haddad, K., 2008. Design Flood Estimation in Ungauged Catchments Using a Quantile Regression Technique: Ordinary and Generalised Least Squares Methods Compared for Victoria. Masters (Honours) thesis, School of Engineering, University of Western Sydney, New South Wales.
Haddad, K., Rahman, A., 2008. Investigation on at-site flood frequency analysis in south-east Australia. IEM Journal, The J. Inst. Eng., Malaysia, 69 (3), 59-64.
Haddad, K., Rahman, A., Weinmann, P.E., Kuczera, G., Ball, J.E., 2010a. Streamflow data preparation for regional flood frequency analysis: Lessons from south-east Australia. Aust. J. Water Resour. 14 (1), 17-32.
Haddad, K., Zaman, M., Rahman, A., 2010b. Regionalisation of skew for flood frequency analysis: a case study for eastern NSW. Aust. J. Water Resour. 14 (1), 33-41.
Haddad, K., Rahman, A., 2011. Selection of the best fit flood frequency distribution and parameter estimation procedure: a case study for Tasmania in Australia. Stoch. Env. Res. Risk A. 25 (3), 415-428, doi:10.1007/s00477-010-0412-1.
Haddad, K., Rahman, A., Green, J., 2011a. Design rainfall estimation in Australia: A case study using L moments and generalized least squares regression. Stoch. Env. Res. Risk A. 25 (6), 815-825, doi:10.1007/s00477-010-0443-7.
Haddad, K., Rahman, A., Weinmann, P.E., 2011b. Estimation of major floods: applicability of a simple probabilistic model. Aust. J. Water Resour. 14 (2), 117-126.
Haddad, K., Rahman, A., Kuczera, G., 2011c. Comparison of ordinary and generalised least squares regression models in regional flood frequency analysis: A case study for New South Wales. Aust. J. Water Resour. 15 (2), 1-12.
Haddad, K., Rahman, A., 2012. Regional flood frequency analysis in eastern Australia: Bayesian GLS regression-based methods within fixed region and ROI framework: Quantile regression vs. parameter regression technique. J. Hydrol. 430-431, 142-161.
Haddad, K., Rahman, A., Stedinger, J.R., 2012. Regional flood frequency analysis using Bayesian generalized least squares: a comparison between quantile and parameter regression techniques. Hydrol. Process. 26, 1008-1021, doi:10.1002/hyp.8189.
Hardison, C.H., 1971. Prediction error of regression estimates of streamflow characteristics at ungauged sites. U.S. Geol. Surv. Prof. Pap. 750-C, C228-C236.
Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.
Hewa, G.A., McMahon, T.A., Peel, M.C., Nathan, R.J., 2003. Identification of the most appropriate regression procedure to regionalise extreme low flows. 28th Intl. Hydrology and Water Resour. Symp., 10-13 Nov 2003.
Hosking, J.R.M., 1990. L moments: analysis and estimation of distributions using linear combinations of order statistics. J. R. Statist. Soc. Ser. B 52 (1), 105-124.
Hosking, J.R.M., Wallis, J.R., 1988. The effect of intersite dependence on regional flood frequency analysis. Water Resour. Res. 24 (4), 588-600, doi:10.1029/WR024i004p00588.
Hosking, J.R.M., Wallis, J.R., 1991. Some statistics useful in regional frequency analysis. IBM Math. Res. Rep. RC 17096, IBM T.J. Watson Research Center, Yorktown Heights, N.Y., 23 pp.
Hosking, J.R.M., Wallis, J.R., 1993. Some statistics useful in regional frequency analysis. Water Resour. Res. 29 (2), 271-281.
Hosking, J.R.M., Wallis, J.R., 1997. Regional frequency analysis: an approach based on L-moments. Cambridge University Press, New York.
Hosking, J.R.M., Wallis, J.R., Wood, E.F., 1985. An appraisal of the regional flood frequency procedure in the UK Flood Studies Report. Hydrol. Sci. J. 30, 85-109.
Houghton, J.C., 1978. Birth of a parent: The Wakeby distribution for modelling flood flows. Water Resour. Res. 14 (6), 1105-1115.
Iacobellis, V., Gioia, A., Manfreda, S., Fiorentino, M., 2011. Flood quantiles estimation based on theoretically derived distributions: regional analysis in Southern Italy. Nat. Hazards Earth Syst. Sci. 11, 673-695, doi:10.5194/nhess-11-673-2011.
Institution of Engineers Australia (I.E. Aust.), 1987, 2001. Australian Rainfall and Runoff: A Guide to Flood Estimation. Edited by D.H. Pilgrim, Vol. 1, I.E. Aust., Canberra.
Interagency Advisory Committee on Water Data (IACWD), 1982. Guidelines for Determining Flood Flow Frequency: Bulletin 17-B (revised and corrected). Hydrol. Subcomm., Washington, DC, March 1982, 28 pp.
IPCC, 2007. The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC).
Ishak, E., Haddad, K., Zaman, M., Rahman, A., 2011. Scaling property of regional floods in New South Wales, Australia. Nat. Haz. J., doi:10.1007/s11069-011-9719-6.
Ishak, E., Rahman, A., Westra, S., Sharma, A., Kuczera, G., 2013. Evaluating the non-stationarity of Australian annual maximum flood. J. Hydrol. Accepted.
Jennings, M.E., Thomas, W.O., Jr., Riggs, H.C., 1994. Nationwide summary of U.S. Geological Survey regional regression estimates for estimating magnitude and frequency of floods for ungaged sites. Water Resources Investigations Report 94-4002, U.S. Geological Survey, Reston, Virginia.
Jin, M., Stedinger, J.R., 1989. Flood frequency analysis with regional and historical information. Water Resour. Res. 25 (5), 925-936.
Johnston, J., 1972. Econometric Methods. McGraw-Hill, New York.
Jothityangkoon, C., Sivapalan, M., 2003. Towards estimation of extreme floods: examination of the roles of runoff process changes and floodplain flows. J. Hydrol. 281, 206-229.
Juckem, P.F., Hunt, R.J., Anderson, M.P., Robertson, D.M., 2008. Effects of climate and land management change on streamflow in the driftless area of Wisconsin. J. Hydrol. 355, 123-130.
Juraj, M., Ouarda, T.B.M.J., 2007. Regional flood-rainfall duration-frequency modeling of small ungaged sites. J. Hydrol. 345, 61-69, doi:10.1016/j.jhydrol.2007.07.011.
Katz, R.W., Parlange, M.B., Naveau, P., 2002. Statistics of extremes in hydrology. Adv. Water Resour. 25, 1287-1304.
Kendall, M.G., 1970. Rank correlation methods, 4th ed. Griffin, London, 202 pp.
Khaliq, M.N., Ouarda, T.B.M.J., Ondo, J.-C., Gachon, P., Bobée, B., 2006. Frequency analysis of a sequence of dependent and/or non-stationary hydro-meteorological observations: A review. J. Hydrol. 329 (3-4), 534-552.
Kidson, R., Richards, K.S., 2005. Flood frequency analysis: assumptions and alternatives. Prog. Phys. Geog. 29 (3), 392-410, doi:10.1191/0309133305pp454ra.
Kirby, W., 1972. Computer oriented Wilson-Hilferty transformation that preserves the first 3 moments and lower bound of the Pearson Type 3 distribution. Water Resour. Res. 10 (2), 220-222.
Kitanidis, P.K., 1986. Parameter uncertainty in estimation of spatial functions: Bayesian analysis. Water Resour. Res. 22 (4), 499-507.
Kjeldsen, T.R., Rosbjerg, D., 2002. Comparison of regional index flood estimation procedures based on the extreme value type I distribution. Stoch. Env. Res. Risk A. 16, 358-373.
Kjeldsen, T.R., Jones, D.A., Bayliss, A.C., 2008. Improving the FEH Statistical Procedures for Flood Frequency Estimation. Final Research Report to the Environment Agency, R&D Project SC050050, CEH Wallingford, UK.
Kjeldsen, T.R., Jones, D.A., 2009a. An exploratory analysis of error components in hydrological regression modelling. Water Resour. Res. 45, W02407, doi:10.1029/2007WR006283.
Kjeldsen, T.R., Jones, D.A., 2009b. A formal statistical model for pooled analysis of extreme floods. J. Hydrol. Res. 40 (5), 465-480, doi:10.2166/nh.2009.055.
Kjeldsen, T.R., 2010. Modelling the impact of urbanization on flood frequency relationships in the UK. J. Hydrol. Res. 41 (5), 391-405, doi:10.2166/nh.2010.056.
Koop, G., 2008. Introduction to Econometrics. John Wiley & Sons, West Sussex, England.
Kroll, C.N., Stedinger, J.R., 1999. Development of regional regression relationships with censored data. Water Resour. Res. 35 (3), 775-784.
Kuczera, G., 1982. Combining site-specific and regional information: An empirical Bayes approach. Water Resour. Res. 18 (2), 306-314.
Kuczera, G., 1982a. Robust flood frequency models. Water Resour. Res. 18 (2), 315-324.
Kuczera, G., 1983a. A Bayesian surrogate for regional skew in flood frequency analysis. Water Resour. Res. 19 (3), 821-832.
Kuczera, G., 1983b. Effect of sampling uncertainty and spatial correlation on an empirical Bayes procedure for combining site and regional information. J. Hydrol. 65, 373-398.
Kuczera, G., 1992. Uncorrelated measurement error in flood frequency inference. Water Resour. Res. 28, 183-189.
Kuczera, G., 1996. Correlated measurement error in flood frequency inference. Water Resour. Res. 32, 2119-2128.
Kuczera, G., 1999a. Comprehensive at-site flood frequency analysis using Monte Carlo Bayesian inference. Water Resour. Res. 35 (5), 1551-1557.
Kuczera, G., 1999b. FLIKE HELP, Chapter 2 FLIKE Notes, University of Newcastle.
Kuczera, G., Parent, E., 1998. Monte Carlo assessment of parameter uncertainty in conceptual catchment models: the Metropolis algorithm. J. Hydrol. 211 (1-4), 69-85.
Kuczera, G., Franks, S., 2005. At-site flood frequency analysis. Australian Rainfall and Runoff, Book IV, Draft Chapter 2.
Kundzewicz, Z.W., Rosbjerg, D., Simonovic, S.P., Takeuchi, K., 1993. Extreme hydrological events in perspective. In: Extreme Hydrological Events: Precipitation, Floods and Droughts (Proc. Yokohama Symposium, July 1993), IAHS Publ. No. 213.
Laaha, G., Blöschl, G., 2007. A national low flow estimation procedure for Austria. Hydrol. Sci. J. 52 (4), 625-644, doi:10.1623/hysj.52.4.625.
Laio, F., 2004. Cramer-von Mises and Anderson-Darling goodness of fit tests for extreme value distributions with unknown parameters. Water Resour. Res. 40, W09308, doi:10.1029/2004WR003204.
Laio, F., Di Baldassarre, G., Montanari, A., 2009. Model selection techniques for the frequency analysis of hydrological extremes. Water Resour. Res. 45, W07416, doi:10.1029/2007WR006666.
Lamontagne, J., Stedinger, J.R., Ferris, J., Knifong, D., Veilleux, A., Curry, D., 2011. Regional skews for 1-day, 3-day, 7-day, 15-day, and 30-day duration discharge for the Central Valley region of California. Report Series XXXX-XXXX, U.S. Geological Survey (in press).
Law, G., Tasker, G.D., 2003. Flood-frequency prediction methods for unregulated streams of Tennessee, 2000. U.S. Geological Survey Water-Resources Investigations Report 03-4176.
Leadbetter, M.R., Lindgren, G., Rootzén, H., 1983. Extremes and related properties of random sequences and processes. Springer, New York.
Leclerc, M., Ouarda, T.B.M.J., 2007. Non-stationary regional flood frequency analysis at ungauged sites. J. Hydrol. 343, 254-265.
Lim, Y.H., Voeller, D.L., 2009. Regional flood estimations in Red River using L-moment-based index-flood and Bulletin 17B procedures. J. Hydrol. Eng. 14, 1002-1016.
Lu, L.-H., Stedinger, J.R., 1992. Sampling variance of normalized GEV/PWM quantile estimators and a regional homogeneity test. J. Hydrol. 138 (1-2), 223-245.
Ludwig, A.H., Tasker, G.D., 1993. Regionalization of low-flow characteristics of Arkansas streams. U.S. Geological Survey Water-Resources Investigations Report 93-4013.
Madsen, H., Rosbjerg, D., Harremoes, P., 1995. Application of the Bayesian approach in regional analysis of extreme rainfalls. Stoch. Hydrol. Hydraul. 9, 77-88.
Madsen, H., Rosbjerg, D., 1997. Generalised least squares and empirical Bayes estimation in regional partial duration series index-flood modelling. Water Resour. Res. 33 (4), 771-782.
Madsen, H., Pearson, C.P., Rosbjerg, D., 1997. Comparison of annual maximum series and partial duration series for modelling extreme hydrologic events, 2, Regional modelling. Water Resour. Res. 33 (4), 759-769.
Madsen, H., Mikkelsen, P.S., Rosbjerg, D., Harremoes, P., 2002. Regional estimation of rainfall intensity duration curves using generalised least squares regression of partial duration series statistics. Water Resour. Res. 38 (11), 1-11.
Madsen, H., Arnbjerg-Nielsen, K., Mikkelsen, P.S., 2009. Update of regional intensity-duration-frequency curves in Denmark: Tendency towards increased storm intensities. Atmos. Res. 92, 343-349.
Majone, U., Tomirotti, M., 2004. A trans-national regional frequency analysis of peak flood flows. L'Acqua, 2/2004, 9-17.
Majone, U., Tomirotti, M., Galimberti, G., 2007. A probabilistic model for the estimation of peak flood flows. Special Session 10, 32nd Congress of IAHR, Venice, Italy, July 1-6.
Marin, C., 1983. Uncertainty in water resources planning. PhD thesis, Harvard Univ., Cambridge, Mass.
Martens, H., Martens, M., 2001. Multivariate analysis of quality: An introduction. John Wiley & Sons, Chichester.
Martins, E.S., Stedinger, J.R., 2000. Generalized maximum likelihood GEV quantile estimators for hydrologic data. Water Resour. Res. 36 (3), 737-744.
Martins, E.S., Stedinger, J.R., 2001. Historical information in a GMLE-GEV framework with partial duration and annual maximum series. Water Resour. Res. 37 (10), 2551-2557.
Martins, E.S., Stedinger, J.R., 2002a. Cross-correlation among estimators of shape. Water Resour. Res. 38 (11), doi:10.1029/2002WR001589.
Martins, E.S., Stedinger, J.R., 2002b. Efficient regional estimates of LP3 skew using GLS regression. In: Proc. ASCE Conference on Water Resources Planning and Management, May 19-22.
Matalas, N.C., 1967. Mathematical assessment of synthetic hydrology. Water Resour. Res. 3 (4), 937-945.
Matalas, N.C., Benson, M.A., 1961. Effect of interstation correlation on regression analysis. J. Geophys. Res. 66 (10), 3285-3293.
Matalas, N.C., Gilroy, E.J., 1968. Some comments on regionalization in hydrologic studies. Water Resour. Res. 4 (6), 1361-1369.
McConachy, F.L.N., Xuereb, K., Smythe, C.J., Gamble, S.K., 2003. Homogeneity of rare to extreme rainfalls over Tasmania. 28th International Hydrology and Water Resour. Symp., Wollongong, The Institution of Engineers, Australia.
McCuen, R.H., 1979. Map skew??? J. Water Resour. Plan. Manage. Div., ASCE, 105 (WR2), 265-277 [with Closure, 107 (WR2), 582, 1981].
McCuen, R., Hromadka, T., 1988. Flood skew in hydrologic design on ungaged watersheds. J. Irrig. Drain. Eng. 114 (2).
McGilchrist, C.A., Woodyer, K.D., 1975. Note on a distribution-free CUSUM technique. Technometrics 17 (3), 321-325.
Merz, R., Blöschl, G., 2005. Flood frequency regionalisation – spatial proximity vs. catchment attributes. J. Hydrol. 302, 283-306.
Metropolis, N., Rosenbluth, A.W., Teller, A.H., Teller, E., 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1092.
Micevski, T., Franks, S.W., Kuczera, G., 2006. Multidecadal variability in coastal eastern Australian flood data. J. Hydrol. 327, 219-225.
Micevski, T., Kuczera, G., 2009. Combining site and regional flood information using a Bayesian Monte Carlo approach. Water Resour. Res. 45, W04405, doi:10.1029/2008WR007173.
Michaelsen, J., 1987. Cross-validation in statistical climate forecast models. J. Climate Appl. Meteor. 26, 1589-1600.
Moisello, U., 2007. On the use of partial probability weighted moments in the analysis of hydrological extremes. Hydrol. Process. 21, 1265-1279, doi:10.1002/hyp.6310.
Moss, M.E., Karlinger, M.R., 1974. Surface water network design by regression simulation. Water Resour. Res. 10 (3), 427-433.
Moss, M.E., Tasker, G.D., 1991. An intercomparison of hydrological network-design technologies. Hydrol. Sci. J. 36 (3), 209.
Mosteller, F., Tukey, J.W., 1977. Data analysis and regression: A second course in statistics. Addison-Wesley, Reading, Mass.
Mulvany, T.J., 1851. On the use of self registering rain and flood gauges in making observations of the relation of rainfall and of flood discharge in a given catchment. Trans. ICE Ire. 4, 18-31.
Nandakumar, N., Weinmann, P.E., Mein, R.G., Nathan, R.J., 1997. Estimation of extreme rainfalls for Victoria using the CRC-FORGE method. Report 97/4, Monash University.
Nandakumar, N., Weinmann, P.E., Mein, R.G., Nathan, R.J., 2000. Estimation of spatial dependence for the CRC-FORGE method. In: Proc. 'Hydro 2000' – 3rd International Hydrology and Water Resources Symposium, Perth, Inst. of Engineers, Australia, pp. 553-557.
Nathan, R.J., McMahon, T.A., 1990. Identification of homogeneous regions for the purpose of regionalisation. J. Hydrol. 121 (4), 217-238.
Nathan, R.J., Weinmann, P.E., 2001. The estimation of extreme floods – the need and scope for revision of our national guidelines. Aust. J. Water Eng. 1 (1), 40-50.
National Research Council, 1988. Estimating Probabilities of Extreme Floods: Methods and Recommended Research. National Academy Press, Washington, D.C., 141 pp.
Natural Environment Research Council (NERC), 1975. Flood Studies Report. NERC, London.
Ng, W.W., Panu, U.S., Lennox, W.C., 2007. Chaos based analytical techniques for daily extreme hydrological observations. J. Hydrol. 342, 17-41.
Novotny, E.V., Stefan, H.G., 2007. Stream flow in Minnesota: Indicator of climate change. J. Hydrol. 334, 319-333.
O'Connell, D.R.H., Ostenaa, D.A., Levish, D.R., Klinger, R.E., 2002. Bayesian flood frequency analysis with paleohydrologic bound data. Water Resour. Res. 38 (5), 16-1 to 16-4.
Olsen, J.R., Lambert, J.H., Haimes, Y.Y., 1999. Risk of extreme events under nonstationary conditions. Risk Analysis 18 (4), 497-510.
Oncirculation, 2011. http://oncirculation.com/2012/05/22/20102011
Overeem, A., Buishand, A., Holleman, I., 2009. Rainfall depth-duration frequency curves and their uncertainties. J. Hydrol. 348, 124-134.
Pandey, G.R., Nguyen, V.T.V., 1999. A comparative study of regression based methods in regional flood frequency analysis. J. Hydrol. 225, 92-101.
Parrett, C., Veilleux, A., Stedinger, J.R., Barth, N.A., Knifong, D., Ferris, J.C., 2010. Regional skew for California and flood frequency for selected sites in the Sacramento-San Joaquin River Basin based on data through water year 2006. OFR XXXX, U.S. Geological Survey (in press).
Pearson, C.P., 1991. New Zealand regional flood frequency analysis using L moments. J. Hydrol. New Zealand 30 (2), 53-64.
Pasquini, A.I., Depetris, P.J., 2007. Discharge trends and flow dynamics of South American rivers draining the southern Atlantic seaboard: An overview. J. Hydrol. 333, 385-399.
Pegram, G., 2002. Rainfall, rational formula and regional maximum flood – some scaling links. 27th National Hydrology and Water Resources Symp., 20-23 May, Melbourne.
Pericchi, L.R., Rodriguez-Iturbe, I., 1983. On some problems in Bayesian model choice in hydrology. The Statistician 32, 273-278.
Petersen-Øverleir, A., Reitan, T., 2009. Accounting for rating curve imprecision in flood frequency analysis using likelihood-based methods. J. Hydrol. 366, 89-100.
Picard, R.R., Cook, R.D., 1984. Cross-validation of regression models. J. Amer. Stat. Assoc. 79, 575-583.
Pilgrim, D.H., 1986. Bridging the gap between flood research and design practice. Water Resour. Res. 22 (9), 165S-176S.
Pilgrim, D.H., 1986. Estimation of large and extreme floods. Civil Eng. Trans., Institute of Engineers Australia, CE28, 62-73.
Pilgrim, D.H., Rowbottom, I.A., 1987. Chapter 13 – Estimation of large and extreme floods. In: Pilgrim, D.H. (ed.), Australian Rainfall and Runoff: A Guide to Flood Estimation, I.E. Aust., Canberra.
Pilgrim, D.H., Cordery, I., 1993. Flood runoff. In: Handbook of Hydrology, Chapter 9, edited by D.R. Maidment, McGraw-Hill, N.Y.
Pilon, P.J., Adamowski, K., 1991. Asymptotic variance of flood quantile in log Pearson Type III distribution with historical information. J. Hydrol. 143, 481-503.
Pilon, P.J., Adamowski, K., 1992. The value of regional information to flood frequency analysis using the method of L-moments. Can. J. Civ. Eng. 19 (1), 137-147.
Potter, K.W., Walker, J.F., 1981. A model of discontinuous measurement error and its effects on the probability distribution of flood discharge measurements. Water Resour. Res. 17 (5), 1505-1509.
Potter, K.W., Lettenmaier, D.P., 1990. A comparison of regional flood frequency estimation methods using a resampling method. Water Resour. Res. 26 (3), 415-424.
Prudhomme, C., Jakob, D., Svensson, C., 2003. Uncertainty and climate change impact on the flood regime of small UK catchments. J. Hydrol. 277, 1-23.
Pui, A., Lal, A., Sharma, A., 2011. How does the Interdecadal Pacific Oscillation affect design floods in Australia? Water Resour. Res. 47 (5), doi:10.1029/2010WR009420.
Racine, J., 2000. Consistent cross-validatory model selection for dependent data: hv-block cross-validation. J. Econ. 99, 39-61.
Rahman, A., 1997. Flood estimation for ungauged catchments: A regional approach using flood and catchment characteristics. PhD thesis, Department of Civil Engineering, Monash University.
Rahman, A., Bates, B.C., Mein, R.G., Weinmann, P.E., 1999a. Regional flood frequency analysis for ungauged basins in south-eastern Australia. Aust. J. Water Resour. 3 (2), 199-207.
Rahman, A., Weinmann, P.E., Mein, R.G., 1999b. At-site flood frequency analysis: LP3-product moment, GEV-L moment and GEV-LH moment procedures compared. In: Proc. 2nd Intl. Conference on Water Resour. and Env. Research, I.E. Aust., 6-8 July 1999, 2, pp. 715-720.
Rahman, A., Hollerbach, D., 2003. Study of runoff coefficients associated with the Probabilistic Rational Method for flood estimation in South-east Australia. In: Proc. 28th Hydrology and Water Resources Symp., 10-13 Nov, Wollongong, pp. 199-203.
Rahman, A., Haddad, K., Kuczera, G., Weinmann, P.E., 2009. Regional flood methods for Australia: data preparation and exploratory analysis. Australian Rainfall and Runoff Revision Projects, Project 5 Regional Flood Methods, Stage I Report No. P5/S1/003, Nov 2009, Engineers Australia, Water Engineering, 181 pp.
Rahman, A., Haddad, K., Zaman, M., Ishak, E., Kuczera, G., Weinmann, P.E., 2011a. Regional flood methods, Stage II, Project 5 Report, School of Engineering, University of Western Sydney, Australia.
Rahman, A., Haddad, K., Zaman, M., Kuczera, G., Weinmann, P.E., 2011b. Design flood estimation in ungauged catchments: A comparison between the Probabilistic Rational Method and Quantile Regression Technique for NSW. Aust. J. Water Resour. 14 (2), 127-140.
Rao, C.R., Toutenburg, H., 1999. Linear models: Least squares and alternatives. Springer-Verlag, New York.
Rao, R.A., Hamed, K., 2000. Flood frequency analysis. CRC Press LLC, 2000 NW Corporate Blvd., Boca Raton, Florida.
Reich, B.J., Shaby, B.A., 2012. A hierarchical max-stable spatial model for extreme precipitation. Ann. Appl. Stat. Accepted.
Reis Jr., D.S., Stedinger, J.R., Martins, E.S., 2003. Bayesian GLS regression with application to LP3 regional skew estimation. In: Proc. World Water & Environmental Resources Congress 2003, edited by P. Bizier and P. DeBarry, Philadelphia, PA, American Society of Civil Engineers, June 23-26, 2003.
Reis Jr., D.S., 2005. Flood frequency analysis employing Bayesian regional regression and imperfect historical information. PhD thesis, Cornell University, 210 pp.
Reis Jr., D.S., Stedinger, J.R., 2005. Bayesian MCMC flood frequency analysis with historical information. J. Hydrol. 313, 97-116.
Reis Jr., D.S., Stedinger, J.R., Martins, E.S., 2005. Bayesian GLS regression with application to LP3 regional skew estimation. Water Resour. Res. 41, W10419, doi:10.1029/2004WR00344.
Reitan, T., Petersen-Øverleir, A., 2008. Bayesian power-law regression with a location parameter, with applications for construction of discharge rating curves. Stoch. Env. Res. Risk A. 22, 351-365.
Rencher, A.C., 2000. Linear models in statistics. Wiley Series in Probability and Statistics, John Wiley & Sons, Inc.
Riggs, H.C., 1973. Regional analyses of streamflow techniques. Techniques of Water Resources Investigations of the U.S. Geol. Surv., Book 4, Chapter B3, U.S. Geol. Surv., Washington D.C.
Robson, A.J., Reed, D.W., 1999. Flood estimation handbook Vol 3: Statistical procedures for flood frequency estimation. Institute of Hydrology, Wallingford, United Kingdom.
Rosbjerg, D., Madsen, H., 1994. Uncertainty measures of regional flood frequency analysis estimators. J. Hydrol. 167, 209-224.
Rosbjerg, D., 2007. Regional flood frequency analysis. In: Hydrological events: New concepts for security. Springer Netherlands, doi:10.1007/978-1-4020-5741-0_12.
Rossi, F., Fiorentino, M., Versace, P., 1984. Two-component extreme value distribution for flood frequency analysis. Water Resour. Res. 20 (7), 847-856.
Rosso, R., 1985. A linear approach to the influence of discharge measurement error on flood estimates. Hydrol. Sci. J. 30, 137-254.
Rowbottom, I.A., Pilgrim, D.H., Wright, G.L., 1986. Estimation of rare floods (between the probable maximum flood and the 1 in 100 flood). Civil Eng. Trans., Institute of Engineers Australia, CE28, 92-105.
Salas, J.D., Wold, E.E., Jarrett, R.D., 1994. Determination of flood characteristics using systematic, historical and paleoflood data. In: G. Rossi et al. (eds), Coping with Floods, 111-134, Kluwer Academic Publishers, Netherlands.
Sankarasubramanian, A., Lall, U., 2003. Flood quantiles in a changing climate: Seasonal forecasts and causal relations. Water Resour. Res. 39, 51134, doi:10.1029/2002WR001593.
Scholz, F.W., Stephens, M.A., 1987. K-sample Anderson-Darling tests. J. Am. Statist. Assoc. 82, 918-924.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Stat. 6 (2), 461-464.
Shao, J., 1993. Linear model selection by cross validation. J. Amer. Stat. Assoc. 88, 486-494.
Shuzheng, C., Yinbo, X., 1987. The effect of discharge measurement error in flood frequency analysis. J. Hydrol. 96, 237-254.
Sivapalan, M., Takeuchi, K., Franks, S.W., Gupta, V.K., Karambiri, H., Lakshmi, V., Liang, X., McDonnell, J.J., Mendiondo, E.M., O’Connell, P.E., Oki, T., Pomeroy, J.W., Schertzer, D., Uhlenbrook, S., Zehe, E., 2003. IAHS Decade on Predictions in Ungauged Basins (PUB), 2003-2012: Shaping an exciting future for the hydrological sciences. Hydrol. Sci. J. 48 (6), 857-880.
Smith, J.A., 1987. Estimating the upper tail of flood frequency distributions. Water Resour. Res. 23 (18), 1657-1666.
Smith, J.A., 1992. Representation of basin scale in flood peak distributions. Water Resour. Res. 28 (11), 2993-2999.
Song Xu, Q., Zeng Liang, Y., 2001. Monte Carlo cross validation. Chemo. Int. Lab. Sys. 56, 1-11.
Song Xu, Q., Zeng Liang, Y., Ping Du, Y., 2005. Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration. J. Chemo. 18, 112-120, doi:10.1002/cem.858.
Stedinger, J.R., 1983a. Estimating a regional flood frequency distribution. Water Resour. Res. 19 (2), 503-510.
Stedinger, J.R., Cohn, T.A., 1986. Flood frequency analysis with historical and paleoflood information. Water Resour. Res. 22 (5), 785-793.
Stedinger, J.R., Lu, L.H., 1995. Appraisal of regional and index flood quantile estimators. Stoch. Hydrol. Hydraul. 9 (1), 49-75.
Stedinger, J.R., Tasker, G.D., 1985. Regional hydrologic analysis, 1. Ordinary, weighted, and generalised least squares compared. Water Resour. Res. 21 (9), 1421-1432.
Stedinger, J.R., Tasker, G.D., 1986. Correction to “Regional hydrologic analysis, 1. Ordinary, weighted, and generalised least squares compared”. Water Resour. Res. 22 (5), 844.
Stedinger, J.R., Tasker, G.D., 1986. Regional hydrologic analysis, 2. Model error estimators, estimation of sigma and log-Pearson type 3 distributions. Water Resour. Res. 22 (10), 1487-1499.
Stedinger, J.R., Vogel, R.M., Foufoula-Georgiou, E., 1993. Frequency analysis of extreme events. In: Handbook of Hydrology, McGraw Hill Book Co., NY, pp. 18.1-18.66 (Chapter 18).
Stewart, E.J., Reed, D.W., Faulkner, D.S., Reynard, N.S., 1999. The FORGEX method of rainfall growth estimation I: Review of requirement. Hydrol. Earth Syst. Sci. 3 (2), 187-195.
Stone, M., 1974. Cross validatory choice and assessment of statistical predictions. J. Royal Stat. Soc. 36 (2), 111-147.
Strahler, A.N., 1950. Equilibrium theory of erosional slopes approached by frequency distribution analysis. Amer. J. Sci. 248, 673-696, 800-814.
Sun, R., Chen, L., Bojie, F., 2011. Predicting monthly precipitation with multivariate regression methods using geographic and topographic information. J. Phys. Geo. 32 (3), 269-285, doi:10.2747/0272-3646.32.3.269.
Svensson, C., Jones, D.A., 2010. Review of rainfall frequency estimation methods. J. Flood Risk Manag. 3, 296-313, doi:10.1111/j.1753-318X.2010.01079.x.
Tasker, G.D., 1980. Hydrologic regression and weighted least squares. Water Resour. Res. 16 (6), 1107-1113.
Tasker, G.D., 1989. Regionalization of low flow characteristics using logistic and GLS regression. In: Proceedings of Symposium on New Directions for Surface Water Modeling, Baltimore, IAHS Publ. No. 181, 323-331.
Tasker, G.D., Driver, N.E., 1988. Nationwide regression model for predicting urban runoff water quality at unmonitored sites. Water Resour. Bul. 24 (5), 1091-1101.
Tasker, G.D., Eychaner, J.H., Stedinger, J.R., 1986. Application of generalised least squares in regional hydrologic regression analysis. US Geol. Survey Water Supply Paper 2310, 107-115.
Tasker, G.D., Hodge, S.A., Barks, C.S., 1996. Region of influence regression for estimating the 50-year flood at ungauged sites. Water Resour. Bull. 32 (1), 163-170.
Tasker, G.D., Moss, M.E., 1979. Analysis of Arizona flood data network for regional information. Water Resour. Res. 15 (6), 1791-1796.
Tasker, G.D., Stedinger, J.R., 1986. Estimating generalised skew with weighted least squares regression. J. Water Resour. Plan. and Manage. 112 (2), 225-237.
Tasker, G.D., Stedinger, J.R., 1987. Regional regression of flood characteristics employing historical information. In: W.H. Kirby, S.Q. Hua and L.R. Beard (eds), Analysis of Extra-ordinary Flood Events. J. Hydrol. 96, 255-264.
Tasker, G.D., Stedinger, J.R., 1989. An operational GLS model for hydrologic regression. J. Hydrol. 111, 361-375.
Thomas, D.M., Benson, M.A., 1970. Generalization of streamflow characteristics from drainage basin characteristics. US Geological Survey Water Supply Paper 1975, 55 pp.
Thomas, Jr., W.O., Olsen, S.A., 1992. Regional analysis of minimum streamflow. In: Proceedings of 12th Conference on Probability and Statistics in the Atmospheric Sciences, 5th International Meeting on Statistical Climatology, Toronto, Ont., 22-26 June, 1992, pp. 261-266.
Tsakiris, G., Nalbantis, I., Cavadias, G., 2011. Regionalization of low flows based on Canonical Correlation Analysis. Adv. Water Resour. 34, 865-872, doi:10.1016/j.advwatres.2011.04.007.
Tung, Y., Mays, L., 1981a. Generalized skew coefficients for flood frequency analysis. Water Resour. Bul. 17 (2).
Tung, Y., Mays, L., 1981b. Reducing hydrologic parameter uncertainty. J. Water Resour. Plan. and Manage. Div. 107, No. WR1.
Van Gelder, P.H.A.J.M., Wang, W., Vrijling, J.K., 2007. Statistical estimation methods for extreme hydrological events. In: O.F. Vasiliev et al. (eds), Extreme hydrological events: New concepts for security, 199-252, Springer.
Vannitsem, S., Naveau, P., 2007. Spatial dependences among precipitation maxima over Belgium. Nonlin. Processes. Geophys. 14, 621-630.
Veilleux, A.G., Stedinger, J.R., Lamontagne, J.R., 2011. Bayesian WLS/GLS regression for regional skewness analysis for regions with large cross-correlations among flood flows. In: EWRI World Environmental and Water Resources Congress, Palm Springs, California, United States, May 22-26, 2011.
Venetis, C., 1970. A note on the estimation of the parameters in a logarithmic stage-discharge relationship with estimation of their error. Bulletin IASH 15, 105-111.
Vogel, R.M., Kroll, C.N., 1989. Low-flow frequency analysis using probability-plot correlation coefficients. J. Water Resour. Plann. Mgmt. ASCE 115 (3), 338-357.
Vogel, R.M., Kroll, C.N., 1990. Generalised low-flow frequency relationships for ungauged sites in Massachusetts. Water Resour. Bul. 26 (2), 241-253.
Vogel, R.M., McMahon, T.A., Chiew, F.H.S., 1993. Flood flow frequency model selection in Australia. J. Hydrol. 146, 421-449.
Vogel, R.M., Matalas, N.C., England, J.F., Castellarin, A., 2007. An assessment of exceedance probabilities of envelope curves. Water Resour. Res. 43, W07403, doi:10.1029/2006WR005586.
Vrac, M., Naveau, P., Drobinski, P., 2007. Modeling pairwise dependencies in precipitation intensities. Nonlin. Processes. Geophys. 14, 789-797.
Wallis, J.R., Wood, E.F., 1985. Relative accuracy of Log Pearson 3 procedures. J. Hydrol. 111, 1043-1057.
Williamson, D.R., Van Der Wel, B., 1991. Quantification of the impact of dryland salinity on the Mount Lofty Ranges, SA. Intl. Hydrology and Water Resour. Symp., 48-52.
Wiltshire, S.E., 1986a. Identification of homogeneous regions for flood frequency analysis. J. Hydrol. 84 (3-4), 287-302.
Wiltshire, S.E., 1986b. Regional flood frequency analysis I: Homogeneity statistics. Hydrol. Sci. J. 31 (3), 321-333.
WMO, 1994. Guide to hydrological practices: data acquisition and processing, analysis, forecasting and other applications. WMO-No. 168, Geneva.
Wood, E.F., Rodriguez-Iturbe, I., 1975. Bayesian inference and decision making for extreme hydrological events. Water Resour. Res. 11 (4), 533-542.
Xuereb, K.C., Moore, G.J., Taylor, B.F., 2001. Development of the method of storm transposition and maximisation for the West Coast of Tasmania. Bureau of Meteorology, Australia, Hydrology Report Series, HRS Report No. 7, 2001.
Zaman, M., Rahman, A., Haddad, K., Hagare, D., 2012. Identification of best-fit probability distribution for at-site flood frequency analysis: A case study for Australia. In: Hydrology and Water Resources Symposium, Engineers Australia, 19-22 Nov 2012, Sydney, Australia.
Zellner, A., 1971. An Introduction to Bayesian Inference in Econometrics. John Wiley and Sons, Inc., New York.
Zhang, P., 1993. Model selection via multifold cross validation. Ann. Stat. 21, 299-313.
Zhu, Y., Day, R.L., 2005. Analysis of streamflow trends and the effects of climate in Pennsylvania, 1971 to 2001. J. American Water Resour. Assoc. 41 (6), 1393-1405.
Zrinji, Z., Burn, D.H., 1996. Regional flood frequency with hierarchical region of influence. J. Water Resour. Plann. Mgmt. ASCE 122 (4), 245-252.
APPENDIX A
A.1 PUBLISHED PAPERS FROM THIS RESEARCH
Haddad, K., Rahman, A., Zaman, M. and Shrestha, S. (2012). Applicability of Monte Carlo Cross Validation Technique for Model Development and Validation in Hydrologic Regression Analysis Using Ordinary and Generalised Least Squares Regression. Journal of Hydrology (ERA, Rank A*, Accepted with minor revision).
Haddad, K. and Rahman, A. (2012). Regional flood frequency analysis in eastern Australia: Bayesian GLS regression-based methods within fixed region and ROI framework: Quantile Regression vs. Parameter Regression Technique. Journal of Hydrology, DOI:10.1016/j.jhydrol.2012.02.012 (ERA, Rank A*).
Haddad, K., Rahman, A. and Stedinger, J.R. (2012). Regional Flood Frequency Analysis using Bayesian Generalized Least Squares: A Comparison between Quantile and Parameter Regression Techniques. Hydrological Processes, 26(7), 1008-1021, DOI:10.1002/hyp.8189 (ERA, Rank A).
Haddad, K., Rahman, A. and Kuczera, G. (2011). Comparison of Ordinary and Generalised Least Squares Regression Models in Regional Flood Frequency Analysis: A Case Study for New South Wales. Australian Journal of Water Resources, 15(2), 1-12 (ERA, Rank B).
Rahman, A., Haddad, K., Zaman, M., Kuczera, G. and Weinmann, P.E. (2011). Design flood estimation in ungauged catchments: A comparison between the Probabilistic Rational Method and Quantile Regression Technique for NSW. Australian Journal of Water Resources, 14(2), 127-140 (ERA, Rank B).
Haddad, K., Rahman, A. and Weinmann, P.E. (2011). Estimation of major floods: applicability of a simple probabilistic model. Australian Journal of Water Resources, 14(2), 117-126 (ERA, Rank B).
Haddad, K., Rahman, A., Weinmann, P.E., Kuczera, G. and Ball, J.E. (2010). Streamflow data preparation for regional flood frequency analysis: Lessons from south-east Australia. Australian Journal of Water Resources, 14(1), 17-32 (ERA, Rank B).
Haddad, K., Zaman, M. and Rahman, A. (2010). Regionalisation of skew for flood frequency analysis: a case study for eastern NSW. Australian Journal of Water Resources, 14(1), 33-41 (ERA, Rank B).
Haddad, K. and Rahman, A. (2010). Selection of the best fit flood frequency distribution and parameter estimation procedure – A case study for Tasmania in Australia, Stochastic Environmental Research & Risk Assessment, DOI: 10.1007/s00477-010-0412-1 (ERA, Rank B).
APPENDIX B
B.1 FURTHER RESULTS ASSOCIATED WITH VICTORIA AND
QUEENSLAND (FROM CHAPTER 5)
Table 52 Summary of the final BGLSR results for VIC
GLSR model (VIC)      Regression coefficient   Mean    St Dev   AVPO    AVPN    AIC     BIC     BPV %   R2 GLSR

Mean µ                σ²                       0.29    0.042
                      β0 (constant)            3.22    0.10                                     0
                      β1 (LN area)             0.61    0.040    0.31    0.29    0.31    0.32    0       63%
                      β2 (LN 2I12)             1.50    0.28                                     0
Standard deviation σ  σ²                       0.043   0.012
                      β0 (constant)            1.16    0.10                                     0
                      β1 (LN rain)             -0.83   0.10     0.048   0.046   0.074   0.077   1       65%
                      β2 (LN evap)             1.49    0.65                                     2
Skewness γ            σ²                       0.034   0.027
                      β0 (constant)            -0.65   0.051                                    0
                      β1 (LN rain)             0.74    0.15     0.042   0.040   0.113   0.118   1       70%
                      β2 (LN evap)             -3.25   1.26                                     1
Flood quantiles
QARI=2                σ²                       0.27    0.039
                      β0 (constant)            3.38    0.099                                    0
                      β1 (LN area)             0.90    0.089    0.28    0.27    0.29    0.30    0       63%
                      β2 (LN Itc,ARI=2)        1.35    0.32                                     0
QARI=5                σ²                       0.29    0.043
                      β0 (constant)            4.17    0.10                                    0
                      β1 (LN area)             0.92    0.098    0.31    0.30    0.32    0.33    0       61%
                      β2 (LN Itc,ARI=5)        1.32    0.35                                     0
QARI=10               σ²                       0.35    0.039
                      β0 (constant)            4.55    0.11                                     0
                      β1 (LN area)             0.94    0.055    0.37    0.35    0.38    0.39    0       57%
                      β2 (LN Itc,ARI=10)       1.42    0.35                                     0
QARI=20               σ²                       0.35    0.036
                      β0 (constant)            4.82    0.12                                     0
                      β1 (LN area)             0.97    0.066    0.37    0.35    0.40    0.41    0       57%
                      β2 (LN Itc,ARI=20)       1.50    0.36                                     0
QARI=50               σ²                       0.47    0.050
                      β0 (constant)            5.17    0.14                                     0
                      β1 (LN area)             0.99    0.073    0.49    0.47    0.53    0.56    0       49%
                      β2 (LN Itc,ARI=50)       1.62    0.42                                     4
QARI=100              σ²                       0.59    0.067
                      β0 (constant)            5.24    0.17                                     0
                      β1 (LN area)             0.98    0.075    0.60    0.60    0.60    0.64    0       45%
                      β2 (LN Itc,ARI=100)      1.63    0.46                                     5
Table 53 Summary of the final BGLSR results for QLD
GLSR model (QLD)      Regression coefficient   Mean    St Dev   AVPO    AVPN    AIC     BIC     BPV %   R2 GLSR

Mean µ                σ²                       0.23    0.032
                      β0 (constant)            4.71    0.074                                    0
                      β1 (LN area)             0.74    0.043    0.24    0.23    0.27    0.28    0       77%
                      β2 (LN 2I12)             1.97    0.15                                     0
Standard deviation σ  σ²                       0.13    0.015
                      β0 (constant)            1.37    0.10                                     0
                      β1 (LN area)             -0.025  0.032    0.13    0.13    0.20    0.20    42      35%
                      β2 (LN 2I12)             -1.41   0.13                                     2
Skewness γ            σ²                       0.015   0.014
                      β0 (constant)            -0.63   0.066                                    0
                      β1 (LN 50I72)            -0.32   0.19     0.026   0.025   0.18    0.18    8       46%
                      β2 (LN rain)             0.36    0.18                                     4
Flood quantiles
QARI=2                σ²                       0.26    0.036
                      β0 (constant)            4.80    0.079                                    0
                      β1 (LN area)             1.35    0.078    0.27    0.26    0.28    0.29    0       75%
                      β2 (LN Itc,ARI=2)        2.57    0.19                                     0
QARI=5                σ²                       0.17    0.026
                      β0 (constant)            5.77    0.080                                    0
                      β1 (LN area)             1.16    0.075    0.18    0.17    0.17    0.18    0       79%
                      β2 (LN Itc,ARI=5)        1.95    0.17                                     0
QARI=10               σ²                       0.18    0.028
                      β0 (constant)            6.25    0.079                                    0
                      β1 (LN area)             1.00    0.058    0.19    0.18    0.19    0.20    0       74%
                      β2 (LN Itc,ARI=10)       1.67    0.13                                     0
QARI=20               σ²                       0.14    0.025
                      β0 (constant)            6.59    0.10                                     0
                      β1 (LN area)             0.99    0.065    0.16    0.15    0.18    0.19    0       77%
                      β2 (LN Itc,ARI=20)       1.42    0.17                                     0
QARI=50               σ²                       0.17    0.029
                      β0 (constant)            6.97    0.094                                    0
                      β1 (LN area)             0.91    0.073    0.19    0.18    0.21    0.22    0       72%
                      β2 (LN Itc,ARI=50)       1.19    0.19                                     0
QARI=100              σ²                       0.20    0.033
                      β0 (constant)            7.23    0.099                                    0
                      β1 (LN area)             0.86    0.078    0.22    0.21    0.25    0.26    0       72%
                      β2 (LN Itc,ARI=100)      1.01    0.20                                     0
Figure 66 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, fixed region, VIC)
[Figure: standardised residual vs. fitted LN(Q20); series BGLSR-QRT and BGLSR-PRT, fixed region]

Figure 67 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, ROI, VIC)
[Figure: standardised residual vs. fitted LN(Q20); series BGLSR-QRT and BGLSR-PRT, ROI]
Figure 68 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, fixed region, VIC)
[Figure: normal score vs. standardised residual; series BGLSR-QRT and BGLSR-PRT, fixed region]

Figure 69 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, ROI, VIC)
[Figure: normal score vs. standardised residual; series BGLSR-QRT and BGLSR-PRT, ROI]
Figure 70 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, fixed region, QLD)
[Figure: standardised residual vs. fitted LN(Q20); series BGLSR-QRT and BGLSR-PRT, fixed region]

Figure 71 Plots of standardised residuals vs. predicted values for ARI of 20 years (QRT and PRT, ROI, QLD)
[Figure: standardised residual vs. fitted LN(Q20); series BGLSR-QRT and BGLSR-PRT, ROI]
Figure 72 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, fixed region, QLD)
[Figure: normal score vs. standardised residual; series BGLSR-QRT and BGLSR-PRT, fixed region]

Figure 73 QQ-plot of the standardised residuals vs. Z score for ARI of 20 years (QRT and PRT, ROI, QLD)
[Figure: normal score vs. standardised residual; series BGLSR-QRT and BGLSR-PRT, ROI]
APPENDIX C
C.1 FURTHER RESULTS ASSOCIATED WITH THE LFRM (FROM
CHAPTERS 7 AND 8)
Figure 74 L-moment ratio diagram of annual maximum flood series data for VIC
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 75 L-moment ratio diagram of annual maximum flood series data for WA
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 76 L-moment ratio diagram of annual maximum flood series data for SA
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 77 L-moment ratio diagram of annual maximum flood series data for TAS
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 78 L-moment ratio diagram of annual maximum flood series data for NT
[Figure: L-kurtosis vs. L-skewness with curves/points for the GLO, LN, LN3, GAM, NORM, P3, GEV, EV1, UNIF and GPA distributions and the regional average (RAve)]
Figure 79 Visual inspection of distributional fit for GEV, GPA and P3 distributions for NSW
[Figure: standardised data vs. ARI (years, log scale, 1-10000); observed data with fitted GEV, GPA and P3 curves]
Figure 80 Visual inspection of distributional fit for GEV, GPA and P3 distributions for VIC
[Figure: standardised data vs. ARI (years, log scale, 1-10000); observed data with fitted GEV, GPA and P3 curves]
Figure 81 Variation of Ne with different network methods and experiment number for TAS region (top panel for real data and bottom panel for simulated data)
[Figure: Ne vs. experiment number for network methods N = 2, 4 and 8; simulated panels vary the correlation coefficient from 0 to 0.5]
Figure 82 Frequency of Ne with different network methods for TAS region (top panel for real data and bottom panel for simulated data)
[Figure: histograms of Ne for network methods N = 2, 4 and 8, for real and simulated data]
Figure 83 Variation of Ne with different network methods and experiment number for NT region (top panel for real data and bottom panel for simulated data)
[Figure: Ne vs. experiment number for network methods N = 2, 4 and 8; simulated panels vary the correlation coefficient from 0 to 0.5]
Figure 84 Frequency of Ne with different network methods for NT region (top panel for real data and bottom panel for simulated data)
[Figure: histograms of Ne for network methods N = 2, 4 and 8, for real and simulated data]
Figure 85 Variation of Ne with different network methods and experiment number for WA region (top panel for real data and bottom panel for simulated data)
[Figure: Ne vs. experiment number for network methods N = 2, 4 and 8; simulated panels vary the correlation coefficient from 0 to 0.5]
Figure 86 Frequency of Ne with different network methods for WA region (top panel for real data and bottom panel for simulated data)
[Figure: histograms of Ne for network methods N = 2, 4 and 8, for real and simulated data]
Figure 87 Variation of Ne with different network methods and experiment number for SA region (top panel for real data and bottom panel for simulated data)
[Figure: Ne vs. experiment number for network methods N = 2, 4 and 8; simulated panels vary the correlation coefficient from 0 to 0.5]
Figure 88 Selection of predictor variables for the BGLSR model for CV - WA
[Figure: R-sqd GLSR (left axis) with MEV and standard error of MEV (right axis) vs. combination of catchment characteristics 1-16]
Figure 89 Selection of predictor variables for the BGLSR model for CV using AVPO, AVPN, AIC and BIC - WA
[Figure: AVPO, AVPN, AIC and BIC vs. combination of catchment characteristics 1-16]
Figure 90 Selection of predictor variables for the BGLSR model for the mean flood - WA
[Figure: MEV, standard error of MEV and R-sqd GLSR vs. combination of catchment characteristics 1-16]
Figure 91 Selection of predictor variables for the BGLSR model for the mean flood using AVPO, AVPN, AIC and BIC - WA
[Figure: AVPO, AVPN, AIC and BIC vs. combination of catchment characteristics 1-16]
Figure 92 Selection of predictor variables for the BGLSR model for CV - TAS
[Figure: R-sqd GLSR (left axis) with MEV and standard error of MEV (right axis) vs. combination of catchment characteristics 1-16]
Figure 93 Selection of predictor variables for the BGLSR model for CV using AVPO, AVPN, AIC and BIC - TAS
[Figure: AVPO, AVPN, AIC and BIC vs. combination of catchment characteristics 1-16]
Figure 94 Selection of predictor variables for the BGLSR model for the mean flood - TAS
[Figure: MEV, standard error of MEV and R-sqd GLSR vs. combination of catchment characteristics 1-16]
Figure 95 Selection of predictor variables for the BGLSR model for the mean flood using AVPO, AVPN, AIC and BIC - TAS
[Figure: AVPO, AVPN, AIC and BIC vs. combination of catchment characteristics 1-16]
APPENDIX D
D.1 L-MOMENT RATIO DIAGRAMS AND GOODNESS-OF-FIT TEST
Hosking (1990) introduced the L-moment ratio diagram for the purpose of selecting
suitable distributions in frequency analysis. An L-moment ratio diagram compares sample
estimates of L-Skewness (LSK) and L-Kurtosis (LKT) with their population counterparts,
for a range of assumed distributions.
Hosking and Wallis (1991) presented a goodness-of-fit measure based on t4, the regional average of the sample L-kurtosis (LKT), mainly for three-parameter distributions. Since all three-parameter distributions fitted to the data will have the same L-skewness t3 on the LCV vs. LSK diagram, the quality of fit can be judged by the difference between the regional average t4 and the value τ4^DIST for the fitted distribution. The statistic Z^DIST is defined below:

Z^DIST = (t4 − τ4^DIST) / σ4        (D.1)

which is a goodness-of-fit measure, where σ4 is the standard deviation of t4. The value of σ4 can be obtained by simulation after fitting a Kappa distribution to the observations (Hosking, 1988). A fit is declared adequate if Z^DIST is sufficiently close to zero, a reasonable criterion being |Z^DIST| ≤ 1.64.
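To make Eq. (D.1) concrete, a minimal Python sketch is given below (hypothetical helper names, not part of the thesis software). The sample L-moment ratios are computed from the unbiased probability weighted moment estimators of Hosking (1990); τ4^DIST and σ4 are assumed to be supplied externally, from the fitted distribution and from Kappa-distribution simulation respectively.

```python
def sample_l_moment_ratios(x):
    """Sample L-skewness t3 and L-kurtosis t4 from unbiased
    probability weighted moments (Hosking, 1990); needs len(x) >= 4."""
    x = sorted(x)
    n = len(x)
    b = [0.0, 0.0, 0.0, 0.0]
    for j, xj in enumerate(x, start=1):
        b[0] += xj
        b[1] += xj * (j - 1) / (n - 1)
        b[2] += xj * (j - 1) * (j - 2) / ((n - 1) * (n - 2))
        b[3] += xj * (j - 1) * (j - 2) * (j - 3) / ((n - 1) * (n - 2) * (n - 3))
    b = [bi / n for bi in b]
    l2 = 2 * b[1] - b[0]                            # L-scale
    l3 = 6 * b[2] - 6 * b[1] + b[0]                 # third L-moment
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]   # fourth L-moment
    return l3 / l2, l4 / l2


def z_dist(t4_regional, tau4_dist, sigma4):
    """Goodness-of-fit statistic of Eq. (D.1); |Z| <= 1.64 is adequate."""
    return (t4_regional - tau4_dist) / sigma4
```

Here t4_regional would be the (record-length-weighted) average of the site t4 values over the region.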
D.2 ANDERSON-DARLING MONTE CARLO SIMULATION GOODNESS-
OF-FIT TEST
Given a sample xi (i = 1, . . . , n) of annual maximum flood data extracted from a
distribution FR(x), the test is used to check the null hypothesis H0 : FR(x) = F(x, θ), where
F(x, θ) is the hypothetical distribution and θ is an array of parameters estimated from the
sample xi. The Anderson-Darling (AD) goodness-of-fit test measures the departure between
the hypothetical distribution F(x, θ) and the cumulative frequency function Fn(x) defined as:
Fn(x) = 0,     x < x(1)
Fn(x) = i/n,   x(i) ≤ x < x(i+1)
Fn(x) = 1,     x ≥ x(n)        (D.2)

where x(i) is the i-th element of the ordered sample (arranged in increasing order). The test statistic is:

A² = n ∫ [Fn(x) − F(x, θ)]² Ψ(x) dF(x)        (D.3)

where Ψ(x), in the case of the AD test (Laio, 2004), is Ψ(x) = [F(x, θ)(1 − F(x, θ))]⁻¹. In practice, the statistic is calculated as:

A² = −n − (1/n) Σ_{i=1}^{n} {(2i − 1) ln[F(x(i), θ)] + (2n + 1 − 2i) ln[1 − F(x(i), θ)]}        (D.4)
The statistic A², obtained in this way, may be compared with the population of A² values that one obtains if the sample truly belongs to the hypothetical distribution F(x, θ). For the test of normality, this distribution is defined as shown in Laio (2004). For other distributions, e.g. P3 or GEV, the distribution of the test statistic can be derived using Monte Carlo simulation, as done here. The results in Table 7.1 for the AD test are reported as P-values at a significance level of 5%; hence a value of P > 0.95 indicates that the hypothesis that the particular distribution is the parent is rejected.
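The computational form (D.4) is straightforward to implement. The sketch below (illustrative function names, not from the thesis software) evaluates A² for a normal parent as an example; any other candidate distribution can be tested by passing its fitted CDF, with the Monte Carlo null distribution generated as described above.

```python
import math


def normal_cdf(x, mu=0.0, sd=1.0):
    """CDF of the normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))


def anderson_darling(sample, cdf):
    """A^2 statistic of Eq. (D.4); `cdf` is F(., theta) with theta fitted."""
    x = sorted(sample)
    n = len(x)
    s = 0.0
    for i, xi in enumerate(x, start=1):
        f = cdf(xi)
        s += (2 * i - 1) * math.log(f) + (2 * n + 1 - 2 * i) * math.log(1.0 - f)
    return -n - s / n
```

A large A², relative to its simulated null distribution, signals departure from the hypothetical parent.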
D.3 HOMOGENEITY TEST OF HOSKING AND WALLIS
The Hosking and Wallis test assesses the homogeneity of a group of catchments at three
different levels by focussing on three measures of dispersion for different orders of the
sample L-moment ratios (see Hosking (1990) for an explanation of L-moments).
A measure of dispersion for the LCV:

V1 = { Σ_{i=1}^{R} n_i (t2^(i) − t̄2)² / Σ_{i=1}^{R} n_i }^(1/2)        (D.5)

A measure of dispersion for both the LCV and the LSK coefficients in the LCV-LSK space:

V2 = Σ_{i=1}^{R} n_i [(t2^(i) − t̄2)² + (t3^(i) − t̄3)²]^(1/2) / Σ_{i=1}^{R} n_i        (D.6)

A measure of dispersion for both the LSK and the LKT coefficients in the LSK-LKT space:

V3 = Σ_{i=1}^{R} n_i [(t3^(i) − t̄3)² + (t4^(i) − t̄4)²]^(1/2) / Σ_{i=1}^{R} n_i        (D.7)

where t̄2, t̄3 and t̄4 are the group means of LCV, LSK and LKT respectively; t2^(i), t3^(i), t4^(i) and n_i are the values of LCV, LSK, LKT and the sample size for site i; and R is the number of sites in the pooling group.
The underlying concept of this test is to measure the sampling variability of the L-moment ratios and compare it with the variation that would be expected for a homogeneous group. The expected mean value and standard deviation of these dispersion measures for a homogeneous group, µVk and σVk respectively, are assessed through repeated simulations, by generating homogeneous groups of catchments having the same record lengths as those of the observed data, following the methodology proposed by Hosking and Wallis (1993). The heterogeneity measures are then evaluated using the following expression:
Hk = (Vk − µVk) / σVk ,  for k = 1, 2, 3        (D.8)
Hosking and Wallis (1993) suggested that the region or group of sites should be considered ‘acceptably homogeneous’ if H < 1, ‘possibly heterogeneous’ if 1 ≤ H < 2, and ‘definitely heterogeneous’ if H ≥ 2.
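As a sketch of how the first heterogeneity measure is computed in practice (hypothetical helper names; the simulated V values would come from repeatedly generating homogeneous regions from a fitted regional distribution, which is taken as given here):

```python
import statistics


def v1(t2_sites, n_sites):
    """Record-length-weighted dispersion of site LCV values, Eq. (D.5)."""
    total_n = sum(n_sites)
    t_bar = sum(n * t for n, t in zip(n_sites, t2_sites)) / total_n
    ss = sum(n * (t - t_bar) ** 2 for n, t in zip(n_sites, t2_sites))
    return (ss / total_n) ** 0.5


def heterogeneity(v_obs, v_sim):
    """H statistic of Eq. (D.8): the mean and standard deviation of the
    simulated V values play the roles of mu_V and sigma_V."""
    return (v_obs - statistics.mean(v_sim)) / statistics.stdev(v_sim)
```

An observed V1 near the centre of the simulated values gives H close to zero (acceptably homogeneous); H ≥ 2 flags definite heterogeneity.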
D.4 THE BOOTSTRAP ANDERSON-DARLING HOMOGENEITY TEST
The AD test is based on the comparison between the local and regional empirical distribution functions. The empirical distribution function, or sample distribution function, is defined by F(x) = j/n, x(j) ≤ x < x(j+1), where n is the sample size and x(j) are the order statistics, i.e. the observations arranged in ascending order. Denote the empirical distribution function of the i-th sample (local) by F̂i(x), and that of the pooled sample of all N = n1 + … + nk observations (regional) by HN(x).
The k-sample Anderson-Darling test statistic is then defined as:

AD = Σ_{i=1}^{k} n_i ∫ [F̂i(x) − HN(x)]² / {HN(x)[1 − HN(x)]} dHN(x)        (D.9)

where the integral is taken over all x. If the pooled ordered sample is Z1, …, ZN, the computational formula to evaluate AD is:
AD = (1/N) Σ_{i=1}^{k} (1/n_i) Σ_{j=1}^{N−1} (N Mij − j ni)² / [j(N − j)]        (D.10)
where ijM is the number of observations in the i-th sample that are not greater than jZ . The
homogeneity test can be carried out by comparing the obtained AD value to the tabulated
percentage points reported by Scholz and Stephens (1987) for the different significance
levels.
The statistic AD depends on the sample values only through their ranks. This guarantees
that the test statistic remains unchanged when the samples undergo monotonic
transformation, an important stability property not possessed by the Hosking and Wallis
(1993) heterogeneity measure. However, problems arise in applying this test in a common
index value procedure. In fact, the index value procedure corresponds to dividing each site
sample by a different value, thus modifying the ranks in the pooled sample. In particular,
this has an effect of making the local empirical functions much more similar to each other,
providing an impression of homogeneity even when the samples are highly heterogeneous.
The effect is equivalent to that encountered when applying goodness-of-fit tests to
distributions whose parameters are estimated from the same sample used for the test (e.g.
D’Agostino and Stephens, 1986 and Laio, 2004). In both cases, the percentage points for the test should be appropriately recalculated. This may be achieved with a nonparametric
bootstrap approach, which is presented in the following steps:
1. Build up the pooled sample S of the observed non-dimensional data.
2. Sample with replacement from S and generate k simulated local samples of size n1, …, nk.
3. Divide each sample by its index value and calculate AD^(1).
4. Repeat the procedure Nsim times to obtain a sample of values AD^(j), j = 1, …, Nsim, whose empirical distribution function can be used as an approximation of G_H0(AD), the distribution of AD under the null hypothesis of homogeneity.

The acceptance limits for the test, corresponding to any significance level α, are then easily determined as the quantiles of G_H0(AD) corresponding to probability (1 − α). The result is usually reported as a P-value.
D.5 GUMBEL VARIATES CORRESPONDING TO ARI
Table 54 Values of YT corresponding to ARI
ARI YT
2 0.37
5 1.50
10 2.25
20 2.97
50 3.90
100 4.60
200 5.30
500 6.21
1000 6.91
2000 7.60
3000 8.01
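The tabulated values correspond to the Gumbel reduced variate for an annual-maximum series, YT = −ln(−ln(1 − 1/T)) for ARI T in years, which reproduces Table 54 to two decimal places. A minimal check (illustrative function name):

```python
import math


def gumbel_variate(ari):
    """Gumbel reduced variate Y_T = -ln(-ln(1 - 1/T)) for ARI T (years)."""
    return -math.log(-math.log(1.0 - 1.0 / ari))


# Reproduce Table 54 to two decimal places
for t in (2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 3000):
    print(t, round(gumbel_variate(t), 2))
```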