
Please cite this paper as:

Davydenko, A., & Fildes, R. (2014). Measuring Forecasting Accuracy: Problems and Recommendations (by the Example of SKU-Level Judgmental Adjustments). In Intelligent Fashion Forecasting Systems: Models and Applications (pp. 43-70). Springer Berlin Heidelberg.

Measuring Forecasting Accuracy: Problems and Recommendations (by the example of SKU-level judgmental adjustments)¹

Andrey Davydenko², Robert Fildes

Department of Management Science, Lancaster University, Lancaster, LA1 4YX, UK

Abstract  Forecast adjustment commonly occurs when organizational forecasters adjust a statistical forecast of demand to take into account factors which are excluded from the statistical calculation. This paper addresses the question of how to measure the accuracy of such adjustments. We show that many existing error measures are generally not suited to the task, due to specific features of the demand data. Alongside the well-known weaknesses of existing measures, a number of additional effects are demonstrated that complicate the interpretation of measurement results and can even lead to false conclusions being drawn. In order to ensure an interpretable and unambiguous evaluation, we recommend the use of a metric based on aggregating performance ratios across time series using the weighted geometric mean. We illustrate that this measure has the advantage of treating over- and under-forecasting even-handedly, has a more symmetric distribution, and is robust.

Empirical analysis using the recommended metric showed that, on average, adjustments yielded improvements under symmetric linear loss, while harming accuracy in terms of some traditional measures. This provides further support for the critical importance of selecting appropriate error measures when evaluating forecasting accuracy. The general accuracy evaluation scheme recommended in the paper is applicable in a wide range of settings, including forecasting for the fashion industry.

Keywords: judgmental adjustments, forecasting support systems, forecast accuracy, forecast evaluation, forecast error measures.

¹ This paper is an extended version of Davydenko and Fildes (2013), which appeared in the International Journal of Forecasting.

² Corresponding author. E-mail: [email protected].


1. Introduction

The most well-established approach to forecasting within supply chain companies starts with a statistical time series forecast, which is then adjusted by managers in the company based on their expert knowledge. This process is usually carried out at a highly disaggregated level of SKUs (stock-keeping units), where there are often hundreds if not thousands of series to consider (Sanders & Ritzman, 2004; Fildes & Goodwin, 2007). At the same time, the empirical evidence suggests that judgments under uncertainty are affected by various types of cognitive biases and are inherently non-optimal (Tversky & Kahneman, 1974). Such biases and inefficiencies have been shown to apply specifically to judgmental adjustments (Fildes, Goodwin, Lawrence, & Nikolopoulos, 2009). It is therefore important to monitor the accuracy of judgmental adjustments in order to ensure the rational use of the organisation's resources which are invested in the forecasting process.

The task of measuring the accuracy of judgmental adjustments is inseparably linked with the need to choose an appropriate error measure. In fact, the choice of an error measure for assessing the accuracy of forecasts across time series is itself an important topic for forecasting research. It has theoretical implications for the comparison of forecasting methods and is of wide practical importance, since the forecasting function is often evaluated using inappropriate measures (see, for example, Armstrong & Collopy, 1992; Armstrong & Fildes, 1995), and therefore the link to economic performance may well be distorted. Despite the continuing interest in the topic, the choice of the most suitable error measure for evaluating companies' forecasts still remains controversial. Due to their statistical properties, popular error measures do not always ensure easily interpretable results when applied to real-world data (Hyndman & Koehler, 2006; Kolassa & Schutz, 2007). In practice, the proportion of firms which track aggregated accuracy is surprisingly small, and one apparent reason for this is the inability to agree on appropriate accuracy metrics (Hoover, 2006). As McCarthy, Davis, Golicic, and Mentzer (2006) reported, only 55% of the companies surveyed believed that their forecasting performance was being formally evaluated.

The key issue when evaluating a forecasting process is the improvement achieved in supply chain performance. While this has only an indirect link to forecasting accuracy, organisations rely on accuracy improvements as a suitable proxy measure, not least because of their ease of calculation. This paper examines the behaviour of various well-known error measures in the particular context of demand forecasting in the supply chain. We show that, due to the features of SKU demand data, well-known error measures are generally not advisable for the evaluation of judgmental adjustments, and can even give misleading results. To be useful in supply chain applications, an error measure usually needs to have the following properties: (i) scale independence, though it is sometimes desirable to weight measures according to some characteristic such as their profitability; (ii) robustness to outliers; and (iii) interpretability (though the focus might occasionally shift to extremes, e.g., where ensuring a minimum level of supply is important).

The most popular measure used in practice is the mean absolute percentage error, MAPE (Fildes & Goodwin, 2007), which has long been criticised (see, for example, Fildes, 1992; Hyndman & Koehler, 2006; Kolassa & Schutz, 2007). In particular, the use of percentage errors is often inadvisable, due to the large number of extremely high percentages which arise from relatively low actual demand values.

To overcome the disadvantages of percentage measures, the MASE (mean absolute scaled error) was proposed by Hyndman and Koehler (2006). The MASE is a relative error measure which uses the MAE (mean absolute error) of a benchmark forecast (specifically, of the random walk) as its denominator. In this paper we analyse the MASE and show that, like the MAPE, it also has a number of disadvantages. Most importantly: (i) it introduces a bias towards overrating the performance of a benchmark forecast as a result of arithmetic averaging; and (ii) it is vulnerable to outliers, as a result of dividing by small benchmark MAE values.

To ensure a more reliable evaluation of the effectiveness of adjustments, this paper proposes the use of an enhanced measure that shows the average relative improvement in MAE. In contrast to the MASE, it is proposed that the weighted geometric average be used to find the average relative MAE. By taking the statistical forecast as a benchmark, it becomes possible to evaluate the relative change in forecasting accuracy yielded by the use of judgmental adjustments, without experiencing the limitations of other standard measures. Therefore, the proposed statistic can be used to provide a more robust and easily interpretable indicator of changes in accuracy, meeting the criteria laid down earlier.

The importance of the choice of an appropriate error measure is justified by the fact that previous studies of the gains in accuracy from the judgmental adjustment process have produced conflicting results (e.g., Fildes et al., 2009; Franses & Legerstee, 2010). In these studies, different measures were applied to different datasets and arrived at different conclusions. Some studies where a set of measures was employed reported an interesting picture, where adjustments improved the accuracy in certain settings according to the MdAPE (median absolute percentage error), while harming the accuracy in the same settings according to the MAPE (Fildes et al., 2009; Trapero, Pedregal, Fildes, & Weller, 2011). In practice, such results may be damaging for forecasters and forecast users, since they do not give a clear indication of the changes in accuracy that correspond to some well-known loss function. Using real-world data, this paper considers the appropriateness of various previously used measures, and demonstrates the use of the proposed enhanced accuracy measurement scheme.

The next section describes the data employed for the analysis in this paper. Section 3 illustrates the disadvantages and limitations of various well-known error measures when they are applied to SKU-level data. In the fourth section, the proposed accuracy measure is introduced. The fifth section contains the results from measuring the accuracy of judgmental adjustments with real-world data using the alternative measures and explains the differences in the results, demonstrating the benefits of the proposed enhanced accuracy measure. The concluding section summarises the results of the empirical evaluation and offers practical recommendations as to which of the different error measures can be employed safely.

2. Descriptive analysis of the source data

The current research employed data collected from a company specialising in the manufacture of fast-moving consumer goods (FMCG) which are fashionable in nature. This is an extended data set from one of the companies considered by Fildes et al. (2009). The company concerned is a leading European provider of household and personal care products to a wide range of major retailers. Table 1 summarises the data set and contains the number of cases used for the analysis. Each case includes (i) the one-step-ahead monthly forecast prepared using some statistical method (this will be called the system forecast); (ii) the corresponding judgmentally adjusted forecast (this will be called the final forecast); and (iii) the corresponding actual demand value. The system forecast was obtained using an enterprise software package, and the final forecast was obtained as a result of a revision of the statistical forecast by experts (Fildes et al., 2009). The two forecasts coincide when the experts had no extra information to add. The data set is representative of most FMCG manufacturing or distribution companies which deal with large numbers of time series of different lengths relating to different products, and is similar to the other manufacturing data sets considered by Fildes et al. (2009) in terms of the total number of time series, the proportion of judgmentally adjusted forecasts, and the frequencies of occurrence of zero errors and zero actuals.

Table 1: Source data summary.

Total number of cases: 6882
Total number of time series (SKUs): 412
Period of observations: Mar 2004 to Jul 2007
Total number of adjusted statistical forecasts (% of total number of cases): 4779 (69%)
Number of zero actual demand periods (% of total number of cases): 271 (4%)
Number of zero-error statistical forecasts (% of total number of cases): 47 (<1%)
Number of zero-error judgmentally adjusted forecasts (% of total number of adjusted forecasts): 61 (1%)
Number of positive adjustments (% of total number of adjusted forecasts): 3394 (71%)
Number of negative adjustments (% of total number of adjusted forecasts): 1385 (29%)

Since the data relate to FMCG, the numbers of cases of zero demand periods and zero errors are not large (see Table 1). However, the further investigation of the properties of error measures presented in Section 3 will also consider possible situations where the data involve small counts and zero observations occur more frequently (as is common with intermittent demand data).

As Table 1 shows, for this particular data set, adjustments of positive sign occur more frequently than adjustments of negative sign. However, in order to characterise the average magnitude of the adjustments, an additional analysis is required. In their study of judgmental adjustments, Fildes et al. (2009) analysed the size of judgmental adjustments using the relative adjustment, defined as 100 × (Final forecast − System forecast)/System forecast. As the values of the relative adjustments are scale-independent, they can be compared across time series. However, the above measure is asymmetrical. For example, if an expert doubles a statistical forecast (say from 10 units to 20 units), he/she increases it by 100%, but if he/she halves a statistical forecast (say from 20 units to 10 units), he/she decreases it by 50% (not 100%). The sampling distribution of the relative adjustment is bounded by −100% on the left side and unbounded on the right side (see Fig. 1). Generally, these effects mean that the distribution of the relative adjustment may become non-informative about the size of the adjustment as measured on the original scale. When defining a 'symmetric measure', Mathews and Diamantopoulos (1987) argued for a measure where the adjustment size is measured relative to an average of the system and final forecasts. The same principle is used in the symmetric MAPE (sMAPE) measure proposed by Makridakis (1993). However, Goodwin and Lawton (1999) later showed that such approaches still do not lead to the desirable property of symmetry.


Fig. 1. Histogram of the relative adjustment, measured in percentages.

In this paper, in order to avoid the problem of the non-symmetrical scale of the relative adjustment, we carry out the analysis of the magnitude of adjustments using the natural logarithm of the (Final forecast/System forecast) ratio. The log transformation is a common approach to restoring symmetry with ratio data, since $\ln(A/B) = -\ln(B/A)$ for any positive numbers $A$ and $B$.

From Fig. 2, it can be seen that the log-transformed relative adjustment follows a leptokurtic distribution, and this distribution is still non-symmetrical (although not as severely as for the original data shown in Fig. 1). As is well known, the sample mean is not an efficient measure of location under departures from normality (Wilcox, 2005). We therefore used the trimmed mean as a more robust summary measure of location. The optimal trim level that corresponds to the lowest variance of the trimmed mean depends on the distribution, which is unknown in the current case. Some studies have shown that, for symmetrical distributions, a 5% trim generally ensures high efficiency with a useful degree of robustness (e.g., Hill & Dixon, 1982). However, it is also known that the trimmed mean gives a biased estimate if the distribution is skewed (Marques, Neves, & Sarmento, 2000). We used a 2% trim in order to eliminate the influence of outliers while at the same time avoiding the introduction of a substantial bias.
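For illustration, a minimal Python sketch (with made-up forecast values) of the log-ratio transformation and its 2% trimmed mean:

import numpy as np
from scipy.stats import trim_mean

# Hypothetical paired forecasts for adjusted cases only (illustrative values).
final_forecast  = np.array([20.0, 15.0, 8.0, 120.0, 40.0])
system_forecast = np.array([10.0, 18.0, 8.5, 100.0, 55.0])

log_ratio = np.log(final_forecast / system_forecast)  # ln(Final/System)
loc = trim_mean(log_ratio, proportiontocut=0.02)      # 2% trimmed mean (each tail)
print(loc, np.exp(loc))                               # location on the log and ratio scales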

The results presented in Table 2 suggest that, on average, for the dataset under consideration, the magnitude of positive adjustments is higher than the magnitude of negative adjustments, measured relative to the system forecast. Even after using a log scale to treat ratios to the baseline symmetrically, the magnitude of positive adjustments is pronouncedly higher than the magnitude of negative ones. The average magnitude of a positive relative adjustment is about twice as large as the average magnitude of a negative adjustment. Also, adjustments with positive signs have much wider ranges than negative ones.

Fig. 2. Histogram of ln(Final forecast/System forecast).

Table 2: Summary statistics for the magnitude of adjustment, ln(Final forecast/System forecast).

Sign of adjustment | 1st quartile | Median | 3rd quartile | Mean (2% trim) | exp[Mean (2% trim)]
Positive           |  0.123       |  0.273 |  0.592       |  0.412         | 1.510
Negative           | –0.339       | –0.153 | –0.071       | –0.290         | 0.749
Both               | –0.043       |  0.144 |  0.425       |  0.218         | 1.243


3. Appropriateness of existing measures for SKU-level demand data

3.1. Percentage errors

Let the forecasting error for a given time period $t$ and SKU $i$ be

$e_{i,t} = Y_{i,t} - F_{i,t}$,

where $Y_{i,t}$ is the demand value for SKU $i$ observed at time $t$, and $F_{i,t}$ is the forecast of $Y_{i,t}$.

A traditional way to compare the accuracy of forecasts across multiple time series is based on using absolute percentage errors (Hyndman & Koehler, 2006). Let us define the percentage error (PE) as $p_{i,t} = 100 \times e_{i,t}/Y_{i,t}$. Hence, the absolute percentage error (APE) is $|p_{i,t}|$. The most popular PE-based measures are MAPE and MdAPE, which are defined as follows:

MAPE = mean$(|p_{i,t}|)$,
MdAPE = median$(|p_{i,t}|)$,

where mean$(|p_{i,t}|)$ denotes the sample mean of $|p_{i,t}|$ over all available values of $i$ and $t$, and median$(|p_{i,t}|)$ is the sample median.
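For reference, a minimal sketch of how these PE-based measures are computed (the arrays of actuals and forecasts are hypothetical):

import numpy as np

def mape_mdape(actuals, forecasts):
    """Compute MAPE and MdAPE over all observations (in %).

    Observations with zero actuals are dropped, since the percentage
    error is undefined when the actual value equals zero.
    """
    y = np.asarray(actuals, dtype=float)
    f = np.asarray(forecasts, dtype=float)
    mask = y != 0                       # PE cannot be computed for zero actuals
    ape = 100 * np.abs((y[mask] - f[mask]) / y[mask])
    return ape.mean(), np.median(ape)

# Illustrative (made-up) numbers only:
mape, mdape = mape_mdape([12, 3, 50, 0, 8], [10, 9, 45, 2, 8])
print(round(mape, 1), round(mdape, 1))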

In the study by Fildes et al. (2009), these measures served as the main tool for the analysis of the accuracy of judgmental adjustments. In order to determine the change in forecasting accuracy, MAPE and MdAPE values of the statistical baseline forecasts and the final judgmentally adjusted forecasts were calculated and compared. The significance of the change in accuracy was assessed based on the distribution of the differences between the absolute percentage errors (APEs) of the forecasts. The difference between APEs is defined as

$d_{i,t}^{\mathrm{APE}} = |p_{i,t}^{\mathrm{f}}| - |p_{i,t}^{\mathrm{s}}|$,

where $|p_{i,t}^{\mathrm{f}}|$ and $|p_{i,t}^{\mathrm{s}}|$ denote the APEs of the final and system forecasts, respectively, for a given SKU $i$ and period $t$. Fildes et al. (2009) used Wilcoxon's two-sample paired signed rank test to compare the APEs of the final and system forecasts. This is equivalent to performing a one-sample signed rank test of the median of $d_{i,t}^{\mathrm{APE}}$ against zero.
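A sketch of this test using SciPy, assuming hypothetical paired arrays of APEs:

import numpy as np
from scipy.stats import wilcoxon

# Hypothetical APE values (in %) for the final and system forecasts,
# paired by SKU and period.
ape_final  = np.array([12.0, 35.0, 8.0, 150.0, 20.0, 5.0])
ape_system = np.array([15.0, 30.0, 9.0, 120.0, 25.0, 7.0])

d_ape = ape_final - ape_system           # differences d_APE
stat, p_value = wilcoxon(d_ape)          # signed rank test of median(d_APE) = 0
print(stat, p_value)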


The sample mean of $d_{i,t}^{\mathrm{APE}}$ is the difference between the MAPE values corresponding to the final and system forecasts:

mean$(d_{i,t}^{\mathrm{APE}})$ = mean$(|p_{i,t}^{\mathrm{f}}|)$ − mean$(|p_{i,t}^{\mathrm{s}}|)$ = MAPE$^{\mathrm{f}}$ − MAPE$^{\mathrm{s}}$.   (1)

Therefore, testing the median of $d_{i,t}^{\mathrm{APE}}$ against zero using the above tests establishes whether MAPE$^{\mathrm{f}}$ differs significantly from MAPE$^{\mathrm{s}}$ (provided that the distribution of $d_{i,t}^{\mathrm{APE}}$ is symmetric, which is one of the assumptions of these tests). In fact, the distribution of $d_{i,t}^{\mathrm{APE}}$ is inherently skewed, which complicates matters and may result in erroneous conclusions being drawn, but we will not focus on this particular problem here.

The results reported suggest that, overall, the value of MAPE was improved by the use of adjustments, but the accuracy of positive and negative adjustments differed substantially. Based on the MAPE measure, it was found that positive adjustments did not change the forecasting accuracy significantly, while negative adjustments led to significant improvements. However, percentage error measures have a number of disadvantages when applied to the adjustments data, as we explain below.

One well-known disadvantage of percentage errors is that when the actual value $Y_{i,t}$ in the denominator is relatively small compared to the forecast error $e_{i,t}$, the resulting percentage error $p_{i,t}$ becomes extremely large, which distorts the results of further analyses (Hyndman & Koehler, 2006). Such high values can be treated as outliers, since they often do not allow for a meaningful interpretation (large percentage errors are not necessarily harmful or damaging, as they can arise merely from relatively low actual values). However, identifying outliers in a skewed distribution is a non-trivial problem, where it is necessary to determine an appropriate trimming level in order to find robust estimates, while at the same time avoiding losing too much information. Usually authors choose the trimming level for the MAPE based on their experience after experimentation (for example, Fildes et al., 2009, used a 2% trim), but this decision still remains subjective. Moreover, the trimmed mean gives a biased estimate of location for highly skewed distributions (Marques et al., 2000), which complicates the interpretation of the trimmed MAPE. In particular, for a random variable that follows a highly skewed distribution, the expected value of the trimmed mean differs from the expected value of the random variable itself. This bias depends on both the trim level and the number of observations used to calculate the trimmed mean. Therefore, it is difficult to compare measurement results based on trimmed means for samples that contain different numbers of observations, even when the trim level remains the same.


SKU-level demand time series typically exhibit a high degree of variation in actual values, due to seasonal effects and the changing stages of a product's life cycle. Therefore, data on adjustments can contain a high proportion of low demand values, which makes PE-based measures particularly inadvisable in this context. Considering extremes, a common occurrence in the situation of intermittent demand is for many observations (and forecasts) to be zero (see the discussion by Syntetos & Boylan, 2005). All cases with zero actual values must be excluded from the analysis, since the percentage error cannot be computed when $Y_{i,t} = 0$, by definition.

The extreme percentage errors that can be obtained can be shown using scaled values of errors and actual demand values (Fig. 3). The variables shown were scaled by the standard deviation of actual values in each series in order to eliminate the differences between time series. It can be seen that the final forecast errors have a skewed distribution and are correlated with both the actual values and the signs of adjustments; it is also clear that a substantial number of the errors are comparable to the actual demand values. Excluding observations with relatively low values on the original scale (here, all observations less than 10 were excluded from the analysis, as was done by Fildes et al., 2009) still cannot improve the properties of percentage errors sufficiently, since a large number of observations still remain in the area where the actual demand value is less than the absolute error. This results in extremely high APEs (>100%), which are all too easy to misinterpret (since very large APEs do not necessarily correspond to very damaging errors, and arise primarily because of low actual demand values). In Fig. 3, the area below the dashed line shows cases in which the errors were higher than the actual demand values. These cases result in extreme percentage errors, as shown in Fig. 4. Due to the presence of extreme percentages, the distribution of APEs becomes highly skewed and heavy-tailed, which makes MAPE-based estimates highly unstable.

A widely used robust alternative to the MAPE is the MdAPE. However, the MdAPE is neither easily interpretable nor sufficiently indicative of changes in accuracy when forecasting methods have differently shaped error distributions. The sample median of the APEs is resistant to the influence of extreme cases, but is also insensitive to large errors, even if they are not outliers or extreme percentages. Comparing accuracy using the MdAPE shows the changes in accuracy that relate to the lowest 50% of APEs. However, an improvement in the MdAPE can be accompanied by more damaging errors remaining above the median if the shapes of the error distributions differ. In Section 5, it will be shown that, while the MdAPE indicates that judgmental adjustments improve the accuracy for the given dataset, the trimmed MAPE suggests the opposite to be the case. Moreover, the task of assessing the statistical significance of changes in the MdAPE can be problematic, due to the non-symmetric distributions of APEs. Therefore, additional indicators are required in order to be able to draw better-substantiated conclusions with regard to forecasting accuracy.

Fig. 3. Dependencies between forecast error, actual value, and the sign of adjustment (based on scaled data). Panels: (a) positive adjustments, (b) negative adjustments; axes: scaled forecast error (horizontal) against scaled actual demand value (vertical).

Fig. 4. Percentage errors, depending on the actual demand value and adjustment sign. Panels: (a) positive adjustments, (b) negative adjustments; axes: percentage error (horizontal) against scaled actual demand value (vertical).

Apart from the presence of extreme APEs, another problem with using PE-based measures is that they can bias the comparison in favour of methods that issue low forecasts (Armstrong, 1985; Armstrong & Collopy, 1992; Kolassa & Schutz, 2007). This happens because, under certain conditions, percentage errors put a heavier penalty on positive errors than on negative errors. In particular, this can be observed when the forecast is taken as fixed. To illustrate this phenomenon, Kolassa and Schutz (2007) provide the following example. Assume that we have a time series that contains values distributed uniformly between 10 and 50. If we are using a symmetrical loss function, the best forecast for this time series would be 30. However, a forecast of 22 produces better accuracy in terms of MAPE. As a result, if the aim is to choose a method that is better in terms of a linear loss, then the values of PE-based measures can be misleading. The way in which the use of MAPE can bias the comparison of the performances of judgmental adjustments of different signs is illustrated below.
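This example is easily verified by simulation; a minimal sketch (the uniform sample and the two constant forecasts are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(10, 50, size=100_000)    # demand uniform on [10, 50]

def mape(y, f):
    return np.mean(100 * np.abs(y - f) / y)

def mae(y, f):
    return np.mean(np.abs(y - f))

for f in (30, 22):
    print(f, round(mape(y, f), 1), round(mae(y, f), 1))
# The constant forecast 22 attains a lower MAPE than 30, although 30 is
# better under symmetric linear loss (it has the lower MAE).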

One important effect which arises from the presence of cognitive biases and the non-negative nature of demand values is that the most damaging positive adjustments (producing the largest absolute errors) typically correspond to relatively low actuals (left corner of Fig. 3(a)), while the worst negative adjustments (producing the largest absolute errors) correspond to higher actuals (centre section, Fig. 3(b)). More specifically, the following general dependency can be found within most time series. The difference between the absolute final forecast error $|e_{i,t}^{\mathrm{f}}|$ and the absolute statistical forecast error $|e_{i,t}^{\mathrm{s}}|$ is positively correlated with the actual value $Y_{i,t}$ for positive adjustments, while there is a negative correlation for negative adjustments. To reveal this effect, distribution-free measures of the association between variables were used. For each SKU $i$, Spearman's $\rho$ coefficients were calculated, representing the correlation between the improvement in terms of absolute errors ($|e_{i,t}^{\mathrm{f}}| - |e_{i,t}^{\mathrm{s}}|$) and the actual value $Y_{i,t}$. Fig. 5 shows the distributions of the coefficients $\rho_i^{+}$, calculated for positive adjustments, and $\rho_i^{-}$, corresponding to negative adjustments (the coefficients can take values 1 and −1 when only a few observations are present in a series). For the given dataset, mean$(\rho_i^{+}) \approx 0.47$ and mean$(\rho_i^{-}) \approx -0.44$, indicating that the improvement in forecasting is markedly correlated with the actual demand values. This illustrates the fact that positive adjustments are most effective for larger values of demand, and least effective (or even damaging) for smaller values of demand. Strictly, efficient averaging of correlation coefficients requires applying Fisher's z transformation and then back-transforming the result (see, e.g., Mudholkar, 1983), but here we used the raw coefficients because the aim was only to show that $\rho$ clearly correlates with the adjustment sign.
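A sketch of this per-series analysis (assuming a pandas DataFrame in long format with hypothetical column names sku, actual, abs_err_final and abs_err_system):

import numpy as np
from scipy.stats import spearmanr

def rho_by_series(df):
    """Spearman's rho between (|e_final| - |e_system|) and the actual demand,
    computed per SKU; the coefficients are then averaged via Fisher's z
    transformation (and, for comparison, as a raw mean)."""
    rhos = []
    for _, g in df.groupby("sku"):
        improvement = g["abs_err_final"] - g["abs_err_system"]
        rho, _ = spearmanr(improvement, g["actual"])
        if not np.isnan(rho):
            rhos.append(rho)
    rhos = np.asarray(rhos)
    z = np.arctanh(np.clip(rhos, -0.999999, 0.999999))  # Fisher z transform
    return np.tanh(z.mean()), rhos.mean()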

Fig. 5. Spearman's $\rho$ coefficients showing the correlation between the improvement in accuracy and the actual demand value for each time series (relative frequency histograms). Panels: (a) positive adjustments ($\rho_i^{+}$), (b) negative adjustments ($\rho_i^{-}$).

Because of the division by a scale factor that is correlated with the numerator, the difference of APEs (which is calculated as $d_{i,t}^{\mathrm{APE}} = 100 \times (|e_{i,t}^{\mathrm{f}}| - |e_{i,t}^{\mathrm{s}}|)/Y_{i,t}$) will not reflect changes in forecasting accuracy in terms of a symmetric loss function. More specifically, for positive adjustments, $d_{i,t}^{\mathrm{APE}}$ will systematically downgrade improvements in accuracy and exaggerate degradations of accuracy (on the percentage scale). In contrast, for negative adjustments, the improvements will be exaggerated, while the errors from harmful forecasts will receive smaller weights. Since the difference in MAPEs is calculated as the sample mean of $d_{i,t}^{\mathrm{APE}}$ (in accordance with equation (1)), the comparison of forecasts using MAPE will also give a result which is biased towards underrating positive adjustments and overrating negative adjustments. Consequently, since the forecast errors arising from adjustments of different signs are penalised differently, the MAPE measure is flawed when comparing the performances of adjustments of different signs. One of the aims of the present research has therefore been to reinterpret the results of previous studies through the use of alternative measures.

A second measure based on percentage errors was also used by Franses and Legerstee (2010). In order to evaluate the accuracy of improvements, the RMSPE (root mean square percentage error) was calculated for the statistical and judgmentally adjusted forecasts, and the resulting values were then compared. Based on this measure, it was concluded that the expert-adjusted forecasts were no better than the model forecasts. However, the RMSPE is also based on percentage errors, and is affected by the outliers and biases described above even more strongly.


3.2. Relative errors

Another approach to obtaining scale-independent measures is based on using relative errors. The relative error (RE) is defined as

$RE_{i,t} = e_{i,t}/e_{i,t}^{\mathrm{b}}$,

where $e_{i,t}^{\mathrm{b}}$ is the forecast error obtained from a benchmark method. Usually a naïve forecast is taken as the benchmark method.

Well-known measures based on relative errors include the Mean Relative Absolute Error (MRAE), the Median Relative Absolute Error (MdRAE), and the Geometric Mean Relative Absolute Error (GMRAE):

MRAE = mean$(|RE_{i,t}|)$,
MdRAE = median$(|RE_{i,t}|)$,
GMRAE = gmean$(|RE_{i,t}|)$,

where mean, median, and gmean respectively denote the sample mean, sample median, and sample geometric mean over all available values of $i$ and $t$.
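A minimal sketch of how these RE-based measures can be computed from paired forecast errors (hypothetical inputs; pairs with undefined ratios are dropped):

import numpy as np

def relative_error_measures(err, err_bench):
    """MRAE, MdRAE and GMRAE from paired forecast errors.

    Pairs with a zero benchmark error (and, for the geometric mean, a zero
    error of the evaluated method) are dropped, since the ratio or its log
    is undefined in those cases.
    """
    e = np.asarray(err, dtype=float)
    b = np.asarray(err_bench, dtype=float)
    keep = (b != 0) & (e != 0)
    are = np.abs(e[keep] / b[keep])           # absolute relative errors |RE|
    mrae = are.mean()
    mdrae = np.median(are)
    gmrae = np.exp(np.mean(np.log(are)))      # geometric mean
    return mrae, mdrae, gmrae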

Averaging the ratios of absolute errors across individual observations overcomes the problems related to dividing by actual values. In particular, RE-based measures are not affected by the presence of low actual values, or by the correlation between errors and actual outcomes. However, REs also have a number of limitations.

The calculation of $RE_{i,t}$ requires division by the non-zero error of the benchmark forecast, $e_{i,t}^{\mathrm{b}}$. In the case of calculating the GMRAE, it is also required that $e_{i,t} \neq 0$. The actual and forecasted demands are usually count data, which means that the forecasting errors are count data as well. With count data, the probability of a zero error of the benchmark forecast can be non-zero. Such cases must be excluded from the analysis when using relative errors. With intermittent demand data, the use of relative errors becomes impossible due to the frequent occurrences of zero errors (Hyndman, 2006; Syntetos & Boylan, 2005).

As was pointed out by Hyndman and Koehler (2006), in the case of continuous distributions, the benchmark forecast error $e_{i,t}^{\mathrm{b}}$ can have a positive probability density at zero, and therefore the use of the MRAE can be problematic. In particular, $RE_{i,t}$ can follow a heavy-tailed distribution for which the sample mean becomes a highly inefficient estimate that is vulnerable to outliers. In addition, the distribution of $|RE_{i,t}|$ is highly skewed. At the same time, while the MdRAE is highly robust, it cannot be sufficiently informative, as it is insensitive to large REs which lie in the tails of the distribution. Thus, even if the large REs are not outliers which arise from the division by relatively small benchmark errors, they still will not be taken into account when using the MdRAE. Averaging the absolute REs using the GMRAE is preferable to using either the MRAE or the MdRAE, as it provides a reliable and robust estimate and at the same time takes into account the values of REs which lie in the tails of the distribution. Also, when averaging the benchmark ratios, the geometric mean has the advantage that it produces rankings which are invariant to the choice of the benchmark (see Fleming & Wallace, 1986).

Fildes (1992) recommends the use of the Relative Geometric Root Mean Square Error (RelGRMSE). The RelGRMSE for a particular time series $i$ is defined as

$\mathrm{RelGRMSE}_i = \left( \frac{\prod_{t \in T_i} e_{i,t}^2}{\prod_{t \in T_i} (e_{i,t}^{\mathrm{b}})^2} \right)^{\frac{1}{2 n_i}}$,

where $T_i$ is a set containing the time periods for which non-zero errors $e_{i,t}$ and $e_{i,t}^{\mathrm{b}}$ are available, and $n_i$ is the number of elements in $T_i$.

After obtaining the RelGRMSE for each series, Fildes (1992) recommends finding the geometric mean of the RelGRMSEs across all time series, thus obtaining gmean$(\mathrm{RelGRMSE}_i)$. As Hyndman (2006) pointed out, the Geometric Root Mean Square Error (GRMSE) and the Geometric Mean Absolute Error (GMAE) are identical because the square roots cancel each other in a geometric mean. Similarly, it can be shown that

gmean$(\mathrm{RelGRMSE}_i)$ = GMRAE.

An alternative representation of the GMRAE is

$\mathrm{GMRAE} = \exp\left( \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} \sum_{t \in T_i} \ln|RE_{i,t}| \right)$,

where $m$ is the total number of time series, and the other variables retain their previous meanings.

For the adjustments data set under consideration, only a small proportion of observations contain zero errors (about 1%). It has been found empirically that, for the given data set, the log-transformed absolute REs, $\ln|RE_{i,t}|$, can be approximated adequately using a distribution which has a finite variance. In fact, even if a heavy-tailed distribution of $\ln|RE_{i,t}|$ arises, the influence of extreme cases can be eliminated based on various robustifying schemes such as trimming or Winsorizing. In contrast to APEs, the use of such schemes for $\ln|RE_{i,t}|$ is unlikely to lead to biased estimates, since the distribution of $\ln|RE_{i,t}|$ is not highly skewed.

Though the GMRAE (or, equivalently, gmean$(\mathrm{RelGRMSE}_i)$) has some desirable statistical properties and can give a reliable aggregated indication of changes in accuracy, its use can be complicated for the following two reasons. Firstly, as was mentioned previously, zero-error forecasts cannot be taken into account directly. Secondly, in a similar way to the median, the geometric mean of absolute errors generally does not reflect changes in accuracy under standard loss functions. For instance, for a particular time series, the GMAE (and, hence, the GMRAE) favours methods which produce errors with heavier-tailed distributions, while for the same series the RMSE (root mean square error) can suggest the opposite ranking.

The latter aspect of using the GMRAE can be illustrated using the following example. Suppose that, for a particular time series, method A produces errors $e_t^{\mathrm{A}}$ that are independent and identically distributed variables following a heavy-tailed distribution. More specifically, let $e_t^{\mathrm{A}}$ follow the t-distribution with $\nu = 3$ degrees of freedom: $e_t^{\mathrm{A}} \sim t_{\nu}$. Also, let method B produce independent errors that follow the normal distribution: $e_t^{\mathrm{B}} \sim N(0, 3)$. Let method B be the benchmark method. It can be shown analytically that the variances of $e_t^{\mathrm{A}}$ and $e_t^{\mathrm{B}}$ are equal: $\mathrm{Var}(e_t^{\mathrm{A}}) = \mathrm{Var}(e_t^{\mathrm{B}}) = 3$. Thus, the relative RMSE (RelRMSE, the ratio of the two RMSEs) for this series is 1. However, the Relative Geometric RMSE (or GMRAE) will show that method A is better than method B: GMRAE ≈ 0.69 (based on $10^6$ simulated pairs of $e_t^{\mathrm{A}}$ and $e_t^{\mathrm{B}}$). Now if, for example, $e_t^{\mathrm{B}} \sim N(0, 2.5)$, then the RelRMSE and GMRAE will be 1.10 and 0.76, respectively. This means that method B is now preferable in terms of the variance of errors, while method A is still (substantially) better in terms of the GMRAE. However, the geometric mean absolute error is rarely used when optimising predictions with the use of mathematical models. Some authors claim that a comparison based on the RelRMSE can be more desirable, as in this case the criterion used for the optimisation of predictions corresponds to the evaluation criterion (Zellner, 1986; Diebold, 1993).
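The simulation behind these figures can be reproduced with a short sketch (interpreting $N(0, 3)$ as a normal distribution with variance 3, so that the variances match):

import numpy as np

rng = np.random.default_rng(1)
n = 10**6

e_a = rng.standard_t(df=3, size=n)          # method A: t(3), variance 3
e_b = rng.normal(0, np.sqrt(3), size=n)     # method B: N(0, 3), i.e. variance 3

rel_rmse = np.sqrt(np.mean(e_a**2) / np.mean(e_b**2))
gmrae = np.exp(np.mean(np.log(np.abs(e_a) / np.abs(e_b))))
print(round(rel_rmse, 2), round(gmrae, 2))  # roughly 1.00 and 0.69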

The above example demonstrates that, even for a single time series, a statistically significant improvement in the GMRAE is not equivalent to a statistically significant improvement in terms of the RMSE. Analogously, it can be demonstrated that the GMRAE is not indicative of changes in terms of the MAE.

Thus, analogously to what was said with regard to PE-based measures, if the aim of the comparison is to choose a method that is better in terms of a linear or a quadratic loss, then the GMRAE may not be sufficiently informative, or may even lead to counterintuitive conclusions.


3.3. Percent Better

A simple approach to comparing the forecasting accuracy of methods A and B is to calculate the percentage of cases in which method A was closer to the actual observation than method B. This measure is known as 'Percent Better' (abbreviated below as PB) and has been recommended by some authors as a fairly good indicator (see, e.g., Armstrong & Collopy, 1992; Chatfield, 2001). It has the advantage of being immune to outliers and is scale-independent (it can therefore be used to assess accuracy across series). In addition, it can be used for qualitative forecasts (although we do not consider this kind of forecast in this paper). Although the measure seems to be easy to interpret, the following important limitations should be taken into account.
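For reference, a minimal sketch of how PB is typically computed from paired absolute errors (ties are reported separately, which matters for the intermittent-demand case discussed below):

import numpy as np

def percent_better(abs_err_a, abs_err_b):
    """'Percent Better' of method A over B: share of cases (in %) where A's
    absolute error is strictly smaller. The share of ties is returned as
    well, since frequent ties make the PB value hard to read."""
    a = np.asarray(abs_err_a, dtype=float)
    b = np.asarray(abs_err_b, dtype=float)
    pb = 100 * np.mean(a < b)
    ties = 100 * np.mean(a == b)
    return pb, ties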

One problem with PB is that it does not show the magnitude of the changes in accuracy (Hyndman & Koehler, 2006). Therefore, it becomes hard to assess the consequences of using one method instead of another. Moreover, as was the case for the GMRAE, it can be shown that if the shapes of the error distributions differ between methods, PB becomes non-indicative of changes in terms of a linear or quadratic loss, even for a single series.

Another problem arises when methods A and B frequently produce equal forecasts (e.g., this happens with intermittent demand data). In such situations, obtaining a PB value that is lower than 50% is not necessarily a bad result, but without additional information we cannot draw any conclusions about the changes in accuracy. Suppose the absolute errors of methods A and B can be approximated using the Poisson distribution: $|e_t^{\mathrm{A}}| \sim \mathrm{Pois}(\lambda = 1)$ and $|e_t^{\mathrm{B}}| \sim \mathrm{Pois}(\lambda = 3)$. Method A is much better than method B in terms of MAE: $E[|e_t^{\mathrm{A}}|]/E[|e_t^{\mathrm{B}}|] = 1/3$, but $P(|e_t^{\mathrm{A}}| < |e_t^{\mathrm{B}}|) \approx 0.077$. Thus, the PB is only about 7.7%, a figure that can be misleading. For this example, even looking at 'Percent Worse' and relating it to the PB will not give an informative and easily interpretable indication of accuracy.

Thus, in spite of its apparent simplicity, the PB measure is often confusing and does not necessarily show changes in accuracy under linear loss. Moreover, it is not representative of the magnitude of the changes, and therefore it cannot ensure a complete and reliable analysis of accuracy.

3.4. Scaled errors

In order to overcome the imperfections of PE-based measures, Hyndman and Koehler (2006) proposed the use of the MASE (mean absolute scaled error). For the scenario where forecasts are produced from varying origins but with a constant horizon, the MASE is calculated as follows (see Appendix A):

$q_{i,t} = \frac{e_{i,t}}{\mathrm{MAE}_i^{\mathrm{b}}}$,   MASE = mean$(|q_{i,t}|)$,

where $q_{i,t}$ is the scaled error and $\mathrm{MAE}_i^{\mathrm{b}}$ is the mean absolute error (MAE) of the naïve (benchmark) forecast for series $i$.

Though this was not specified by Hyndman and Koehler (2006), it is possible to show (see Appendix A) that, in the given scenario, the MASE is equivalent to the weighted arithmetic mean of relative MAEs, where the number of available values of $e_{i,t}$ is used as the weight:

$\mathrm{MASE} = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i r_i, \quad r_i = \frac{\mathrm{MAE}_i}{\mathrm{MAE}_i^{\mathrm{b}}}$,   (2)

where $m$ is the total number of series, $n_i$ is the number of available values of $e_{i,t}$ for series $i$, $\mathrm{MAE}_i^{\mathrm{b}}$ is the MAE of the benchmark forecast for series $i$, and $\mathrm{MAE}_i$ is the MAE of the forecast being evaluated against the benchmark.

It is known that the arithmetic mean is not strictly appropriate for averaging observations representing relative quantities, and that in such situations the geometric mean should be used instead (Spizman & Weinstein, 2008). As a result of using the arithmetic mean of MAE ratios, equation (2) introduces a bias towards overrating the accuracy of the benchmark forecasting method. In other words, the penalty for bad forecasting becomes larger than the reward for good forecasting.

To show how the MASE rewards and penalises forecasts, it can be represented as

$\mathrm{MASE} = 1 + \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i (r_i - 1)$.

The reward for improving the benchmark MAE from $A$ to $B$ ($A > B$) in a series $i$ is $R_i = n_i(1 - B/A)$, while the penalty for harming the MAE by changing it from $B$ to $A$ is $P_i = n_i(A/B - 1)$. Since $R_i < P_i$, the reward given for improving the benchmark MAE cannot balance the penalty given for degrading the benchmark MAE by the same quantity. As a result, obtaining MASE > 1 does not necessarily indicate that the accuracy of the benchmark method was better on average. This leads to ambiguity in the comparison of the accuracy of forecasts.

For example, suppose that the performance of some forecasting method is compared with the performance of the naïve method across two series ($m = 2$) which contain equal numbers of forecasts and observations. For the first series, the MAE ratio is $r_1 = 1/2$, and for the second series the MAE ratio is the reciprocal: $r_2 = 2/1$. The improvement in accuracy obtained by the forecasting method for the first series is the same as the degradation for the second series. However, averaging the ratios gives MASE = $\frac{1}{2}(r_1 + r_2) = 1.25$, which indicates that the benchmark method is better. While this is a well-known point, its implications for error measures, with the potential for misleading conclusions, are widely ignored.
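A two-line computation makes the contrast explicit: the arithmetic mean of the two reciprocal ratios favours the benchmark, while the geometric mean treats the reciprocal changes even-handedly:

import numpy as np

# Two series with equal numbers of forecasts; the MAE ratios are reciprocal.
r = np.array([0.5, 2.0])              # r_i = MAE_i / MAE_i^b

mase_like = r.mean()                   # arithmetic mean of ratios -> 1.25 (benchmark "wins")
geo_mean = np.exp(np.log(r).mean())    # geometric mean of ratios  -> 1.0 (methods tie)
print(mase_like, geo_mean)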

In addition to the above effect, the use of the MASE (as with the MAPE) may result in unstable estimates, as the arithmetic mean is severely influenced by extreme cases which arise from dividing by relatively small values. In this case, outliers occur when dividing by the relatively small MAEs of the benchmark forecast, which can appear in short series.

Some authors (e.g., Hoover, 2006) recommend the use of the MAD/MEAN ratio. In contrast to the MASE, the MAD/MEAN approach scales the forecasting errors by the mean of the time series elements, instead of by the in-sample MAE of the naïve forecast. The advantage of this scheme is that it reduces the risk of dividing by a small denominator (see Kolassa & Schutz, 2007). However, Hyndman (2006) notes that the MAD/MEAN ratio assumes that the mean is stable over time, which may make it unreliable when the data exhibit trends or seasonal patterns. In Section 5, we show that both the MASE and the MAD/MEAN ratio are prone to outliers for the data set considered in this paper. Generally, the use of these schemes carries the risk of producing unreliable estimates that are based on highly skewed, left-bounded distributions.

Thus, while the use of the standard MAPE has long been known to be flawed, the newly proposed MASE also suffers from some of the same limitations, and may likewise lead to an unreliable interpretation of the empirical results. We therefore need a measure that does not suffer from these problems. The next section presents an improved statistic which is more suitable for comparing the accuracies of SKU-level forecasts.

4. Recommended accuracy evaluation scheme

4.1. Measuring the accuracy of judgmental adjustments

The recommended forecast evaluation scheme is based on averaging the relative efficiencies of adjustments across time series. The geometric mean is the correct average to use for averaging benchmark ratio results, since it gives equal weight to reciprocal relative changes (Fleming & Wallace, 1986). Using the geometric mean of MAE ratios, it is possible to define an appropriate measure of the average relative MAE (AvgRelMAE). If the baseline statistical forecast is taken as the benchmark, then the AvgRelMAE showing how the judgmentally adjusted forecasts improve/reduce the accuracy can be found as

$\mathrm{AvgRelMAE} = \left( \prod_{i=1}^{m} r_i^{n_i} \right)^{1/\sum_{i=1}^{m} n_i}, \quad r_i = \frac{\mathrm{MAE}_i^{\mathrm{f}}}{\mathrm{MAE}_i^{\mathrm{s}}}$,   (3)

where $\mathrm{MAE}_i^{\mathrm{s}}$ is the MAE of the baseline statistical forecast for series $i$, $\mathrm{MAE}_i^{\mathrm{f}}$ is the MAE of the judgmentally adjusted forecast for series $i$, $n_i$ is the number of available errors of judgmentally adjusted forecasts for series $i$, and $m$ is the total number of time series. This differs from the proposals of Fildes (1992), who examined the behaviour of the GRMSEs of the individual relative errors.

The MAEs in equation (3) are found as

$\mathrm{MAE}_i^{\mathrm{f}} = \frac{1}{n_i} \sum_{t \in T_i} |e_{i,t}^{\mathrm{f}}|, \quad \mathrm{MAE}_i^{\mathrm{s}} = \frac{1}{n_i} \sum_{t \in T_i} |e_{i,t}^{\mathrm{s}}|$,

where $e_{i,t}^{\mathrm{f}}$ is the error of the judgmentally adjusted forecast for period $t$ and series $i$, $T_i$ is a set containing the time periods for which $e_{i,t}^{\mathrm{f}}$ are available, and $e_{i,t}^{\mathrm{s}}$ is the error of the baseline statistical forecast for period $t$ and series $i$.

The AvgRelMAE is immediately interpretable, as it represents the average relative value of the MAE adequately and directly shows how the adjustments improve/reduce the MAE compared to the baseline statistical forecast. Obtaining AvgRelMAE < 1 means that, on average, $\mathrm{MAE}_i^{\mathrm{f}} < \mathrm{MAE}_i^{\mathrm{s}}$, and therefore adjustments improve the accuracy, while AvgRelMAE > 1 indicates the opposite. The average percentage improvement in the MAE of forecasts is found as $(1 - \mathrm{AvgRelMAE}) \times 100$. If required, equation (3) can also be extended to other measures of dispersion or loss functions. For example, instead of the MAE one might use the MSE (mean square error), the interquartile range, or the mean prediction interval length. The choice of the measure depends on the purposes of the analysis. In this study, we use the MAE, assuming that the penalty is proportional to the absolute error.

Equivalently, the geometric mean of MAE ratios can be found as

$\mathrm{AvgRelMAE} = \exp\left( \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i \ln r_i \right)$.

Therefore, obtaining $\sum_{i=1}^{m} n_i \ln r_i < 0$ means an average improvement in accuracy, and $\sum_{i=1}^{m} n_i \ln r_i > 0$ means the opposite.
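A minimal computational sketch of the AvgRelMAE (assuming hypothetical dictionaries mapping each series to paired arrays of final- and system-forecast errors):

import numpy as np

def avg_rel_mae(errors_final, errors_system):
    """AvgRelMAE: weighted geometric mean of per-series MAE ratios.

    `errors_final` and `errors_system` map a series id to paired arrays of
    forecast errors for the adjusted and baseline statistical forecasts
    (hypothetical structure, for illustration only).
    """
    total_n, weighted_log = 0, 0.0
    for sku, e_f in errors_final.items():
        e_s = errors_system[sku]
        n_i = len(e_f)
        mae_f = np.mean(np.abs(e_f))
        mae_s = np.mean(np.abs(e_s))
        if mae_f == 0 or mae_s == 0:
            continue                               # zero MAEs need special handling
        weighted_log += n_i * np.log(mae_f / mae_s)  # n_i * ln(r_i)
        total_n += n_i
    return np.exp(weighted_log / total_n)

# (1 - avg_rel_mae(...)) * 100 gives the average % improvement in MAE.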

In theory, the following effect may complicate the interpretation of the AvgRelMAE value. If the distributions of the errors $e_{i,t}^{\mathrm{f}}$ and $e_{i,t}^{\mathrm{s}}$ within a given series $i$ have different levels of kurtosis, then $\ln r_i$ is a biased estimate of $\ln(E|e_{i,t}^{\mathrm{f}}|/E|e_{i,t}^{\mathrm{s}}|)$. Thus, the indication of an improvement under linear loss given by the AvgRelMAE may be biased. In fact, if $n_i = 1$ for each $i$, then the AvgRelMAE becomes equivalent to the GMRAE, which has the limitations described in Section 3.2. However, our experiments have shown that the bias of $\ln r_i$ diminishes rapidly as $n_i$ increases, becoming negligible for $n_i > 4$.

To eliminate the influence of outliers and extreme cases, the trimmed mean can be used in order to define a measure of location for the relative MAE. The trimmed AvgRelMAE for a given threshold $t$ ($0 \le t \le 0.5$) is calculated by excluding the $[tm]$ lowest and $[tm]$ highest values of $n_i \ln r_i$ from the calculations (square brackets indicate the integer part of $tm$). As was mentioned in Section 2, the optimal trim level depends on the distribution. In practice, the choice of the trim level usually remains subjective, since the distribution is unknown. Wilcox (1996) wrote that 'Currently there is no way of being certain how much trimming should be done in a given situation, but the important point is that some trimming often gives substantially better results, compared to no trimming' (p. 16). Our experiments show that a 5% level can be recommended for the AvgRelMAE measure. This level ensures high efficiency, because the underlying distribution usually does not exhibit very large departures from the normal distribution. A manual screening for outliers could also be performed in order to exclude time series with non-typical properties from the analysis.

The results described in the next section show that the robust estimates obtained using a 5% trimming level are very close to the estimates based on the whole sample. The distribution of $n_i \ln r_i$ is more symmetrical than the distribution of either the APEs or the absolute scaled errors. Therefore, the analysis of outliers in relative MAEs can be performed more efficiently than the analysis of outliers when using the measures considered previously. In addition, we can assess the statistical significance of changes in accuracy by testing the mean of $n_i \ln r_i$ against zero.

Since the AvgRelMAE does not require scaling by actual values, it can be used in cases of low or zero actuals, as well as in cases of zero forecasting errors. Consequently, it is suitable for intermittent demand forecasts. The only limitation is that the MAEs in equation (3) should be greater than zero for all series. If zero MAEs do occur, they can be handled by the procedure that we describe below.

Thus, the advantages of the recommended accuracy evaluation scheme are that it (i) can be interpreted easily, (ii) represents the performance of the adjustments objectively (without the introduction of substantial biases or outliers), (iii) is informative and uses available information efficiently, (iv) is applicable in a wide range of settings, with minimal assumptions about the features of the data, and (v) gives rankings and indicates relative improvements that are invariant to the choice of the benchmark. Importantly, the last property can be ensured only through the use of the geometric mean. If we used a sample median or sample mean instead, this could lead to different rankings depending on the choice of the benchmark.

4.2. Generalized scheme for measuring the accuracy of point forecasts

In general, in order to ensure a reliable evaluation of forecasting accuracy under a symmetric linear loss, we recommend using the following scheme. Suppose we want to measure the accuracy of $h$-step-ahead forecasts produced with some forecasting method A across $m$ time series. Firstly, we need to select a benchmark method; this, in particular, can be the naïve method. Let $n_i$ denote the number of periods for which both the $h$-step-ahead forecasts and the actual observations are available for series $i$. Then the accuracy measurement procedure is as follows:

1. For each $i$ in $1, \dots, m$:

   a. Calculate the relative MAE as $r_i = \mathrm{MAE}_i^A / \mathrm{MAE}_i^B$, where $\mathrm{MAE}_i^A$ and $\mathrm{MAE}_i^B$ denote the out-of-sample $h$-step-ahead MAEs for method A and for the benchmark, respectively.

   b. Calculate the weighted log relative MAE as $l_i = n_i \ln r_i$.

2. Calculate the Average Relative MAE as

   \mathrm{AvgRelMAE} = \exp\!\left(\frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} l_i\right).

If there is evidence of a non-normal distribution of $l_i$, use the following procedure to obtain more efficient estimates:

   a. Find the indices of $l_i$ that correspond to the 5% largest and 5% lowest values. Let $R$ be the set containing the remaining indices.

   b. Calculate the trimmed version of the AvgRelMAE:

      \mathrm{AvgRelMAE}_{\mathrm{trimmed}} = \exp\!\left(\frac{1}{\sum_{i \in R} n_i} \sum_{i \in R} l_i\right).

3. Assess the statistical significance of the changes by testing the mean of $l_i$ against zero. For this purpose, Wilcoxon's one-sample signed rank test can be used (assuming that the distribution of $l_i$ is symmetric, but not necessarily normal). If the distribution of $l_i$ is non-symmetric, the binomial test can be used to test the median of $l_i$ against zero. If the distribution has a negative skew, then a negative median is likely to indicate a negative mean as well.
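A sketch of this procedure in Python is given below. It is our own illustration under the assumption that the per-series out-of-sample MAEs and series lengths have already been computed; the function name avg_rel_mae_scheme and the simulated inputs are hypothetical, and the significance test uses scipy.stats.wilcoxon.

```python
# Sketch of the accuracy measurement scheme above (our illustration), including
# the optional 5% trimming and the Wilcoxon one-sample signed rank test.
import numpy as np
from scipy.stats import wilcoxon

def avg_rel_mae_scheme(mae_a, mae_b, n, trim=0.05):
    """mae_a, mae_b: per-series MAEs of method A and of the benchmark B;
    n: number of forecast/actual pairs per series; trim: share trimmed per tail."""
    mae_a, mae_b, n = map(np.asarray, (mae_a, mae_b, n))
    l = n * np.log(mae_a / mae_b)                      # l_i = n_i ln r_i

    # Step 2: weighted geometric mean of the relative MAEs
    avg_rel_mae = np.exp(l.sum() / n.sum())

    # Trimmed version: drop the [t*m] smallest and [t*m] largest l_i
    k = int(trim * len(l))
    order = np.argsort(l)
    keep = order[k:len(l) - k] if k > 0 else order
    trimmed = np.exp(l[keep].sum() / n[keep].sum())

    # Step 3: test the mean of l_i against zero (two-sided Wilcoxon test)
    p_value = wilcoxon(l).pvalue

    return avg_rel_mae, trimmed, p_value

# Hypothetical inputs: 100 series in which method A is roughly 10% better
rng = np.random.default_rng(2)
mae_b = rng.uniform(5, 50, 100)
mae_a = mae_b * rng.lognormal(mean=np.log(0.9), sigma=0.2, size=100)
n = rng.integers(6, 24, 100)
print(avg_rel_mae_scheme(mae_a, mae_b, n))
```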

Notes: (a) For low-volume data it can be the case that $\mathrm{MAE}_i^A = 0$ or $\mathrm{MAE}_i^B = 0$ (or both). Essentially, the MAE represents our estimate of the expected value of the absolute error. But our prior knowledge suggests that the expected value of the absolute error is larger than zero, because for any forecasting task we assume that some level of uncertainty is present. Therefore, obtaining a zero MAE is an inadequate result, and we may use some sufficiently small number instead (say MAE = 0.001). The extreme $r_i$ values corresponding to such cases should then be excluded from the analysis in step 2 by setting a sufficiently large trim level. If the frequency of obtaining zero MAEs is too high (say, larger than 30%), a reliable estimate of the average relative MAE becomes unavailable, and we then have to resort to simply estimating the success rate for the MAE improvement. This can be done by calculating the number of cases where $\mathrm{MAE}_i^A < \mathrm{MAE}_i^B$, $i = 1, \dots, m$, and then dividing this number by the total number of time series, $m$. Importantly, as mentioned in Section 3.3, obtaining a success rate that is statistically lower than 0.5 does not necessarily indicate that method A is worse than method B for count data (because of the possibility of equal MAEs); therefore, the sum of ranks should be reported as well. It is also important to keep in mind that neither the success rate nor the sum of ranks will be indicative of improvements under linear loss if the sampling distribution of $l_i$ is heavily skewed.
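A short sketch of this fallback, using hypothetical per-series MAE arrays, might look as follows (the value 0.001 mirrors the suggestion above; everything else is an assumption for illustration).

```python
# Sketch of note (a): replace zero MAEs by a small constant and, if zero MAEs
# are too frequent, fall back to the success rate for the MAE improvement.
import numpy as np

def success_rate(mae_a, mae_b, eps=0.001):
    mae_a = np.asarray(mae_a, dtype=float)
    mae_b = np.asarray(mae_b, dtype=float)
    zero_share = np.mean((mae_a == 0) | (mae_b == 0))  # share of series with a zero MAE
    mae_a = np.where(mae_a == 0, eps, mae_a)
    mae_b = np.where(mae_b == 0, eps, mae_b)
    wins = np.sum(mae_a < mae_b)                       # series where method A improves the MAE
    return wins / len(mae_a), zero_share               # prefer the rate alone if zero_share > 0.3
```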

(b) If the distribution of absolute errors is heavily skewed, the MAE, like any sample mean, becomes a very inefficient estimate of the expected value of the absolute error. One simple method to improve the efficiency of the estimates while not introducing substantial bias is to use asymmetric trimming algorithms, such as those described by Alkhazaleh and Razali (2010). However, further discussion of this topic is outside the scope of our paper.

(c) If a suitable benchmark method is unavailable, we can use the sample mean of the time series values instead of the benchmark MAE. The procedure then becomes similar to the MAD/MEAN ratio approach described in Section 3.4, but here the use of the geometric mean (i) ensures the correct averaging of ratios (i.e., deviations from the mean will be treated symmetrically) and (ii) gives more robust measurement results in cases when the mean time series values are relatively small compared to the absolute forecasting errors.
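A sketch of this variant is shown below. It is our own illustration; in particular, weighting each ratio by the number of forecast errors $n_i$ is an assumption carried over from the AvgRelMAE rather than a prescription from the text.

```python
# Sketch of note (c): scale each series' MAE by the mean of its actual values
# and aggregate the ratios with a weighted geometric mean.
import numpy as np

def geometric_mad_mean_ratio(mae, series_mean, n):
    """mae: out-of-sample MAE per series; series_mean: mean of the actuals per
    series; n: number of forecast errors per series (used here as weights)."""
    mae, series_mean, n = (np.asarray(x, dtype=float) for x in (mae, series_mean, n))
    r = mae / series_mean                              # MAD/MEAN-type ratios
    return np.exp(np.sum(n * np.log(r)) / np.sum(n))
```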

(d) In step 2, the optimal trim level depends on the shape of the distribution of $l_i$. Our experiments suggest that, for the distributions that are likely to be obtained, the efficiency of the trimmed mean is not highly sensitive to the choice of the trim level, and any value between 2% and 10% gives reasonably good results. Generally, as was shown by Andrews et al. (1972), when the underlying distribution is symmetrical and heavy-tailed relative to the Gaussian, the variance of the trimmed mean is considerably smaller than the variance of the sample mean. Therefore, the use of trimmed means for symmetrical distributions can be highly recommended.

5. Results of empirical evaluation

The results of applying the measures described above are shown in Table 3. For the given dataset, a large number of APEs have extreme values (>100%), which arise from low actual demand values (Fig. 6). Following Fildes et al. (2009), we used a 2% trim level for the MAPE values. However, as noted, it is difficult to determine an appropriate trim level. As a result, the difference in APEs between the system and final forecasts has a very high dispersion and cannot be used efficiently to assess improvements in accuracy. It can also be seen that the distribution of APEs is highly skewed, which means that the trimmed means cannot be considered as unbiased estimates of the location. Although the distribution of the APEs has a very high kurtosis, our experiments show that increasing the trim level (say, from 2% to 5%) would substantially bias the estimates of the location of the APEs, due to the extremely high skewness of the distribution. We therefore use the 2% trimmed MAPE in this study. Also, the use of this trim level makes the measurement results comparable to those of Fildes et al. (2009).


Table 3: Accuracy of adjustments according to different error measures

                                   Positive adjustments     Negative adjustments     All nonzero adjustments
Error measure                      Statistical  Adjusted    Statistical  Adjusted    Statistical  Adjusted
                                   forecast     forecast    forecast     forecast    forecast     forecast
-------------------------------------------------------------------------------------------------------------
MAPE, % (untrimmed)                38.85        61.54       70.45        45.13       47.88        56.85
MAPE, % (2% trimmed)               30.98        40.56       48.71        30.12       34.51        37.22
MdAPE, %                           25.48        20.65       23.90        17.27       24.98        19.98
GMRAE                              1.00         0.93        1.00         0.70        1.00         0.86
GMRAE (5% trimmed)                 1.00         0.94        1.00         0.71        1.00         0.87
MASE                               0.97         0.97        0.95         0.70        0.96         0.90
Mean (MAD/Mean)                    0.37         0.42        0.33         0.24        0.36         0.37
Mean (MAD/Mean) (5% trimmed)       0.34         0.35        0.29         0.21        0.33         0.31
AvgRelMAE                          1.00         0.96        1.00         0.71        1.00         0.90
AvgRelMAE (5% trimmed)             1.00         0.96        1.00         0.73        1.00         0.89
Avg. improvement based on
AvgRelMAE                          0.00         0.04        0.00         0.29        0.00         0.10

Fig. 6. Box-and-whisker plot for absolute percentage errors (log scale, zero-error forecasts excluded).

Table 3 shows that the rankings based on the trimmed MAPE and MdAPE differ, suggesting different conclusions about the effectiveness of adjustments. As was explained in Section 3.1, the interpretation of PE-based measures is not straightforward. While the MdAPE is resistant to outliers, it is not sufficiently informative, as it is insensitive to APEs which lie above the median. Also, PE-based measures produce a biased comparison, since the improvement on the real scale within each series is correlated markedly with the actual value. Therefore, applying percentage errors in the current setting leads to ambiguous results and to confusion in their interpretation. For example, for positive adjustments, the trimmed MAPE and the MdAPE suggest opposite rankings: while the trimmed MAPE shows a substantial worsening of the final forecast due to the judgmental adjustments, the MdAPE value points in the opposite direction.

The absolute scaled errors found using the MASE scheme (as described in Section 3.4) also follow a non-symmetrical distribution and can take extremely large values (Fig. 7) in short series where the MAE of the naïve forecast is smaller than the error of the judgmental forecast. For the adjustments data, the lengths of the series vary substantially, so the MASE is affected seriously by outliers. Fig. 8 shows that the use of the MAD/MEAN scheme instead of the MASE does not improve the properties of the distribution of the scaled errors. Table 3 shows that a trimmed version of the MAD/MEAN scheme gives the opposite rankings with regard to the overall accuracy of adjustments, which indicates that this scheme is highly unstable. Moreover, with such distributions, the use of trimming for either the MASE or the MAD/MEAN ratio leads to biased estimates, as was the case with the MAPE.

Fig. 7. Box-and-whisker plot for the absolute scaled errors found by the MASE scheme (log scale, zero-error forecasts excluded).

Fig. 8. Box-and-whisker plot for the absolute scaled errors found by the MAD/MEAN scheme (log scale, zero-error forecasts excluded).


Fig. 9 shows that the log-transformed relative absolute errors follow a symmetric distribution and contain outliers that are easier to detect and to eliminate. Based on the shape of the underlying distribution, it seems that using a 5% trimmed GMRAE would give a location estimate with a reasonable level of efficiency. Although the GMRAE measure is not vulnerable to outliers, its interpretation can present difficulties, for the reasons explained in Section 3.2.

Fig. 9. Box-and-whisker plot for the log-transformed relative absolute errors (using the statistical forecast as the benchmark).

Compared to the APEs and the absolute scaled errors, the log-transformed relative MAEs are not affected severely by outliers and have a more symmetrical distribution (Fig. 10). The AvgRelMAE can therefore serve as a more reliable indicator of changes in accuracy. At the same time, in terms of a linear loss function, the AvgRelMAE scheme represents the effectiveness of adjustments adequately and has a directly interpretable meaning.

Fig. 10. Box-and-whisker plot for the weighted log-transformed relative MAEs ($n_i \ln r_i$).

The AvgRelMAE results show improvements from both positive and negative adjustments, whereas according to the MAPE and the MASE, only negative adjustments improve the accuracy. For the whole sample, adjustments improve the MAE of the statistical forecast by 10%, on average. Positive adjustments are less accurate than negative adjustments and provide only minor improvements. To assess the significance of the changes in accuracy in terms of the MAE, we applied the two-sided Wilcoxon test to test the mean of the weighted log-transformed relative MAEs against zero. The p-value was less than 0.01 for the set containing the adjustments of both signs, less than 0.05 for the set containing only positive adjustments, and less than $2.2 \cdot 10^{-16}$ for the set containing only negative adjustments.

To determine whether the probability of a successful adjustment is higher than 0.5, the two-sided binomial test was applied. The results are shown in Table 4.

Table 4: Results of using the binomial test to analyse the frequency of a successful adjustment

Adjustment   Total number of   Adjustments that     p-value   Probability of a        95% confidence interval for the
sign         adjustments       improved forecast              successful adjustment   probability of a successful adjustment
------------------------------------------------------------------------------------------------------------------------------
Positive     3394              1815                 <0.001    0.535                   0.518–0.552
Negative     1385              915                  <0.001    0.661                   0.635–0.686
Both         4779              2730                 <0.001    0.571                   0.557–0.585

Based on the p-values obtained for each sample, it can be concluded that adjustments improved the accuracy of forecasts more frequently than they reduced it. However, the probability of a successful intervention was rather low for positive adjustments.
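A binomial test of this kind can be reproduced with SciPy. The sketch below is our own illustration using the counts from Table 4; it assumes SciPy 1.7 or later for scipy.stats.binomtest, and its confidence intervals (Clopper-Pearson by default) may differ slightly from those reported in the table.

```python
# Sketch reproducing the kind of analysis in Table 4 (counts taken from the table).
from scipy.stats import binomtest

for sign, total, improved in [("Positive", 3394, 1815),
                              ("Negative", 1385, 915),
                              ("Both", 4779, 2730)]:
    res = binomtest(improved, total, p=0.5, alternative="two-sided")
    ci = res.proportion_ci(confidence_level=0.95)
    print(f"{sign:8s} p = {res.pvalue:.3g}, "
          f"share = {improved / total:.3f}, "
          f"95% CI = ({ci.low:.3f}, {ci.high:.3f})")
```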

6. Conclusions

The appropriate measurement of forecasting accuracy is important in many organizational settings, and is not of merely academic interest. Where an inappropriate error measure is used, the consequences can be the adoption of a poor forecasting process. In addition, forecasters can be rewarded or penalized (appropriately or not) for their performance, and this is evaluated through the organization's choice of error measure. Due to the specific features of SKU-level demand data, many well-known error measures are not appropriate for use in evaluating the effectiveness of adjustments. This is especially true for fast-moving fashionable products. In particular, the use of percentage errors is not advisable because of the considerable proportion of low actual values, which lead to high percentage errors with no direct interpretation for practical use. Moreover, the errors corresponding to adjustments of different signs are penalised differently when using percentage errors, because the forecasting errors are correlated with both the actual demand values and the adjustment sign. As a result, measures such as the MAPE and the MdAPE do not provide a sufficient indication of the effectiveness of adjustments in terms of a linear loss function. Similar arguments were also found to apply to the calculation of the MASE, which can also induce biases and outliers as a result of using the arithmetic mean to average relative quantities. Thus, an organization which determines its forecast improvement strategy based on an inadequate measure will misallocate its resources, and will therefore fail in its objective of improving accuracy at the SKU level.

In order to overcome the disadvantages of existing measures, it is recommended that an average relative MAE be used, calculated as the geometric mean of relative MAE values. This scheme allows for the objective comparison of forecasts, and is more reliable for the analysis of adjustments.

For the empirical dataset, the analysis has shown that adjustments improved accuracy in terms of the average relative MAE (AvgRelMAE) by approximately 10%. For the same dataset, a range of well-known error measures, including the MAPE, MdAPE, GMRAE, MASE, and the MAD/MEAN ratio, indicated conflicting results. The MAPE-based results suggested that, on the whole, adjustments did not improve the accuracy, while the MdAPE results showed a substantial improvement (dropping from approximately 25% to 20%). The analysis using the MASE and the MAD/MEAN ratio was complicated, due to a highly skewed underlying distribution, and did not allow any firm conclusions to be reached. The GMRAE showed that adjustments improved the accuracy by 13%, a result that is close to that obtained using the AvgRelMAE. Since analyses based on different measures can lead to different conclusions, it is important to have a clear understanding of the statistical properties of any error measure used. We have described various undesirable effects that complicate the interpretation of the well-known error measures. As an improved scheme which is appropriate for evaluating changes in accuracy under linear loss, we recommend using the AvgRelMAE. The generalisation of this scheme to other loss functions can be obtained straightforwardly.

One question that arises after the analysis of the accuracy of judgmental adjustments is whether or not these adjustments are systematically biased and whether we can improve them using some statistical calibration. A number of studies have now been conducted to address these questions (see, e.g., Davydenko and Fildes, 2008; Fildes et al., 2009; Franses and Legerstee, 2010; Trapero, Fildes, and Davydenko, 2011), and it has been found that judgmental adjustments do contain persistent systematic errors. Although this topic is outside the scope of the current paper, we think that any study of procedures for the correction of judgmental forecasts should contain a thorough analysis of accuracy based on appropriate and well-justified error measures.


The process by which a new error measure is developed and accepted by an organisation has not received any research attention. A case in point is intermittent demand, where service improvements can be achieved, but only by abandoning the standard error metrics and replacing them with service-level objectives (Syntetos & Boylan, 2005). When an organisation and those to whom the forecasting function reports insist on retaining the MAPE or similar (as will mostly be the case), the forecaster's objective must then shift to delivering to the organisation's chosen performance measure, whilst using a more appropriate measure, such as the AvgRelMAE, to interpret what is really going on with the data. In essence, the forecaster cannot reasonably rely solely on the organisation's measure and expect to achieve a cost-effective result.


Appendix A. Alternative representation of MASE

According to Hyndman and Koehler (2006), for the scenario when forecasts are made from varying origins but with a constant horizon (here taken as 1), the scaled error is defined as³

q_{i,t} = \frac{e_{i,t}}{\mathrm{MAE}_i^b}, \qquad \mathrm{MAE}_i^b = \frac{1}{l_i - 1}\sum_{j=2}^{l_i} |Y_{i,j} - Y_{i,j-1}|,

where $\mathrm{MAE}_i^b$ is the MAE from the benchmark (naïve) method for series $i$, $e_{i,t}$ is the error of a forecast being evaluated against the benchmark for series $i$ and period $t$, $l_i$ is the number of elements in series $i$, and $Y_{i,j}$ is the actual value observed at time $j$ for series $i$.

Let the mean absolute scaled error (MASE) be calculated by averaging the absolute scaled errors across time periods and time series:

\mathrm{MASE} = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} \sum_{t \in T_i} \frac{|e_{i,t}|}{\mathrm{MAE}_i^b},

where $n_i$ is the number of available values of $e_{i,t}$ for series $i$, $m$ is the total number of series, and $T_i$ is a set containing the time periods for which the errors $e_{i,t}$ are available for series $i$. Then,

\mathrm{MASE} = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} \sum_{t \in T_i} \frac{|e_{i,t}|}{\mathrm{MAE}_i^b}
             = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} \frac{\sum_{t \in T_i} |e_{i,t}|}{\mathrm{MAE}_i^b}
             = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i \, \frac{\frac{1}{n_i}\sum_{t \in T_i} |e_{i,t}|}{\mathrm{MAE}_i^b}
             = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} n_i r_i, \qquad r_i = \frac{\mathrm{MAE}_i}{\mathrm{MAE}_i^b},

where $\mathrm{MAE}_i$ is the MAE for series $i$ for the forecast being evaluated against the benchmark.

³ The formula corresponds to the software implementation described by Hyndman and Khandakar (2008).
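The identity can also be verified numerically. The following sketch is our own illustration with randomly generated errors and benchmark MAEs; it checks that pooling the absolute scaled errors gives the same value as the $n_i$-weighted average of the per-series relative MAEs.

```python
# Numerical check of the identity above: MASE computed by pooling absolute
# scaled errors equals the n_i-weighted average of the relative MAEs r_i.
import numpy as np

rng = np.random.default_rng(3)
m = 5                                                # number of series
errors = [rng.normal(0, 3, rng.integers(4, 10)) for _ in range(m)]
mae_b = rng.uniform(1, 5, m)                         # benchmark (naive) MAEs

n = np.array([len(e) for e in errors], dtype=float)
pooled = sum(np.sum(np.abs(e)) / b for e, b in zip(errors, mae_b)) / n.sum()
r = np.array([np.mean(np.abs(e)) / b for e, b in zip(errors, mae_b)])
weighted = np.sum(n * r) / n.sum()
print(np.isclose(pooled, weighted))                  # True
```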


References

Alkhazaleh, A. M. H., & Razali, A. M. (2010). New technique to estimate the asymmetric trimming mean. Journal of Probability and Statistics, vol. 2010.

Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H., & Tukey, J. W. (1972). Robust estimates of location. Princeton, NJ: Princeton University Press.

Armstrong, J. S. (1985). Long-range forecasting: From crystal ball to computer. New York: John Wiley.

Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting, 8, 69-80.

Armstrong, J. S., & Fildes, R. (1995). Correspondence on the selection of error measures for comparisons among forecasting methods. Journal of Forecasting, 14(1), 67-71.

Chatfield, C. (2001). Time-series forecasting. Chapman & Hall.

Davydenko, A., & Fildes, R. (2008, June 22-25). Models for product demand forecasting with the use of judgmental adjustments to statistical forecasts. Paper presented at the 28th International Symposium on Forecasting (ISF2008), Nice. Retrieved from http://www.forecasters.org/submissions08/DAVYDENKOANDREYISF2008.pdf

Davydenko, A., & Fildes, R. (2013). Measuring forecasting accuracy: The case of judgmental adjustments to SKU-level demand forecasts. International Journal of Forecasting, 29(3), 510-522.

Diebold, F. X. (1993). On the limitations of comparing mean square forecast errors: Comment. Journal of Forecasting, 12, 641-642.

Fildes, R. (1992). The evaluation of extrapolative forecasting methods. International Journal of Forecasting, 8(1), 81-98.

Fildes, R., & Goodwin, P. (2007). Against your better judgment? How organizations can improve their use of management judgment in forecasting. Interfaces, 37, 570-576.

Fildes, R., Goodwin, P., Lawrence, M., & Nikolopoulos, K. (2009). Effective forecasting and judgmental adjustments: An empirical evaluation and strategies for improvement in supply-chain planning. International Journal of Forecasting, 25(1), 3-23.

Fleming, P. J., & Wallace, J. J. (1986). How not to lie with statistics: The correct way to summarize benchmark results. Communications of the ACM, 29(3), 218-221.

Franses, P. H., & Legerstee, R. (2010). Do experts' adjustments on model-based SKU-level forecasts improve forecast quality? Journal of Forecasting, 29, 331-340.

Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. International Journal of Forecasting, 4, 405-408.

Hill, M., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data. Biometrics, 38, 377-396.

Hoover, J. (2006). Measuring forecast accuracy: Omissions in today's forecasting engines and demand-planning software. Foresight: The International Journal of Applied Forecasting, 4, 32-35.

Hyndman, R. J. (2006). Another look at forecast-accuracy metrics for intermittent demand. Foresight: The International Journal of Applied Forecasting, 4(4), 43-46.

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3).

Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688.

Kolassa, S., & Schutz, W. (2007). Advantages of the MAD/MEAN ratio over the MAPE. Foresight: The International Journal of Applied Forecasting, 6, 40-43.

Makridakis, S. (1993). Accuracy measures: Theoretical and practical concerns. International Journal of Forecasting, 9, 527-529.

Marques, C. R., Neves, P. D., & Sarmento, L. M. (2000). Evaluating core inflation indicators. Working Paper 3-00, Economics Research Department, Banco de Portugal.

Mathews, B., & Diamantopoulos, A. (1987). Alternative indicators of forecast revision and improvement. Marketing Intelligence, 5(2), 20-23.

McCarthy, T. M., Davis, D. F., Golicic, S. L., & Mentzer, J. T. (2006). The evolution of sales forecasting management: A 20-year longitudinal study of forecasting practice. Journal of Forecasting, 25, 303-324.

Mudholkar, G. S. (1983). Fisher's z-transformation. Encyclopedia of Statistical Sciences, 3, 130-135.

Sanders, N., & Ritzman, L. (2004). Integrating judgmental and quantitative forecasts: Methodologies for pooling marketing and operations information. International Journal of Operations and Production Management, 24, 514-529.

Spizman, L., & Weinstein, M. (2008). A note on utilizing the geometric mean: When, why and how the forensic economist should employ the geometric mean. Journal of Legal Economics, 15(1), 43-55.

Syntetos, A. A., & Boylan, J. E. (2005). The accuracy of intermittent demand estimates. International Journal of Forecasting, 21(2), 303-314.

Trapero, J. R., Fildes, R., & Davydenko, A. (2011). Non-linear identification of judgmental forecasts at SKU-level. Journal of Forecasting, 30(5), 490-508.

Trapero, J. R., Pedregal, D. J., Fildes, R., & Weller, M. (2011). Analysis of judgmental adjustments in presence of promotions. Paper presented at the 31st International Symposium on Forecasting (ISF2011), Prague.

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1130.

Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA: Academic Press.

Wilcox, R. R. (2005). Trimmed means. Encyclopedia of Statistics in Behavioral Science, 4, 2066-2067.

Zellner, A. (1986). A tale of forecasting 1001 series: The Bayesian knight strikes again. International Journal of Forecasting, 2, 491-494.
