Quantifying and Testing Accuracy

The various results are organized in tables that can be copied and pasted into external applications. As shown in the video, one can focus the evaluation on a particular subsample by clicking on the Filter sample button in the upper-right corner of the table:

A. Quantifying

The forecast errors obtained for a given subsample are averaged in different ways, as shown in the following table:

                        | Root Mean Squared | Mean Absolute     | Median Absolute
Scale-dependent errors  | RMSE              | MAE               | MdAE
(Percentage) errors     | RMS(P)E           | symmetric MA(P)E  | symmetric MdA(P)E
(Scaled) errors         | RMS(S)E           | MA(S)E            | MdA(S)E
  • The first row represents the widely used Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Median Absolute Error (MdAE). For precise definitions, see for example Hyndman and Koehler (2006). Given that these measures depend on the scale of the data, they are useful only for comparing different methods that forecast the same set of variables.
  • In order to compare different methods across data sets with different scales, one can look at percentage errors (second row). The percentage error is undefined if the variable happens to take values at or very close to zero. Moreover, it has been argued that the MA(P)E and MdA(P)E put a heavier penalty on positive errors than on negative errors. This problem is solved as in Makridakis (1993) by slightly modifying the definition of the percentage error (the “symmetric” absolute percentage error is defined by dividing the absolute error by the sum of the actual value and the forecast).
  • The third row contains scaled errors, which is the solution proposed by Hyndman and Koehler (2006) to quantify forecasting accuracy independently of the scale of the data. In practice, the scaled errors are obtained by dividing the errors by the mean absolute error of the random walk over the evaluation sample. Therefore, the three measures will be smaller than one if the forecasts are more precise than those of a random walk (a code sketch of the three blocks of measures follows this list).
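As an illustration of the three blocks of measures, the sketch below computes them for a vector of actual values and the corresponding out-of-sample forecasts. It is a simplified reading of the definitions above rather than the JDemetra+ implementation: the function name is ours, the symmetric percentage error follows the text literally (absolute error divided by the sum of actual and forecast), and the random-walk scaling uses the first differences of the actual values over the evaluation sample.

```python
import numpy as np

# Minimal sketch (not the JDemetra+ code) of the measures in the table above.
def accuracy_measures(actual, forecast):
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    e = actual - forecast                               # forecast errors

    # Scale-dependent measures
    rmse = np.sqrt(np.mean(e ** 2))
    mae = np.mean(np.abs(e))
    mdae = np.median(np.abs(e))

    # Percentage errors (undefined when the actual value is zero) and the
    # "symmetric" variant of Makridakis (1993) used for the absolute measures
    pe = e / actual
    spe = np.abs(e) / (actual + forecast)
    rmspe = np.sqrt(np.mean(pe ** 2))
    smape = np.mean(spe)
    smdape = np.median(spe)

    # Scaled errors (Hyndman and Koehler, 2006): errors divided by the mean
    # absolute error of the random walk over the evaluation sample
    scale = np.mean(np.abs(np.diff(actual)))
    q = e / scale
    rmsse = np.sqrt(np.mean(q ** 2))
    mase = np.mean(np.abs(q))
    mdase = np.median(np.abs(q))

    return {"RMSE": rmse, "MAE": mae, "MdAE": mdae,
            "RMS(P)E": rmspe, "sMA(P)E": smape, "sMdA(P)E": smdape,
            "RMS(S)E": rmsse, "MA(S)E": mase, "MdA(S)E": mdase}
```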

Our proposed way to eliminate the scale dependency is to use an ARIMA model as a naïve benchmark against which our forecasts can be compared. Thus, the three blocks of measures obtained for a given nowcasting model are divided by those resulting from a univariate benchmark (ARIMA) that is recursively specified and estimated for each time series by the most recent implementation of the TRAMO algorithm.
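Continuing the sketch above (the variable names actual, model_forecast and arima_forecast are hypothetical), the relative measures are simple ratios, so values below one indicate that the nowcasting model beats the benchmark:

```python
# Hypothetical usage: each measure of the nowcasting model is divided by the
# corresponding measure of the recursively estimated ARIMA (TRAMO) benchmark.
model_scores = accuracy_measures(actual, model_forecast)
benchmark_scores = accuracy_measures(actual, arima_forecast)
relative_scores = {k: model_scores[k] / benchmark_scores[k] for k in model_scores}
# relative_scores["RMSE"] < 1  =>  the model is more accurate than the ARIMA benchmark
```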

B. Testing

The algorithms used in the current version of JDemetra+ represent the first attempt to let users decide whether the nowcasts produced at a given point in time are significantly different from those coming from a univariate benchmark. For detailed information, click on the technical note “A forecasting evaluation library for JDemetra+”.

Overview of all tests

Summing up, we follow the original formulation of the Diebold-Mariano (henceforth DM) test, which looks at the forecast error differentials to test the null hypothesis that the expected difference between our (e.g. squared) errors and those of a benchmark (i.e. the loss differential) is equal to zero. The test statistic is defined as the sample average of the loss differential, which should be close to zero under the null, divided by its standard error (based on the Newey-West HAC estimator of the variance), and it converges asymptotically to a standard normal distribution.
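The sketch below illustrates those ingredients under a squared-error loss. It is a minimal rendering of the standard DM statistic, not the JDemetra+ code: the function name, the choice of loss and the rule-of-thumb bandwidth are our own assumptions.

```python
import numpy as np
from math import erf, sqrt

# Minimal sketch of the standard DM test with squared-error loss: the average
# loss differential divided by its Newey-West HAC standard error, compared with
# the standard normal distribution.
def diebold_mariano(e_model, e_benchmark, lags=None):
    d = np.asarray(e_model, dtype=float) ** 2 - np.asarray(e_benchmark, dtype=float) ** 2
    n = d.size
    d_bar = d.mean()

    # Newey-West (Bartlett kernel) estimator of the long-run variance of d
    if lags is None:
        lags = int(np.floor(n ** (1 / 3)))           # rule-of-thumb bandwidth (assumption)
    d_c = d - d_bar
    lrv = np.dot(d_c, d_c) / n
    for k in range(1, lags + 1):
        gamma_k = np.dot(d_c[k:], d_c[:-k]) / n
        lrv += 2 * (1 - k / (lags + 1)) * gamma_k

    dm_stat = d_bar / sqrt(lrv / n)
    p_value = 1 - erf(abs(dm_stat) / sqrt(2))        # two-sided standard normal p-value
    return dm_stat, p_value
```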

The DM test has been used since 1995 in spite of its well-known size distortion (i.e. the null hypothesis of equal forecasting accuracy is rejected by mistake more often than the nominal significance level would suggest). The oversize problem, which is more visible in small samples, is fixed by deviating from standard asymptotic theory: we follow the proposal by Coroneo and Iacone (2016), which is referred to as Fixed Smoothing Asymptotics (henceforth, FSA). Based on this idea, our evaluation of point forecasts is reported as follows:

Test | Null hypothesis | Report | Colour code for rejection
Diebold-Mariano (Standard vs FSA) | equality in forecasting accuracy | p-value | blue
Forecast encompasses Benchmark (FSA) | our model predictions encompass the benchmark forecasts | weight of the benchmark | red (weight significantly different from zero)
Forecast is encompassed by Benchmark (FSA) | our model predictions are encompassed by those of the benchmark | weight of our forecast | green (weight significantly different from zero)
Bias (FSA) | our forecasts are not biased (i.e. the average error is equal to zero) | average error | red
Autocorrelation (FSA) | the forecast errors are not autocorrelated | autocorrelation of errors | red
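To fix ideas on the two encompassing rows, the reported weight can be read as the coefficient of the usual forecast combination regression, in which the error of one forecast is regressed on the difference between the two errors. The sketch below estimates that weight by least squares; it is an illustration of this standard formulation, not necessarily the exact JDemetra+ implementation, and it omits the FSA inference used to judge significance.

```python
import numpy as np

# Encompassing weight under the standard combination regression
#   e_model = w * (e_model - e_benchmark) + u,
# where w is the weight a forecast combination would assign to the benchmark.
# A w that is not significantly different from zero means that our model
# encompasses the benchmark (adding it would not improve the combination).
def encompassing_weight(e_model, e_benchmark):
    e_model = np.asarray(e_model, dtype=float)
    x = e_model - np.asarray(e_benchmark, dtype=float)
    return float(np.dot(x, e_model) / np.dot(x, x))  # OLS slope without intercept
```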

Warnings:

  • As highlighted by Diebold (2015), one should be warned against the temptation of using the DM test for model comparisons. The test, which is based on a given subsample of simulated out-of-sample forecast errors, is mainly useful to evaluate predictive accuracy during particular historical episodes.
  • The graphical interface allows the user to change the evaluation period with a few clicks. Thus, it is very easy to check whether or not the results are robust to the addition or deletion of parts of the sample. By ruling out the possibility that the results are driven by the choice of the evaluation sample, one can guard against the oversize problem. Formal robust tests for evaluating accuracy have been independently proposed by Hansen and Timmermann (2011) and Rossi and Inoue (2012).

Visualization in JDemetra+

The figure below shows the test results, which are displayed in the lower part of the table.

  • As shown below for the case of German GDP growth, the p-values of the DM test are slightly below 0.2 when the forecast is made from 68 to 60 days before the end of the quarter. Results for information sets ranging from -59 to 0 (end of the quarter) and from +1 to +44 (one day before the official GDP release) are not visible in this image, but they can be displayed within JDemetra+ by moving the lower bar to the right.
  • When the FSA-DM test is used, the exact p-values are not displayed. In this case, we simply fill the cells with an increasingly saturated colour tone when the p-values fall within the intervals [0.2, 0.1), [0.1, 0.05) and [0.05, 0), respectively (a schematic mapping is sketched after this list). No colour is applied when the p-value is larger than 0.2, which happens to be the case in our example.
  • As mentioned in the summary table above, the values reported for the test “our model encompasses the benchmark” correspond to the weight of the benchmark, which turns out to be close to zero in our example, so we do not reject the null hypothesis.
  • The next row corresponds to the test “our model is encompassed by the benchmark” and contains the weights associated with our model, which become increasingly close to one as we approach the end of the quarter we are forecasting. In the example, 66 days before the end of the quarter the green colour indicates that the null hypothesis is rejected.
  • The last two rows test whether the bias and the autocorrelation of the errors are significant, which is not the case in our example.
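For reference, the colour coding of the FSA tests described above amounts to the following mapping. The exact boundary handling and the rendering belong to the JDemetra+ interface, so this is only a schematic reading of the intervals quoted in the list.

```python
def pvalue_band(p):
    """Schematic mapping of an FSA p-value to a colour saturation band."""
    if p > 0.2:
        return "no colour"
    if p > 0.1:
        return "light"       # interval [0.2, 0.1) quoted in the text
    if p > 0.05:
        return "medium"      # interval [0.1, 0.05)
    return "strong"          # interval [0.05, 0)
```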

For further clarifications regarding the interpretation of the tests and their implementation, please read the technical note “A forecasting evaluation library for JDemetra+”.

References used