The 2 m temperature anomalies from the reforecasts of the CNR-ISAC and ECMWF monthly prediction systems have been combined in a multimodel super-ensemble. Tercile probability predictions obtained from the multimodel have been constructed using both direct model output (DMO) and model output statistics (MOS), namely logistic and nonhomogeneous Gaussian regression, for the 1990–2010 winter seasons. Verification against ERA-Interim reanalyses indicates that logistic regression gives the best results in terms of ranked probability skill score (RPSS) and reliability diagrams for low to medium forecast probabilities. It is also argued that the logistic regression would not yield further improvements if a larger dataset were used.

Multimodel ensemble forecasting has been proven to be successful on different space–time scales (e.g. Casanova and Ahrens, 2009). Using both a synthetic forecast generator and a seasonal forecast dataset, Weigel et al. (2008) showed that multimodel combination reduces overconfidence: the ensemble spread is widened while the average ensemble-mean error is reduced, implying a net gain in prediction skill. One of the issues addressed by the sub-seasonal to seasonal prediction (S2S) research project, whose goal is to improve forecast skill and understanding on this timescale (Vitart et al., 2017), is the assessment of the benefits of a multimodel forecast, and of how it can be constructed and implemented. The S2S database provides a set of reforecasts over a past period for model calibration purposes. Each model participating in the S2S database provides a large number of reforecasts, issued on many initialization dates but with a typically small number of ensemble members. The question then arises of how best to exploit these reforecast sets in order to improve the skill of multimodel combinations, in particular as concerns probability predictions.

In this work, we address this question by constructing a multimodel combination of two of the S2S models, namely the ECMWF-IFS and the CNR-ISAC monthly forecasting systems. When dealing with a small number of ensemble members, as in the present case, it must be taken into account that model output statistics (MOS) based on ensemble-mean quantities, and calibrated over a long retrospective dataset of reforecasts, can provide skilful probabilistic predictions that may be competitive with those obtained by direct model output (DMO) counting algorithms. Whitaker et al. (2006) give a clear example of MOS techniques applied to multimodel reforecast datasets obtained from the ECMWF-IFS and NCEP-GFS global models. The main purpose of this work is thus to assess which techniques work best in this particular context.

The paper is organized as follows. In Sect. 2, we describe the datasets. In this study, we limit our attention to the 2 m temperature prediction for the winter season. In Sect. 3, we illustrate how the two-model ensemble is devised. Rather than constructing a poor man's or simple multimodel, as is customary in similar studies, we optimally combine the two models in a super-ensemble by computing the (unconstrained) weighting coefficients via a linear regression. In Sects. 4 and 5, we compare the skill of different techniques used to predict the probability that the 2 m temperature falls in a given tercile. In Sect. 6, by means of learning curves, we assess the convergence of the results. Conclusions are presented in Sect. 7.

The reforecast simulations of the CNR-ISAC monthly prediction system, as originally created for calibration purposes in the S2S database (Vitart et al., 2017), constitute a “fixed” dataset. It covers the 30-year period from 1981 to 2010, with initialization dates every 5 days from 1 January to 27 December of each year (in leap years, a 6-day gap between 25 February and 2 March is used). One simulation is obtained by initializing the GLOBO model (Malguzzi et al., 2011) with ERA-Interim (Dee et al., 2011) reanalysis data at 00:00 UTC of each date. For this study, the reforecast dataset has been enlarged to a 5-member lagged ensemble in which the four additional initial conditions are taken from ERA-Interim at the initialization date plus or minus 6 and 12 h. We restrict the analysis to the Northern Hemisphere winter season by choosing initialization dates in the months of December, January, and February.

The ECMWF-IFS monthly forecasting system currently produces reforecasts “on the fly”, covering the past twenty years, twice a week (every Monday and Thursday). For this study, we consider the 5-member ensemble generated with the ECMWF-IFS up to version CY40R1 (e.g. Vitart, 2014). Reforecasts of 2 m temperature are downloaded from the MARS archive at a resolution of 1.5°.

By focussing on the winter season only, we can find 268 common initialization dates between the GLOBO and IFS reforecasts for the years between 1990 and 2011. However, the 1990–1991, 1991–1992, and 2010–2011 winter seasons contain only a few common dates, so they are excluded from the subsequent analysis, leaving a total of 258 cases distributed over 18 winters (from a minimum of 10 to a maximum of 15 cases per winter).

In both datasets, the 2 m temperature is taken at 00:00 and 12:00 UTC for a forecast period of 28 days, and then averaged over forecast weeks 1 to 4. The same weekly averages are computed from the ERA-Interim reanalysis of 2 m temperature on the same 1.5° grid.

Figure 1. Sum of the regression coefficients.

In order to compute the multimodel anomaly fields, some preliminary operations have to be performed. We start by taking the ensemble means of both the CNR-ISAC and ECMWF-IFS models and their anomalies with respect to the corresponding model climate, which is a function of validity time. The multimodel anomaly is then obtained as a linear combination of the two ensemble-mean anomalies, with unconstrained weights determined at each grid point and lead week by least-squares regression against the ERA-Interim anomaly.
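As an illustration of the combination step, the following minimal sketch fits the two unconstrained weights by least squares at a single grid point and lead week; the synthetic data, function, and variable names are ours, not the authors':

```python
import numpy as np

def fit_multimodel_weights(x_isac, x_ifs, y_obs):
    """Least-squares weights for combining two ensemble-mean anomaly
    series (one grid point, one lead week). Returns (a_isac, a_ifs)."""
    X = np.column_stack([x_isac, x_ifs])       # (n_cases, 2) predictor matrix
    coeffs, *_ = np.linalg.lstsq(X, y_obs, rcond=None)
    return coeffs                              # unconstrained regression weights

# synthetic example: observed anomaly correlated with both models
rng = np.random.default_rng(0)
truth = rng.standard_normal(258)               # stand-in for ERA-Interim anomalies
x1 = truth + 0.8 * rng.standard_normal(258)    # "CNR-ISAC" ensemble-mean anomaly
x2 = truth + 0.6 * rng.standard_normal(258)    # "ECMWF-IFS" ensemble-mean anomaly
w = fit_multimodel_weights(x1, x2, truth)
y_mm = w[0] * x1 + w[1] * x2                   # multimodel anomaly
```

In this setting the sum `w[0] + w[1]` is the quantity mapped in Fig. 1: when it is close to zero, the regression has effectively fallen back on climatology.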

The sum of the regression coefficients, displayed in Fig. 1, deserves some consideration. This quantity does not have a particular physical meaning, but a rule of thumb can be derived: when the sum is close to zero, climatology is the best possible forecast given our data. Some interesting features can be seen in the maps: there are two maxima, over part of Asia and over Antarctica, where the sum reaches values greater than 1.5 and increases with the week number. Some other maxima, with smaller values, can be seen over the oceans near the equator.

The multimodel performance can be evaluated by adopting a cross-validation approach (Wilks, 2011), which divides the whole dataset into a training and a validation portion. The dataset is split into single boreal winter seasons, each of them including December from one year and January and February from the subsequent one. Each winter is used in turn as the validation set, while the remaining winters form the training set.
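The leave-one-winter-out splitting can be sketched as follows (a minimal illustration; the winter labels and function name are hypothetical):

```python
import numpy as np

def leave_one_winter_out(winters):
    """Yield (train_idx, valid_idx) pairs: each winter in turn is the
    validation set, all the other winters form the training set."""
    winters = np.asarray(winters)
    for w in np.unique(winters):
        valid = np.where(winters == w)[0]
        train = np.where(winters != w)[0]
        yield train, valid

# toy example: 6 cases spread over 3 winters (label = year of the December)
labels = [1993, 1993, 1994, 1994, 1995, 1995]
splits = list(leave_one_winter_out(labels))
```

Grouping by winter, rather than by individual case, keeps highly correlated dates from the same season out of both portions at once.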

The multimodel skill in terms of anomaly correlation (AC), computed over typical spatial regions, is shown in Table 1. The chosen regions include the extra-tropical Northern Hemisphere (NH, 20 to 90° N).

Table 1. Anomaly correlation coefficient averaged over the 18 validation winters. The four pairs of columns report the spatial averages over the different regions described in the text. Each pair contains the results for the multimodel (MM) and the best result between the two single models (BM). The first column reports the lead week for the entire row. The value corresponding to the best performance is highlighted in bold.

A comprehensive forecasting system for a long timescale such as the sub-seasonal one should also convey probabilistic information. In this section, we introduce the techniques used to compute the predicted tercile probabilities, namely the probability that the 2 m temperature anomaly falls in each of the three regions corresponding to the first, second, and third tercile of the super-ensemble climatological distribution. We use two different groups of techniques, described in the next subsections. Examples of the use of the same techniques can be found in Hamill et al. (2004), Whitaker et al. (2006), Wilks (2006), and Wilks and Hamill (2007).

In the direct model output (DMO) techniques, the probability estimation is based on a direct count of the ensemble members that are below a given threshold. There are different ways to define the super-ensemble members (Stefanova and Krishnamurti, 2002); here, an alternative approach is proposed, in which all possible pairs of members of the two model ensembles are linearly combined with weights given by the corresponding regression coefficients.

The components of the DMO group are the democratic voting (DV) method and the Tukey plotting position (TPP). In both cases, the probability is estimated from the number of super-ensemble members falling below the tercile threshold: DV uses the raw fraction of members, while TPP applies a plotting-position correction that avoids forecast probabilities of exactly 0 or 1.

Another possibility is to define the super-ensemble members as the union of the members of each model, weighted by the corresponding regression coefficient. In this case, the DV probability estimate becomes the weighted fraction of members below the threshold, i.e. the sum of the weights of the members below the threshold divided by the sum of all the weights.
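A minimal sketch of this weighted counting, assuming each member carries the regression coefficient of its model (all names and numerical values below are illustrative):

```python
import numpy as np

def weighted_dv(members, weights, threshold):
    """Democratic-voting probability that the verifying anomaly falls
    below `threshold`, with each super-ensemble member weighted by the
    regression coefficient of the model it belongs to."""
    members = np.asarray(members, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * (members <= threshold)) / np.sum(weights))

# union of 5 "ISAC" members (weight 0.6) and 5 "IFS" members (weight 0.5)
m = [-1.2, -0.3, 0.1, 0.4, 0.9, -0.8, -0.1, 0.2, 0.5, 1.1]
w = [0.6] * 5 + [0.5] * 5
p_lower = weighted_dv(m, w, threshold=-0.43)   # lower-tercile probability
```

With equal weights the expression reduces to the plain DV member count divided by the ensemble size.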

Table 2. Ranked probability skill score (RPSS) computed with the Tukey plotting position (TPP), the logistic regression (LR1), and the nonhomogeneous Gaussian regression (NGR).

Logistic regression (LR) is the first model output statistic (MOS) that we consider. We test three “hypothesis functions”, in which the probability is computed starting from different combinations of the super-ensemble statistics; in the simplest one, denoted LR1, the super-ensemble mean is the only predictor.
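As an illustration of the ensemble-mean-only case, the following sketch fits P(anomaly below the tercile threshold) = sigmoid(b0 + b1·x̄) by plain gradient descent on synthetic data; the fitting method and all names are ours, and the actual estimation procedure used operationally may differ:

```python
import numpy as np

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(event) = sigmoid(b0 + b1*x) by gradient descent on the
    log-likelihood (x = super-ensemble mean, y = 1 if the observation
    fell in the tercile category, else 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
        g0, g1 = np.mean(p - y), np.mean((p - y) * x)   # gradient of mean loss
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# synthetic data: low ensemble means favour the lower-tercile event
rng = np.random.default_rng(1)
xm = rng.standard_normal(500)                            # ensemble means
event = (xm + 0.7 * rng.standard_normal(500) < -0.43).astype(float)
b0, b1 = fit_logistic(xm, event)
p_hat = 1.0 / (1.0 + np.exp(-(b0 + b1 * xm)))            # fitted probabilities
```

By construction, the sigmoid keeps every forecast probability strictly between 0 and 1, which is one reason LR handles the low-probability range gracefully.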

The other MOS that we test is the nonhomogeneous Gaussian regression (NGR), in which the predictive distribution is a Gaussian whose mean is a linear function of the ensemble mean and whose variance is a linear function of the ensemble variance; the tercile probabilities are then obtained by evaluating the Gaussian cumulative distribution function at the climatological tercile thresholds.
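A minimal sketch of how tercile probabilities follow from such a predictive Gaussian, assuming the coefficients (a, b, c, d) have already been fitted on the training set (e.g. by minimizing the CRPS, as in Gneiting et al., 2005); all numerical values below are illustrative:

```python
import math

def ngr_tercile_probs(xbar, s2, a, b, c, d, q1, q2):
    """Tercile probabilities from a nonhomogeneous Gaussian regression:
    predictive distribution N(a + b*xbar, c + d*s2), evaluated at the
    climatological tercile thresholds q1 < q2."""
    mu = a + b * xbar                    # mean: linear in the ensemble mean
    sigma = math.sqrt(c + d * s2)        # variance: linear in the ensemble variance
    cdf = lambda z: 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))
    p1 = cdf(q1)                         # lower tercile
    p2 = cdf(q2) - p1                    # middle tercile
    return p1, p2, 1.0 - p1 - p2         # upper tercile by complement

p = ngr_tercile_probs(xbar=-0.5, s2=0.4, a=0.0, b=0.9, c=0.1, d=0.8,
                      q1=-0.43, q2=0.43)
```

Unlike LR, which fits each threshold probability separately, NGR produces a single coherent predictive distribution for all terciles at once.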

The aim here is to evaluate the skill of the probability hindcasts defined above. For the calibration and verification of the probability hindcasts obtained with the MOS techniques, we adopt the same cross-validation approach used in Sect. 3. The ranked probability skill score (RPSS) is computed as a single-value estimate over all terciles. Then, we examine the reliability diagrams for the lower tercile in order to analyse the full joint distribution of forecasts and verifying reanalyses. Examples of the use of these verification tools can be found in Hamill et al. (2004), Wilks (2006), and Wilks and Hamill (2007); a full explanation is presented in Wilks (2011).
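For reference, the RPS/RPSS computation over the three ordered tercile categories can be sketched as follows, taking equal-probability climatology as the reference forecast (the toy forecasts are illustrative):

```python
import numpy as np

def rps(forecast_probs, obs_category):
    """Ranked probability score for one forecast over ordered categories
    (here the three terciles): squared distance between the cumulative
    forecast and cumulative observation vectors."""
    F = np.cumsum(forecast_probs)
    O = np.cumsum(np.eye(len(forecast_probs))[obs_category])  # one-hot, cumulated
    return float(np.sum((F - O) ** 2))

def rpss(forecasts, observations, ref_probs=(1/3, 1/3, 1/3)):
    """Skill score relative to the climatological reference:
    1 - mean(RPS_forecast) / mean(RPS_climatology)."""
    rps_f = np.mean([rps(p, o) for p, o in zip(forecasts, observations)])
    rps_c = np.mean([rps(ref_probs, o) for o in observations])
    return 1.0 - rps_f / rps_c

fc = [(0.6, 0.3, 0.1), (0.2, 0.5, 0.3)]   # tercile probability forecasts
ob = [0, 1]                               # observed tercile indices
score = rpss(fc, ob)
```

A score of 1 means a perfect forecast, 0 means no improvement over climatology, and negative values mean the forecast is worse than climatology.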

Figure 2. Reliability diagrams for the four methods (see text) applied to the lower-tercile multimodel prediction. The four panels refer to forecast weeks 1 to 4 (average over the whole globe and over the 18 validation winters). Democratic voting (DV): green; Tukey plotting position (TPP): cyan; logistic regression (LR): red; nonhomogeneous Gaussian regression (NGR): blue.

Figure 3. Learning curve for the LR (using the ensemble mean only) for weeks 1 to 4.

Table 2 reports the values of RPSS obtained in three cases. For the DMO group, only TPP is reported because it generally gives slightly higher scores than DV. For the MOS group, LR1 (logistic regression with the super-ensemble mean as the only predictor) and NGR are considered. As can be seen from Table 2, LR1 shows the best performance in all four chosen regions and for all forecast periods. It is interesting to notice that over the Euro-Atlantic region, while the DMO techniques give almost no predictability after the second week, the use of LR1 significantly improves the skill that can be extracted from the multimodel.

In addition to RPSS, we show the reliability diagrams for the lower tercile in Fig. 2 where, for brevity, we condense into a single set of graphs, and for the whole globe, the diagrams obtained using DV, TPP, LR1, and NGR; very similar results are obtained for the upper tercile (not shown). Overall, the multimodel probabilistic prediction overforecasts the observed occurrences, especially in the first two weeks for medium to high probabilities. In the remaining weeks, the overforecasting tendency is reduced for medium probabilities, while underforecasting appears with the DMO techniques for the lowest probability, which is also the most frequent, as indicated in the bar graphs showing the refinement distributions.

The advantage of the MOS techniques is evident along the whole forecasting range for all the probabilities except the highest one (0.8–1.0). In the first week, LR1 (the red curve) is the closest to the bisector of the quadrant, improving the low reliability of the medium forecast probabilities. In the remaining weeks, the MOS techniques reduce the lowest probability frequency (see the refinement distributions), improving the resolution of the forecast distributions and their reliability for low–medium forecast probabilities. In particular, the underforecasting resulting from the DMO techniques for the lowest probability is completely removed. Between the two MOS techniques, LR1 systematically outperforms NGR for the 0.0–0.6 probability range.

For the highest probability value considered, DMO methods outperform MOS techniques. Specifically, in the extended range, LR1 seems unable to produce high-probability forecasts, as evident from the strong reduction of the associated forecast frequency.

In summary, LR1 drastically improves the low-probability forecasts, slightly improves the medium-probability forecasts, and is unable to provide skilful high-probability forecasts.

After having shown that the LR technique provides the best performance, except for the high-probability forecasts, our aim in this section is to determine the possible occurrence of over-fitting or under-fitting, and hence the potential for improving this particular technique. The learning-curve method consists of repeating the computation of the LR coefficients for training sets of different size (10, 20, 30, 40, 50, 100, 150, and 200 elements). With the recomputed LR coefficients, we evaluate the RPSS averaged over the globe for each of the four forecast weeks. The results are summarized in Fig. 3. As expected, for small training sets the algorithm performs very well in the training phase, while in validation it shows poor results. The two values become closer as the training set grows and, above a threshold of about 100 elements, the scores remain nearly constant. This indicates that the LR has extracted all the information it can from the data. However, it cannot be excluded that some additional information exists that LR is not capable of capturing, i.e. that LR is under-fitting. In that case, the results would not improve by adding more training elements, and a different solution should be adopted, such as a more complex fitting.
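The learning-curve diagnostic can be sketched generically as follows, with ordinary least squares standing in for the LR fit and the error metric, and purely synthetic data (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(400)
y = 0.8 * x + 0.6 * rng.standard_normal(400)   # toy predictor/target pairs
x_val, y_val = x[300:], y[300:]                # fixed validation set

def mse(b, xs, ys):
    """Mean squared error of the no-intercept linear fit y = b*x."""
    return float(np.mean((ys - b * xs) ** 2))

curve = []
for n in (10, 20, 50, 100, 200, 300):          # growing training sets
    b = np.sum(x[:n] * y[:n]) / np.sum(x[:n] ** 2)   # OLS slope on first n cases
    curve.append((n, mse(b, x[:n], y[:n]), mse(b, x_val, y_val)))
```

Plotting the training and validation errors against n reproduces the qualitative picture of Fig. 3: the two curves converge as the training set grows, and once they flatten, adding data no longer helps.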

The aim of this study is the evaluation of the predictive skill of the super-ensemble obtained by combining through linear regression the reforecasts of the ECMWF-IFS and CNR-ISAC monthly prediction systems. The main focus is, in particular, on probabilistic wintertime 2 m temperature predictions: we compare different techniques for extracting forecast tercile probabilities from the multimodel ensemble, namely direct model output (DMO) and model output statistics (MOS). We conclude that the logistic regression based on the ensemble mean as only predictor gives the best performance in terms of RPSS and reliability diagrams for low–medium forecast probabilities.

We finally analyse the behaviour of logistic regression with different dimensions of the training dataset. We conclude that no further improvements are possible from an extension of the dataset, unless a more complex algorithm is used.

The results presented here depend on the rather limited number of ensemble members at our disposal. Conclusions may be quite different if more members are added to each single model, or if more models are considered. However, the outcomes obtained in the rather simple framework described here suggest that the multimodel ensemble could be easily implemented to improve the CNR-ISAC operational monthly forecasts.

The ECMWF reforecasts and ERA-Interim reanalyses were downloaded, after login, through the MARS catalogue.

The authors declare that they have no conflict of interest.

The authors wish to thank an anonymous reviewer for the useful suggestions that helped to improve the manuscript. This work was partly supported by the Italian Civil Protection Agency.

Edited by: Á. G. Muñoz
Reviewed by: two anonymous referees