{"title":"Discussion on Assessing Predictability of Environmental Time Series With Statistical and Machine Learning Models","authors":"Ansgar Steland","doi":"10.1002/env.70003","DOIUrl":null,"url":null,"abstract":"<p>I congratulate the authors for their interesting and insightful paper. Their findings contribute to the ongoing discussion of the pros and cons of machine learning methods and statistical approaches in environmetrics. In my discussion, I want to address some issues in the design of the comparison study and the interpretation of the results, and add some general thoughts. Moreover, I report about my own analyses of the Indiana citizen science data set used by the authors. Specifically, I developed a non-linear ARIMA regression model with improved heteroscedasticity-consistent uncertainty estimation, which turns out to substantially outperform the best method of (Bonas et al. <span>2025</span>). I also applied a couple of machine learning approaches not examined there, which I regard as quite accessible general purpose machine learning methods. One of those methods is recommended by the early 2025 state-of-the-art AI large language models. I report about the interesting results in the last section.</p><p>The authors argue that for non-linear and non-stationary processes, machine learning methods are typically superior due to their non-parametric structure. But here one should recall that the classical but still popular class of feedforward artificial neural networks trained by backpropagation is non-linear least squares regression with a certain non-linear regression function known up to a parameter vector. In its entirety, this was mainly recognized and elaborated by the econometrician Halbert White, see White (<span>1992</span>). And, of course, there are many methods classified as traditional statistical approaches which are non-parametric as well. The differences are more with respect to the dimensionality and sample size, and how the methods deal with it. Many machine learning methods such as artificial neural networks fit high-dimensional overparametrized parametric models, whereas statistical methods usually work with specifically chosen stochastic models aiming at parsimony. Such ML methods process high-dimensional inputs as given, and the purpose of the first stage of a model is to learn features from the input data, sometimes being agnostic with respect to the type and meaning of the input variables. The guiding principle is that one only fixes the basic topology (e.g., fully-connected layers or convolutional layers), and the model can extract optimal features provided the learning sample is large enough. This is in contrast to statistical approaches, which tend to design a model for specific types of inputs and deal with high dimensionality by suitable preprocessing steps, human-controlled feature generation and/or specific assumptions and related estimation methods, especially sparse models and sparse estimation, which combine estimation and variable selection. A further distinction is that statistical models and modeling using stochastic processes often impose a certain structure, which is derived from domain knowledge (e.g., physical theory), and are developed and studied to understand the data generating process. 
Often, model components and parameters have clear interpretations, and one follows the parsimony principle (Occam's razor) that a model (and, more generally, a scientific theory/explanation) should be as complicated as necessary, but as simple as possible. Non-parametric methods typically assume much less prior knowledge, and often they are designed having in mind clear interpretations as well. However, there is a substantial overlap and intersection between machine learning and statistical methods. Many approaches are extensively used and studied in both fields, for example, ridge and lasso regression, decision tree methods and (orthogonal) basis expansions. Further, more and more hybrid approaches are developed and studied, which fuse statistical and machine learning approaches and thinking.</p><p>In their study (Bonas et al. <span>2025</span>), observed that the machine learning methods provide better predictions than the statistical approaches they considered. They partially attribute this inability of statistical models to certain stylized facts of the data. Especially, they argue that environmental data such as temperature or wind often show up long-memory effects as visible in the plots of the autocorrelation function (ACF), which are beyond the scope of statistical models such as ARIMA models. Firstly, as I discuss below to some extent, one may easily find ARIMA regression models which outperform the best method of (Bonas et al. <span>2025</span>) (Echo-State-Network). Secondly, as I show in the next section, the observed long-memory is likely to be spurious and thus cannot explain underperformance.</p><p>The authors conclude from the ACF plots, which show a very slow decay of autocorrelations, that the temperature and wind series of the Indiana citizen science data set, especially temperature, have long memory. They argue that therefore statistical time series models assuming short memory (such as ARIMA) fail.</p><p>I think this conclusion is doubtful. Clearly, wind and temperature belong to the variables that could possess long memory. But long memory and non-stationarity are very difficult to separate, since the latter influences the classical quantities to define and analyze long memory. Slowly decaying autocorrelations can also be explained by (random) level shifts or smooth trends. When naively calculating acf functions from raw data, the series are centered at their arithmetic mean calculated from the whole data. But when the mean is not stationary, this induces biased estimates of the autocorrelation function. Further, tests for non-stationarity, for example, change-point tests, tend to reject the null hypothesis of no change if the underlying series has long memory.</p><p>When <i>locally</i> demeaning the temperature and wind series by subtracting the average of the most recent three measurements, a large part of the slow decay is removed, as can be seen from Figure 1, which should be compared to Figure 4 in (Bonas et al. <span>2025</span>). Correcting by an average calculated over larger windows provides essentially the same result. Bonas et al. (<span>2025</span>) corrected for non-stationarity by only removing the 12 and 24 h harmonics, and this might not be sufficient.</p><p>If one relies on a time series linear regression with ARIMA errors, as in (Bonas et al. <span>2025</span>), one should estimate the autocorrelations from the residuals. When analyzing the residuals from the prediction model of (Bonas et al. <span>2025</span>), the picture is essentially the same. 
Actually, it even provides a better removal of non-stationarity, as can be seen from the bottom row of Figure 1.</p><p>Although there is some general evidence that temperature and wind might be long memory processes, I think that the data at hand can be suitably modeled by a conditional mean component disturbed by short memory noise.</p><p>Models including a linear predictor such as ARIMA regressions go far beyond the simple inclusion of the input variables as given in the data set to be analyzed. Indeed, a suitable choice of standard non-linear transformations often leads to a substantial improvement and a performance which is comparable to methods which learn non-linear influences of the inputs from the data. This especially applies to data sets with moderate sample sizes, where non-parametric estimation of the true non-linear relationships with high-accuracy is hard per se (as indicated by the known slow convergence rates).</p><p>Non-linear transformations well-proven in practice are power transformations such as <span></span><math>\n <semantics>\n <mrow>\n <msup>\n <mrow>\n <mi>x</mi>\n </mrow>\n <mrow>\n <mn>2</mn>\n </mrow>\n </msup>\n <mo>,</mo>\n <msup>\n <mrow>\n <mi>x</mi>\n </mrow>\n <mrow>\n <mn>3</mn>\n </mrow>\n </msup>\n <mo>,</mo>\n <mi>…</mi>\n </mrow>\n <annotation>$$ {x}^2,{x}^3,\\dots $$</annotation>\n </semantics></math> and their linear combination, that is, polynomials. They are mathematically justified by Taylor series expansions of smooth functions. Moreover, logarithms, <span></span><math>\n <semantics>\n <mrow>\n <mi>log</mi>\n <mo>(</mo>\n <mi>x</mi>\n <mo>)</mo>\n </mrow>\n <annotation>$$ \\log (x) $$</annotation>\n </semantics></math> or <span></span><math>\n <semantics>\n <mrow>\n <mi>log</mi>\n <mo>(</mo>\n <mi>x</mi>\n <mo>+</mo>\n <mn>1</mn>\n <mo>)</mo>\n </mrow>\n <annotation>$$ \\log \\left(x+1\\right) $$</annotation>\n </semantics></math>, with respect to some suitable basis can be considered as well as square roots, <span></span><math>\n <semantics>\n <mrow>\n <msqrt>\n <mrow>\n <mi>x</mi>\n </mrow>\n </msqrt>\n </mrow>\n <annotation>$$ \\sqrt{x} $$</annotation>\n </semantics></math>, or, more generally, fractional roots.</p><p>Another important fact is that often <i>increments</i> instead of <i>levels</i> matter and are key variables for explanation and prediction, respectively. Then first-order increments, <span></span><math>\n <semantics>\n <mrow>\n <mi>Δ</mi>\n <mi>x</mi>\n </mrow>\n <annotation>$$ \\Delta x $$</annotation>\n </semantics></math>, should be considered. For example, there are physical arguments to analyze increments of wind speed: Local thermal wind is known to be the result of temperature changes. Again, going beyond simple differences, one may take powers, logs and roots of <span></span><math>\n <semantics>\n <mrow>\n <mi>Δ</mi>\n <mi>x</mi>\n </mrow>\n <annotation>$$ \\Delta x $$</annotation>\n </semantics></math> to study non-linear effects.</p><p>In other problems, <i>percentages</i>, <span></span><math>\n <semantics>\n <mrow>\n <mi>Δ</mi>\n <mi>x</mi>\n <mo>/</mo>\n <mi>x</mi>\n </mrow>\n <annotation>$$ \\Delta x/x $$</annotation>\n </semantics></math>, and non-linear transformations thereof may be the regressors one needs to analyze instead of levels or differences. 
In environmetrics, this specifically applies when analyzing monetary variables, for example, the costs associated with <span></span><math>\n <semantics>\n <mrow>\n <mi>C</mi>\n <msub>\n <mrow>\n <mi>O</mi>\n </mrow>\n <mrow>\n <mn>2</mn>\n </mrow>\n </msub>\n </mrow>\n <annotation>$$ C{O}_2 $$</annotation>\n </semantics></math> emissions or damage due to heavy rain, or the value of renewable energies (wind, solar) which result from wind and temperature, respectively. Here, one can draw from econometrics where usually log returns are considered and form stationary series, whereas prices are usually non-stationary.</p><p>Asymmetry effects can be analyzed with indicator variables which are one if <span></span><math>\n <semantics>\n <mrow>\n <mi>x</mi>\n </mrow>\n <annotation>$$ x $$</annotation>\n </semantics></math> exceeds a certain quantile and zero, otherwise. In a similar way, change-points, <span></span><math>\n <semantics>\n <mrow>\n <mi>τ</mi>\n </mrow>\n <annotation>$$ \\tau $$</annotation>\n </semantics></math>, can be modeled by adding an indicator to the linear predictor which is one if <span></span><math>\n <semantics>\n <mrow>\n <mi>i</mi>\n <mo>≥</mo>\n <mi>τ</mi>\n </mrow>\n <annotation>$$ i\\ge \\tau $$</annotation>\n </semantics></math> and zero, otherwise.</p><p>To summarize, experimenting with those standard non-linear transformations can lead to parsimonious and interpretable prediction models performing on par with highly sophisticated approaches. Specifically, in a separate paper motivated by this discussion, I propose a Non-linear Robust ARIMA method which outperforms the echo-state neural networks found by the authors to be the best for the Indian citizen data.</p><p>The authors publish the data sets they used as benchmark sets for the comparison study as well as large parts of the code used for the analyses.<sup>1</sup> This is highly appreciated, since it allows for further studies. As it seems, although there are many institutions which provide environmental data for the public, there are not yet established repositories of environmental benchmark data sets. From a more statistical viewpoint, the purpose of such data sets is mainly to illustrate a method and its properties, whereas comparisons with competitors are usually done by simulation studies for well-selected distributional models. Analyzing example data is just one example, but simulating 10,000 samples from, say, five data generating processes provides 50,000 examples. But in machine learning, example data sets really matter, and, usually, a (new) method is only accepted if it performs well on example data sets.</p><p>In (Bonas et al. <span>2025</span>) the data is divided into a learning sample for training (estimation) and a sample for testing. Unfortunately, an explicit validation sample is missing. Although many methods internally apply cross-validation techniques to have validation data at their disposal, a further validation sample is preferable. Since the validation samples generated by cross-validation methods (or similar approaches having different names) are a (randomly chosen) part of the training sample, the validation results are still more or less in-sample estimates. 
Evaluating a model using a validation sample prevents overfitting and overly optimistic in-sample summary statistics (MSE, <span></span><math>\n <semantics>\n <mrow>\n <msup>\n <mrow>\n <mi>R</mi>\n </mrow>\n <mrow>\n <mn>2</mn>\n </mrow>\n </msup>\n </mrow>\n <annotation>$$ {R}^2 $$</annotation>\n </semantics></math>, CPRS, etc.). In machine learning, validation samples are often used to evaluate a model during training, to select hyperparameters by further optimizations, and/or select a model (class). Ideally, test samples should only be used to evaluate the final model. Moreover, having access to an additional validation sample allows the identification of various critical issues, especially non-stationarity or change-points after the training period, which makes extrapolation a delicate endeavor, and the identification of poor coverage of confidence and prediction intervals, respectively. Further discussions of the role of validation samples and its usage for evaluation, model selection and constructing valid prediction intervals, can be found in (Dewolf et al. <span>2023</span>; Tian et al. <span>2022</span>; Steland and Pieters <span>2022</span>), and the references given there.</p><p>Certainly, any statistician, whether an applied one or a mathematical statistician, agrees that the quantification of the uncertainty of a data-based decision is crucial, which has been historically addressed by statistics as a science.</p><p>In terms of statistical theory, this usually breaks down to hypothesis testing in one of the many flavors of statistical decision science, and to reporting confidence resp. prediction intervals or credible intervals of some kind, when it comes to estimation of parameters or prediction of random variables.</p><p>Concerning machine learning, the situation has been considerably improved, and the authors have contributed to the development of a proper method of uncertainty quantification for a class of echo-state neural networks.</p><p>In a comparison study, where a test sample is defined, one essentially looks at the performance measures such as coverage probability or interval score. But in a real-world application there is no test sample, and, given the measurements up to now, one looks at the prediction and the associated prediction interval quantifying the uncertainty. Now the question arises, how to interpret the prediction interval.</p><p>However, there are various sources of uncertainty resulting in relevant variation, and depending on the selected method of prediction the interpretation of the prediction interval may differ as well. Important sources of variation are uncertainty with respect to the choice of a model class (e.g., parametric regression or non-parametric regression model), uncertainty with respect to the concrete specification within a model (e.g., selection of the right regressors), and uncertainty in terms of the estimation error. In environmetrics, the location error also arises, namely when the measurement locations differ from the locations where we want to predict, (Cressie and Kornack <span>2003</span>).</p><p>Concerning the interpretation of a prediction interval, statistical approaches can broadly be classified as being of a frequentist nature or having a Bayesian flavor. In frequentist statistics, given covariates, the predictor is unknown but fixed and disturbed by a random error term yielding the response, and a prediction interval for the predictor is a random interval which may or may not cover the true predictor. 
In a Bayesian framework, it may also be a random quantity having its own distribution, and the prediction interval may consist of quantiles of its posterior distribution, allowing the interpretation that the probability that the predictor lies in the prediction interval equals the required level. Clearly, when prediction of the response is of interest, the influence of the error is taken into account in both approaches.</p><p>In ensemble methods, the prediction interval is obtained from a constructed or simulated set of ensemble predictions. The ensemble of prediction models can and is often interpreted as a set of experts delivering different predictions, which are then combined to produce a better prediction by leveraging the swarm intelligence of the ensemble. Now, one generates an ensemble of ensemble predictions to construct a prediction interval. Here, the prediction interval may have the interpretation of a range of possible ensemble predictions of a hypothesized universe of ensemble experts, excluding the most extreme experts.</p><p>Finally, the rise of generative models in machine learning has led to approaches to uncertainty quantification where given the data, the model-inherent generative sampling mechanism is used to produce a set of outputs representing the uncertainty.</p><p>To summarize, one needs to take a closer look at how the prediction interval is calculated, in order to understand its interpretation.</p><p>The question arises as to whether the differences between prediction models in terms of the mean squared error are statistically significant in some sense. Especially, since for some pairs of methods the performance measures are quite close to each other, I want to discuss this issue for the mean squared error in prediction, because it is quite commonly used among researchers and practitioners. For large classes of models and methods of statistical estimation, asymptotic tests for pairwise comparisons of dynamic non-linear models have been proposed in (Rivers and Vuong <span>2002</span>) based on in-sample as well as out-of-sample lack-of-fit criteria. The latter approach uses the model fit of the training sample and evaluates its performance in the test sample.</p><p>As an example, let us consider the temperature measurements of the Indiana citizen science data set. The in-sample MSE of the ARIMA prediction model of (Bonas et al. <span>2025</span>) is <span></span><math>\n <semantics>\n <mrow>\n <mn>48</mn>\n <mo>.</mo>\n <mn>01</mn>\n </mrow>\n <annotation>$$ 48.01 $$</annotation>\n </semantics></math>, and a <span></span><math>\n <semantics>\n <mrow>\n <mn>95</mn>\n <mo>%</mo>\n </mrow>\n <annotation>$$ 95\\% $$</annotation>\n </semantics></math> confidence interval is given by <span></span><math>\n <semantics>\n <mrow>\n <mo>[</mo>\n <mn>36</mn>\n <mo>.</mo>\n <mn>03</mn>\n <mo>,</mo>\n <mn>60</mn>\n <mo>.</mo>\n <mn>9</mn>\n <mo>]</mo>\n </mrow>\n <annotation>$$ \\left[36.03,60.9\\right] $$</annotation>\n </semantics></math>. When analyzing the out-of-sample performance of the model fit using the test sample of size <span></span><math>\n <semantics>\n <mrow>\n <mn>100</mn>\n </mrow>\n <annotation>$$ 100 $$</annotation>\n </semantics></math>, one obtains for the ARIMA prediction model of (Bonas et al. 
<span>2025</span>) the confidence interval <span></span><math>\n <semantics>\n <mrow>\n <mo>[</mo>\n <mn>31</mn>\n <mo>.</mo>\n <mn>18</mn>\n <mo>,</mo>\n <mn>58</mn>\n <mo>.</mo>\n <mn>12</mn>\n <mo>]</mo>\n </mrow>\n <annotation>$$ \\left[31.18,58.12\\right] $$</annotation>\n </semantics></math> around the out-of-sample MSE <span></span><math>\n <semantics>\n <mrow>\n <mn>44</mn>\n <mo>.</mo>\n <mn>65</mn>\n </mrow>\n <annotation>$$ 44.65 $$</annotation>\n </semantics></math>, which differs from the MSE obtained when refitting the model at each time point using all available data. Both confidence intervals were calculated using a Bartlett-type estimator of the long-run-variance as above. This shows that the uncertainty of the MSE is quite large, and one cannot be sure whether the corresponding out-of-sample MSEs of, say, ARIMA and Echo-State-Network, are significantly different and thus due to a better expected performance of echo state networks, or whether they differ simply due to randomness such that their ordering can not be interpreted for the data at hand.</p><p>To obtain highly selective results, one needs to consider larger training samples than considered here, possibly in the range of several years.</p><p>This section reports about the analyses I conducted for the Indiana citizen science data. First, I present a suitable non-linear ARIMA regression model. Then, I briefly describe some machine learning methods I consider to offer a good compromise between flexibility, ease of handling and required computational costs for predicting environmental data on a not-too-large scale.</p><p>The first approach is an improvement of extreme gradient boosting and belongs to the class of ensemble methods which combine decision trees, an idea which can be traced back at least to random forests (Breiman <span>2001</span>), which are still popular and widely used. The second approach is a nearest-neighbor approach to regression. <span></span><math>\n <semantics>\n <mrow>\n <mi>k</mi>\n </mrow>\n <annotation>$$ k $$</annotation>\n </semantics></math>-nearest neighbor methods identify those <span></span><math>\n <semantics>\n <mrow>\n <mi>k</mi>\n </mrow>\n <annotation>$$ k $$</annotation>\n </semantics></math> cases from the training sample, which are closest to a given data vector for which we need a prediction. Such methods are memory-intensive and thus infeasible for large data sets, but otherwise may perform well. Finally, let us consider quantile regression forests which allow to model and learn quantiles of a distribution. In this way, one can easily integrate mean predictions and prediction intervals. Quantile regression forests is the method advocated by the state-of-the-art large language models from OpenAI, Google, and DeepSeek.</p><p>The results of all procedures for the Indiana citizen data are shown in Table 1. All performance metrics are comparable with those of (Bonas et al. <span>2025</span>) except the continuous rank probability score (CRPS). Thus, for better comparison, the table reproduces results for the ARIMA method considered therein but with CRPS using own calculations. Predictions and prediction intervals were sequentially calculated for the test sample. 
This means, to predict at the <span></span><math>\n <semantics>\n <mrow>\n <mi>i</mi>\n <mtext>th</mtext>\n </mrow>\n <annotation>$$ i\\mathrm{th} $$</annotation>\n </semantics></math> time point of the test period, the test data from 1 to <span></span><math>\n <semantics>\n <mrow>\n <mi>i</mi>\n <mo>−</mo>\n <mn>1</mn>\n </mrow>\n <annotation>$$ i-1 $$</annotation>\n </semantics></math> were added to the training sample and the prediction model was refitted. Then the prediction and a prediction interval were calculated given the regressors at time <span></span><math>\n <semantics>\n <mrow>\n <mi>i</mi>\n </mrow>\n <annotation>$$ i $$</annotation>\n </semantics></math>. Code for some analyses can be found in the github repository https://github.com/stoch-aix/Forecast-Temperature-And-Wind.</p><p>The study in (Bonas et al. <span>2025</span>) concludes that for temperature prediction, ARIMA is the second best among the methods compared there, but for wind prediction three neural network approaches turned out to be better in terms of MSE, CRPS, coverage and interval score. Let us now compare these findings with the results of the methods considered here. We start with the best method found here, the NLR-ARIMA approach introduced above in detail.</p><p>Weather variables such as temperature and wind speed are complex time series, whose inherently non-stationary behavior is difficult to model. Besides seasonal patterns, the question arises whether they have short or long memory. It is argued that naive estimation of autocorrelations may yield to spurious findings, which can be avoided by proper correction of the (conditional) mean component. For the South Bend data under investigation, one can justify the assumption of a short memory.</p><p>To predict one-day-ahead temperature and wind speed based on daily measurements, various methods from statistics and machine learning can be used, but the performance can be poor when regressors, the model class, or hyperparameters are not well selected. Whereas many machine learning methods implicitly learn suitable transformations, this is not the case for time series regressions. However, a small number of standard non-linear transformations and suitable heteroscedasticity-consistent long-run-variance estimation may significantly help in finding models with competitive predictive power. This is demonstrated by developing a non-linear ARIMA regression, which significantly outperforms the best machine learning method, an echo-state neural network (ESN), found in (Bonas et al. <span>2025</span>). Furthermore, light gradient boosting and the classical method of nearest neighbor regression outperform all machine learning methods in (Bonas et al. <span>2025</span>) except the ESN approach for temperature prediction and are on par with ESN and non-linear ARIMA, respectively, when it comes to wind prediction.</p><p>One can summarize that for data as studied here, which is characterized by a relatively small sample size and low dimensionality, it seems to be difficult to find a machine learning method which can outperform ARIMA regressions relying on a handful of standard non-linearities. 
When the benefits of typical machine learning methods matter, such as their automatic generation of suitable features and their ability to easily integrate high-dimensional input variables and complex non-linearities, methods such as (LSTM or echo-state) neural networks, nearest neighbor regression, light gradient boosting and quantile regression forests, provide quite effective approaches and may be preferable.</p>","PeriodicalId":50512,"journal":{"name":"Environmetrics","volume":"36 2","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/env.70003","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Environmetrics","FirstCategoryId":"93","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/env.70003","RegionNum":3,"RegionCategory":"环境科学与生态学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENVIRONMENTAL SCIENCES","Score":null,"Total":0}
Abstract
I congratulate the authors for their interesting and insightful paper. Their findings contribute to the ongoing discussion of the pros and cons of machine learning methods and statistical approaches in environmetrics. In my discussion, I want to address some issues in the design of the comparison study and the interpretation of the results, and add some general thoughts. Moreover, I report on my own analyses of the Indiana citizen science data set used by the authors. Specifically, I developed a non-linear ARIMA regression model with improved heteroscedasticity-consistent uncertainty estimation, which turns out to substantially outperform the best method of Bonas et al. (2025). I also applied a couple of machine learning approaches not examined there, which I regard as quite accessible general-purpose machine learning methods. One of these methods is the one recommended by the state-of-the-art AI large language models of early 2025. I report on the interesting results in the last section.
The authors argue that for non-linear and non-stationary processes, machine learning methods are typically superior due to their non-parametric structure. But here one should recall that the classical but still popular class of feedforward artificial neural networks trained by backpropagation is non-linear least squares regression with a certain non-linear regression function known up to a parameter vector. This was recognized and elaborated in full mainly by the econometrician Halbert White; see White (1992). And, of course, there are many methods classified as traditional statistical approaches which are non-parametric as well. The differences lie more in the dimensionality and sample size, and in how the methods deal with them. Many machine learning methods such as artificial neural networks fit high-dimensional overparametrized parametric models, whereas statistical methods usually work with specifically chosen stochastic models aiming at parsimony. Such ML methods process high-dimensional inputs as given, and the purpose of the first stage of a model is to learn features from the input data, sometimes agnostically with respect to the type and meaning of the input variables. The guiding principle is that one only fixes the basic topology (e.g., fully connected layers or convolutional layers), and the model can extract optimal features provided the learning sample is large enough. This is in contrast to statistical approaches, which tend to design a model for specific types of inputs and deal with high dimensionality by suitable preprocessing steps, human-controlled feature generation, and/or specific assumptions and related estimation methods, especially sparse models and sparse estimation, which combine estimation and variable selection. A further distinction is that statistical models and modeling using stochastic processes often impose a certain structure, which is derived from domain knowledge (e.g., physical theory), and are developed and studied to understand the data-generating process.
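To make this point concrete, here is a minimal sketch (my own illustration, not code from the paper or from White (1992)) fitting a one-hidden-layer feedforward network by non-linear least squares; the toy data, the number of hidden units, and the tanh activation are arbitrary choices.

```python
# Sketch: a one-hidden-layer feedforward network is non-linear least
# squares regression with a regression function known up to theta.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(200)  # toy data

H = 5  # number of hidden units (arbitrary)

def net(theta, x):
    # theta = (W1, b1, w2, b2) flattened: H + H + H + 1 parameters
    W1, b1, w2, b2 = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[-1]
    return np.tanh(np.outer(x, W1) + b1) @ w2 + b2

def residuals(theta):
    return y - net(theta, x)

theta0 = 0.1 * rng.standard_normal(3 * H + 1)
fit = least_squares(residuals, theta0)  # "training" = non-linear least squares
print("residual sum of squares:", 2 * fit.cost)
```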
In their study, Bonas et al. (2025) observed that the machine learning methods provide better predictions than the statistical approaches they considered. They partially attribute this weaker performance of the statistical models to certain stylized facts of the data. In particular, they argue that environmental data such as temperature or wind often exhibit long-memory effects, visible in plots of the autocorrelation function (ACF), which are beyond the scope of statistical models such as ARIMA models. Firstly, as I discuss below to some extent, one may easily find ARIMA regression models which outperform the best method of Bonas et al. (2025), the Echo-State-Network. Secondly, as I show in the next section, the observed long memory is likely to be spurious and thus cannot explain the underperformance.
The authors conclude from the ACF plots, which show a very slow decay of autocorrelations, that the temperature and wind series of the Indiana citizen science data set, especially temperature, have long memory. They argue that, as a consequence, statistical time series models assuming short memory (such as ARIMA) fail.
I think this conclusion is doubtful. Clearly, wind and temperature are among the variables that could possess long memory. But long memory and non-stationarity are very difficult to separate, since the latter influences the classical quantities used to define and analyze long memory. Slowly decaying autocorrelations can also be explained by (random) level shifts or smooth trends. When naively calculating the ACF from raw data, the series is centered at its arithmetic mean computed from the whole sample. But when the mean is not stationary, this induces biased estimates of the autocorrelation function. Further, tests for non-stationarity, for example change-point tests, tend to reject the null hypothesis of no change if the underlying series has long memory.
When locally demeaning the temperature and wind series by subtracting the average of the most recent three measurements, a large part of the slow decay is removed, as can be seen from Figure 1, which should be compared with Figure 4 in Bonas et al. (2025). Correcting by an average calculated over larger windows yields essentially the same result. Bonas et al. (2025) corrected for non-stationarity only by removing the 12 and 24 h harmonics, and this might not be sufficient.
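As an illustration of this effect, the following sketch (my own, using a synthetic series rather than the South Bend data) contrasts the naive ACF of a series with a slowly varying mean with the ACF after local demeaning; the three-observation trailing window mirrors the correction described above.

```python
# Sketch: spurious slow ACF decay induced by a slowly varying mean, and
# its removal by local demeaning. Synthetic data, not the South Bend series.
import numpy as np

def acf(z, max_lag=50):
    z = z - z.mean()                      # global centering (the naive estimate)
    denom = np.sum(z**2)
    return np.array([np.sum(z[k:] * z[:-k]) / denom for k in range(1, max_lag + 1)])

rng = np.random.default_rng(1)
n = 2000
trend = np.cumsum(0.05 * rng.standard_normal(n))   # slowly varying (random) level
x = trend + rng.standard_normal(n)                 # plus short-memory noise

# local demeaning: subtract the average of the three most recent observations
x_loc = x[3:] - np.array([x[i - 3:i].mean() for i in range(3, n)])

print("naive ACF at lag 20:     ", round(acf(x)[19], 3))
print("locally demeaned, lag 20:", round(acf(x_loc)[19], 3))
```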
If one relies on a time series linear regression with ARIMA errors, as in Bonas et al. (2025), one should estimate the autocorrelations from the residuals. When analyzing the residuals from the prediction model of Bonas et al. (2025), the picture is essentially the same. In fact, this even removes the non-stationarity better, as can be seen from the bottom row of Figure 1.
Although there is some general evidence that temperature and wind might be long memory processes, I think that the data at hand can be suitably modeled by a conditional mean component disturbed by short memory noise.
Models including a linear predictor, such as ARIMA regressions, go far beyond the simple inclusion of the input variables as given in the data set to be analyzed. Indeed, a suitable choice of standard non-linear transformations often leads to a substantial improvement and a performance comparable to that of methods which learn non-linear influences of the inputs from the data. This especially applies to data sets with moderate sample sizes, where non-parametric estimation of the true non-linear relationships with high accuracy is hard per se (as indicated by the known slow convergence rates).
Non-linear transformations well proven in practice are power transformations such as $x^2, x^3, \dots$ and their linear combinations, that is, polynomials. They are mathematically justified by Taylor series expansions of smooth functions. Moreover, logarithms, $\log(x)$ or $\log(x+1)$, with respect to some suitable base, can be considered, as well as square roots, $\sqrt{x}$, or, more generally, fractional roots.
Another important fact is that often increments instead of levels matter and are the key variables for explanation and prediction. Then first-order increments, $\Delta x$, should be considered. For example, there are physical arguments for analyzing increments of wind speed: local thermal wind is known to be the result of temperature changes. Again, going beyond simple differences, one may take powers, logs, and roots of $\Delta x$ to study non-linear effects.
In other problems, percentages, $\Delta x / x$, and non-linear transformations thereof may be the regressors one needs to analyze instead of levels or differences. In environmetrics, this specifically applies when analyzing monetary variables, for example the costs associated with CO$_2$ emissions or damage due to heavy rain, or the value of renewable energies (wind, solar), which results from wind and temperature, respectively. Here, one can draw from econometrics, where one usually considers log returns, which form stationary series, whereas prices are usually non-stationary.
Asymmetry effects can be analyzed with indicator variables which are one if $x$ exceeds a certain quantile and zero otherwise. In a similar way, change-points, $\tau$, can be modeled by adding an indicator to the linear predictor which is one if $i \ge \tau$ and zero otherwise.
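To illustrate, the following sketch (my own; the column names, the quantile level, and the guard against negative arguments are my choices) assembles the transformations discussed above into a regressor matrix that could feed an ARIMA regression.

```python
# Sketch: standard non-linear transformations as regressors.
import numpy as np
import pandas as pd

def make_features(x, quantile=0.9, tau=None):
    """x: pd.Series of a raw input variable; tau: optional change-point index."""
    X = pd.DataFrame(index=x.index)
    X["x"] = x
    X["x2"], X["x3"] = x**2, x**3                 # polynomial terms (Taylor expansion)
    X["log1p_x"] = np.log1p(x.clip(lower=0))      # log(x+1), guarded for x >= 0
    X["sqrt_x"] = np.sqrt(x.clip(lower=0))        # square root
    dx = x.diff()
    X["dx"] = dx                                  # first-order increments
    X["pct"] = dx / x.shift(1)                    # percentage changes
    X["exceed"] = (x > x.quantile(quantile)).astype(float)  # asymmetry indicator
    if tau is not None:                           # change-point indicator 1{i >= tau}
        X["after_tau"] = (np.arange(len(x)) >= tau).astype(float)
    return X.dropna()
```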
To summarize, experimenting with those standard non-linear transformations can lead to parsimonious and interpretable prediction models performing on par with highly sophisticated approaches. Specifically, in a separate paper motivated by this discussion, I propose a Non-linear Robust ARIMA method which outperforms the echo-state neural networks found by the authors to be the best for the Indiana citizen science data.
The authors publish the data sets they used as benchmark sets for the comparison study, as well as large parts of the code used for the analyses. This is highly appreciated, since it allows for further studies. It seems that, although there are many institutions which provide environmental data to the public, there are no established repositories of environmental benchmark data sets yet. From a more statistical viewpoint, the purpose of such data sets is mainly to illustrate a method and its properties, whereas comparisons with competitors are usually done by simulation studies for well-selected distributional models. An example data set provides just one example, whereas simulating 10,000 samples from, say, five data generating processes provides 50,000 examples. But in machine learning, example data sets really matter, and, usually, a (new) method is only accepted if it performs well on example data sets.
In Bonas et al. (2025), the data is divided into a learning sample for training (estimation) and a sample for testing. Unfortunately, an explicit validation sample is missing. Although many methods internally apply cross-validation techniques to have validation data at their disposal, a further validation sample is preferable. Since the validation samples generated by cross-validation methods (or similar approaches having different names) are a (randomly chosen) part of the training sample, the validation results are still more or less in-sample estimates. Evaluating a model using a validation sample prevents overfitting and overly optimistic in-sample summary statistics (MSE, $R^2$, CRPS, etc.). In machine learning, validation samples are often used to evaluate a model during training, to select hyperparameters by further optimization, and/or to select a model (class). Ideally, test samples should only be used to evaluate the final model. Moreover, having access to an additional validation sample allows the identification of various critical issues, especially non-stationarity or change-points after the training period, which make extrapolation a delicate endeavor, and the identification of poor coverage of confidence and prediction intervals, respectively. Further discussions of the role of validation samples and their usage for evaluation, model selection, and constructing valid prediction intervals can be found in Dewolf et al. (2023), Tian et al. (2022), and Steland and Pieters (2022), and the references given there.
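For time series, such a validation sample should respect the time order, as in the following minimal sketch (my own illustration; the split fractions are arbitrary choices, not those of any study discussed here).

```python
# Sketch: a chronological train/validation/test split for time series,
# as opposed to randomly chosen cross-validation folds.
import numpy as np

def chronological_split(y, train_frac=0.7, val_frac=0.15):
    n = len(y)
    i_train = int(train_frac * n)
    i_val = int((train_frac + val_frac) * n)
    # train: fit models; validation: tune hyperparameters / select a model;
    # test: touched only once, to evaluate the final model.
    return y[:i_train], y[i_train:i_val], y[i_val:]

y = np.arange(1000.0)                      # placeholder series
y_train, y_val, y_test = chronological_split(y)
print(len(y_train), len(y_val), len(y_test))
```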
Certainly, any statistician, whether an applied one or a mathematical statistician, agrees that the quantification of the uncertainty of a data-based decision is crucial, which has been historically addressed by statistics as a science.
In terms of statistical theory, this usually breaks down to hypothesis testing in one of the many flavors of statistical decision science, and to reporting confidence or prediction intervals, or credible intervals of some kind, when it comes to the estimation of parameters or the prediction of random variables.
Concerning machine learning, the situation has been considerably improved, and the authors have contributed to the development of a proper method of uncertainty quantification for a class of echo-state neural networks.
In a comparison study, where a test sample is defined, one essentially looks at performance measures such as coverage probability or interval score. But in a real-world application there is no test sample, and, given the measurements up to now, one looks at the prediction and the associated prediction interval quantifying the uncertainty. Now the question arises of how to interpret the prediction interval.
However, there are various sources of uncertainty resulting in relevant variation, and depending on the selected method of prediction, the interpretation of the prediction interval may differ as well. Important sources of variation are uncertainty with respect to the choice of a model class (e.g., parametric regression or non-parametric regression model), uncertainty with respect to the concrete specification within a model (e.g., selection of the right regressors), and uncertainty in terms of the estimation error. In environmetrics, the location error also arises, namely when the measurement locations differ from the locations where we want to predict (Cressie and Kornak 2003).
Concerning the interpretation of a prediction interval, statistical approaches can broadly be classified as being of a frequentist nature or having a Bayesian flavor. In frequentist statistics, given covariates, the predictor is unknown but fixed and disturbed by a random error term yielding the response, and a prediction interval for the predictor is a random interval which may or may not cover the true predictor. In a Bayesian framework, it may also be a random quantity having its own distribution, and the prediction interval may consist of quantiles of its posterior distribution, allowing the interpretation that the probability that the predictor lies in the prediction interval equals the required level. Clearly, when prediction of the response is of interest, the influence of the error is taken into account in both approaches.
In ensemble methods, the prediction interval is obtained from a constructed or simulated set of ensemble predictions. The ensemble of prediction models can be, and often is, interpreted as a set of experts delivering different predictions, which are then combined to produce a better prediction by leveraging the swarm intelligence of the ensemble. Now, one generates an ensemble of ensemble predictions to construct a prediction interval. Here, the prediction interval may be interpreted as a range of possible ensemble predictions from a hypothesized universe of ensemble experts, excluding the most extreme experts.
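As a toy illustration of this reading (my own sketch, using a bootstrap of linear fits rather than any of the ensembles compared in the paper), one may take empirical quantiles of the ensemble predictions, so that the interval excludes the most extreme experts:

```python
# Sketch: a prediction interval as a range over an ensemble of "experts".
import numpy as np

rng = np.random.default_rng(2)
x_tr = rng.uniform(0, 1, 100)
y_tr = 2 * x_tr + 0.3 * rng.standard_normal(100)   # toy training data
x_new = 0.5

preds = []
for _ in range(500):                               # 500 ensemble members
    idx = rng.integers(0, len(x_tr), len(x_tr))    # bootstrap resample
    slope, intercept = np.polyfit(x_tr[idx], y_tr[idx], 1)
    preds.append(intercept + slope * x_new)

lo, hi = np.quantile(preds, [0.025, 0.975])        # drop the extreme experts
print(f"ensemble 95% interval at x=0.5: [{lo:.3f}, {hi:.3f}]")
```

Note that this particular construction reflects only the variation across ensemble members, which is part of why the interpretation of such intervals requires care.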
Finally, the rise of generative models in machine learning has led to approaches to uncertainty quantification where given the data, the model-inherent generative sampling mechanism is used to produce a set of outputs representing the uncertainty.
To summarize, one needs to take a closer look at how the prediction interval is calculated, in order to understand its interpretation.
The question arises as to whether the differences between prediction models in terms of the mean squared error are statistically significant in some sense. In particular, since for some pairs of methods the performance measures are quite close to each other, I want to discuss this issue for the mean squared error in prediction, because it is quite commonly used among researchers and practitioners. For large classes of models and methods of statistical estimation, asymptotic tests for pairwise comparisons of dynamic non-linear models have been proposed by Rivers and Vuong (2002), based on in-sample as well as out-of-sample lack-of-fit criteria. The latter approach uses the model fit from the training sample and evaluates its performance on the test sample.
As an example, let us consider the temperature measurements of the Indiana citizen science data set. The in-sample MSE of the ARIMA prediction model of Bonas et al. (2025) is $48.01$, and a $95\%$ confidence interval is given by $[36.03, 60.9]$. When analyzing the out-of-sample performance of the model fit using the test sample of size $100$, one obtains for the ARIMA prediction model of Bonas et al. (2025) the confidence interval $[31.18, 58.12]$ around the out-of-sample MSE of $44.65$, which differs from the MSE obtained when refitting the model at each time point using all available data. Both confidence intervals were calculated using a Bartlett-type estimator of the long-run variance. This shows that the uncertainty of the MSE is quite large, and one cannot be sure whether the corresponding out-of-sample MSEs of, say, ARIMA and the Echo-State-Network are significantly different, and thus due to a better expected performance of echo-state networks, or whether they differ simply due to randomness, such that their ordering cannot be interpreted for the data at hand.
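For readers who want to reproduce such intervals on their own forecast errors, here is a minimal sketch (my own; the bandwidth is illustrative and not the choice behind the numbers above) of a confidence interval for the MSE based on a Bartlett-weighted long-run variance of the squared errors:

```python
# Sketch: 95% confidence interval for the MSE of one-step-ahead prediction
# errors, using a Bartlett-type (Newey-West) long-run variance estimator.
import numpy as np

def mse_confint(errors, bandwidth=10):
    s = errors**2
    n = len(s)
    mse = s.mean()
    d = s - mse
    lrv = np.sum(d * d) / n                              # lag-0 autocovariance
    for k in range(1, bandwidth + 1):
        gamma_k = np.sum(d[k:] * d[:-k]) / n             # lag-k autocovariance
        lrv += 2 * (1 - k / (bandwidth + 1)) * gamma_k   # Bartlett weight
    half = 1.96 * np.sqrt(lrv / n)                       # normal approximation
    return mse - half, mse + half
```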
To obtain results that discriminate more sharply between methods, one needs to consider larger training samples than those considered here, possibly spanning several years.
This section reports on the analyses I conducted for the Indiana citizen science data. First, I present a suitable non-linear ARIMA regression model. Then, I briefly describe some machine learning methods that I consider to offer a good compromise between flexibility, ease of handling, and computational cost for predicting environmental data on a not-too-large scale.
The first approach is an improvement of extreme gradient boosting and belongs to the class of ensemble methods which combine decision trees, an idea which can be traced back at least to random forests (Breiman 2001), which are still popular and widely used. The second approach is a nearest-neighbor approach to regression. $k$-nearest neighbor methods identify those $k$ cases from the training sample which are closest to a given data vector for which we need a prediction. Such methods are memory-intensive and thus infeasible for large data sets, but otherwise may perform well. Finally, let us consider quantile regression forests, which allow one to model and learn quantiles of a distribution. In this way, one can easily integrate mean predictions and prediction intervals. Quantile regression forests are the method advocated by the state-of-the-art large language models from OpenAI, Google, and DeepSeek.
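The following sketch (my own; the toy data and hyperparameters are illustrative and not those behind Table 1) shows how accessible the first two of these methods are, using LightGBM for light gradient boosting and scikit-learn for nearest-neighbor regression:

```python
# Sketch: light gradient boosting and k-nearest neighbor regression on toy data.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
import lightgbm as lgb

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 4))
y = X[:, 0] ** 2 + rng.standard_normal(500)        # non-linear toy relationship
X_train, X_test, y_train = X[:400], X[400:], y[:400]

knn = KNeighborsRegressor(n_neighbors=10, weights="distance").fit(X_train, y_train)
gbm = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05).fit(X_train, y_train)

knn_pred = knn.predict(X_test)
gbm_pred = gbm.predict(X_test)
```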
The results of all procedures for the Indiana citizen data are shown in Table 1. All performance metrics are comparable with those of Bonas et al. (2025) except the continuous ranked probability score (CRPS). Thus, for better comparison, the table reproduces the results for the ARIMA method considered therein, but with the CRPS based on my own calculations. Predictions and prediction intervals were calculated sequentially for the test sample. This means that, to predict at the $i$th time point of the test period, the test data from $1$ to $i-1$ were added to the training sample and the prediction model was refitted. Then the prediction and a prediction interval were calculated given the regressors at time $i$. Code for some of the analyses can be found in the GitHub repository https://github.com/stoch-aix/Forecast-Temperature-And-Wind.
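This sequential scheme can be sketched as follows (my own illustration; `fit` and `predict_interval` stand in for whichever method is being evaluated and are not functions from the repository above):

```python
# Sketch of the sequential (expanding-window) evaluation described above:
# to predict at test time i, observations up to i-1 are added to the
# training sample and the model is refitted.
import numpy as np

def sequential_forecast(y_train, y_test, fit, predict_interval):
    history = list(y_train)
    preds, intervals = [], []
    for i in range(len(y_test)):
        model = fit(np.asarray(history))        # refit on all data seen so far
        pred, lo, hi = predict_interval(model)  # one-step-ahead forecast + interval
        preds.append(pred)
        intervals.append((lo, hi))
        history.append(y_test[i])               # reveal the i-th test value
    return np.asarray(preds), intervals
```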
The study in (Bonas et al. 2025) concludes that for temperature prediction, ARIMA is the second best among the methods compared there, but for wind prediction three neural network approaches turned out to be better in terms of MSE, CRPS, coverage and interval score. Let us now compare these findings with the results of the methods considered here. We start with the best method found here, the NLR-ARIMA approach introduced above in detail.
Weather variables such as temperature and wind speed are complex time series whose inherently non-stationary behavior is difficult to model. Besides seasonal patterns, the question arises whether they have short or long memory. It is argued that naive estimation of autocorrelations may lead to spurious findings, which can be avoided by proper correction of the (conditional) mean component. For the South Bend data under investigation, one can justify the assumption of short memory.
To predict one-day-ahead temperature and wind speed based on daily measurements, various methods from statistics and machine learning can be used, but the performance can be poor when the regressors, the model class, or the hyperparameters are not well selected. Whereas many machine learning methods implicitly learn suitable transformations, this is not the case for time series regressions. However, a small number of standard non-linear transformations and suitable heteroscedasticity-consistent long-run variance estimation may significantly help in finding models with competitive predictive power. This is demonstrated by developing a non-linear ARIMA regression which significantly outperforms the best machine learning method found in Bonas et al. (2025), an echo-state neural network (ESN). Furthermore, light gradient boosting and the classical method of nearest neighbor regression outperform all machine learning methods in Bonas et al. (2025) except the ESN approach for temperature prediction, and are on par with the ESN and the non-linear ARIMA, respectively, when it comes to wind prediction.
One can summarize that for data as studied here, characterized by a relatively small sample size and low dimensionality, it seems difficult to find a machine learning method which can outperform ARIMA regressions relying on a handful of standard non-linearities. When the benefits of typical machine learning methods matter, such as their automatic generation of suitable features and their ability to easily integrate high-dimensional input variables and complex non-linearities, methods such as (LSTM or echo-state) neural networks, nearest neighbor regression, light gradient boosting, and quantile regression forests provide quite effective approaches and may be preferable.