Official Statistics

Model diagnostics

Updated 24 April 2025

Applies to England

In 2023 and 2024, scientists at the UK Centre for Ecology & Hydrology investigated the performance of the model underpinning the 2024 version of the indicator. The results of that work are presented here, alongside the accompanying data file.

We investigated the performance of the model in a variety of ways:

  • General model diagnostics
  • Sensitivity analysis
  • Cross-validation:
    • Predictive accuracy
    • Influence of the degree of pre-smoothing on predictive accuracy

General model diagnostics

General model diagnostics included several model checks. Initially, we visually examined the MCMC trace plots for each parameter, encompassing parameters related to smoothing, growth rates and other model components. The total number of parameters varied depending on the number of knots used in the model. These plots exhibited no discernible trends, drifts or irregular patterns, indicating satisfactory convergence. In addition to visual inspection of the trace plots, we conducted convergence diagnostics using the Gelman-Rubin statistic (Gelman and Rubin, 1992) for the different taxonomic groups. The diagnostic values for all models, across all groups, indicated that convergence was acceptable.
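
For reference, the sketch below computes the (non-split) Gelman-Rubin statistic from a set of chains using numpy. It is a minimal illustration of the formula, not the code used in this analysis.

```python
import numpy as np

def gelman_rubin(chains: np.ndarray) -> float:
    """Gelman-Rubin potential scale reduction factor (R-hat) for one parameter.

    chains: array of shape (m, n) -- m independent MCMC chains of length n.
    """
    _, n = chains.shape
    chain_means = chains.mean(axis=1)
    w = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    b = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * w + b / n       # pooled posterior variance estimate
    return float(np.sqrt(var_hat / w))

# Four well-mixed chains should give R-hat close to 1;
# values above roughly 1.1 are commonly taken to indicate non-convergence.
rng = np.random.default_rng(1)
print(gelman_rubin(rng.normal(size=(4, 2000))))
```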

We thoroughly examined the posterior distributions of the estimated parameters (including the growth rates and smoothing parameters) to evaluate their precision and credible intervals. While the majority of parameters exhibited satisfactory precision and credible intervals, we observed notable variation for certain parameters, particularly when the model was applied to the original data without pre-smoothing. Specifically, some smoothing parameters displayed relatively wide credible intervals (and in some cases, larger estimates) in models involving a large number of knots for certain taxonomic groups. However, despite these uncertainties, our cross-validation analysis, as detailed in the Cross validation section, revealed no apparent issues with predictive accuracy associated with parameters exhibiting large credible intervals or larger estimates.

Sensitivity analysis

Sensitivity analysis was performed to evaluate the robustness of our model to small perturbations in the parameters of the prior distributions. Specifically, we systematically varied the parameters of the prior distributions by ±5% of their initial values and observed the corresponding changes in model outputs, including the estimated parameter means and standard deviations, as well as the values of the multispecies index. The ±5% range was chosen to encompass plausible fluctuations while remaining within a reasonable deviation from the original values. The results suggest that the model demonstrates stable behaviour and is not highly sensitive to these minor perturbations, further strengthening our confidence in the model’s reliability.
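
As a sketch of the procedure, the following loop refits a model with each prior hyperparameter perturbed by ±5% and records the shift in the posterior summaries. The fit_model interface and the prior names are placeholders for illustration, not the indicator code.

```python
def sensitivity_analysis(fit_model, base_priors: dict, rel_change: float = 0.05):
    """Refit with each prior hyperparameter perturbed by +/- rel_change."""
    baseline = fit_model(base_priors)
    shifts = {}
    for name, value in base_priors.items():
        for sign in (-1.0, +1.0):
            perturbed = {**base_priors, name: value * (1.0 + sign * rel_change)}
            summary = fit_model(perturbed)
            # Shift in each posterior summary relative to the baseline fit
            shifts[(name, sign)] = {k: summary[k] - baseline[k] for k in baseline}
    return shifts

# Toy stand-in for the model: the 'posterior mean' depends weakly on the prior
toy_fit = lambda p: {"growth_rate_mean": 0.02 + 0.001 * p["prior_mean"]}
print(sensitivity_analysis(toy_fit, {"prior_mean": 1.0, "prior_sd": 2.0}))
```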

Cross validation

Here we present the results of two cross-validation exercises. The first assesses the predictive accuracy of our model under different smoothing options, to establish whether the outputs are reliable. The second explores the role of pre-smoothing in the model’s performance.

Blocked cross-validation (CV) (Snijders, 1988) provides valuable insight into how well a statistical model predicts unseen data. Its fundamental principle is to divide the data into temporal blocks, with one subset used for “training” the model and another for “testing” its predictions.

To thoroughly assess the model’s predictive abilities across various temporal contexts, we employed four types of blocked cross-validation:

  • 5-fold cross-validation (CV1), which partitions the data into five distinct folds, each comprising 20% of consecutive years
  • Leave-one-year-out (CV2), which entails predicting an entire year at a time
  • 5-year block (CV3), which entails predicting any 5 years, not necessarily consecutive
  • K-year cross-validation (CV4), which partitions the data into blocks of either two or five years

These methods evaluate the model’s ability to generalise over consecutive and non-consecutive time intervals. CV2 is computationally intensive and was therefore applied to a subset of comparisons, and CV3 could not be implemented for vascular plants and freshwater invertebrates due to limited available data. All cross-validation types were applied separately to each taxonomic group, and some of them to the combined dataset.
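
To make the four fold designs concrete, the sketch below generates the held-out year sets for each scheme. The exact fold construction used in the analysis may differ; in particular, CV3 is illustrated here by a random sample of 5-year subsets rather than every possible subset.

```python
import numpy as np

def cv_folds(years, scheme: str, k: int = 5, n_random: int = 20, seed: int = 0):
    """Yield arrays of held-out years for each blocked CV scheme (illustrative)."""
    years = np.asarray(years)
    if scheme == "CV1":        # five folds of ~20% consecutive years
        yield from np.array_split(years, 5)
    elif scheme == "CV2":      # leave one year out
        for y in years:
            yield np.array([y])
    elif scheme == "CV3":      # any 5 years, not necessarily consecutive
        rng = np.random.default_rng(seed)
        for _ in range(n_random):
            yield np.sort(rng.choice(years, size=5, replace=False))
    elif scheme == "CV4":      # consecutive blocks of k years (k = 2 or 5)
        for start in range(0, len(years) - k + 1, k):
            yield years[start:start + k]

for test_years in cv_folds(range(1994, 2024), "CV4", k=5):
    print(test_years)  # train on the remaining years, predict these
```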

Further, we used six metrics to assess the model’s predictive accuracy (see the sketch after this list):

  • RMSE (Root Mean Squared Error), which emphasises error magnitude
  • NRMSE (Normalised Root Mean Squared Error), which normalises the error by the scale of the observed values
  • SI (Scatter Index), which assesses symmetry and spread
  • MAE (Mean Absolute Error), which focuses on absolute differences
  • MAPE (Mean Absolute Percentage Error), which evaluates percentage differences
  • MASE (Mean Absolute Scaled Error), which compares performance against a naive forecast
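
The following sketch computes the six metrics under one common set of conventions; the precise normalisations used in the analysis (for example, whether NRMSE is scaled by the range or the mean of the observations) are assumptions here.

```python
import numpy as np

def prediction_metrics(obs, pred, train_obs):
    """Six accuracy metrics for held-out predictions (illustrative conventions)."""
    obs, pred, train_obs = map(np.asarray, (obs, pred, train_obs))
    err = obs - pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    return {
        "RMSE": rmse,
        "NRMSE": rmse / (obs.max() - obs.min()),   # scaled by observed range
        "SI": rmse / obs.mean(),                   # scatter index: RMSE over mean
        "MAE": mae,
        "MAPE": 100 * np.mean(np.abs(err / obs)),  # unstable when obs is near zero
        "MASE": mae / np.mean(np.abs(np.diff(train_obs))),  # vs. naive forecast
    }

obs, pred = [1.0, 1.2, 0.9, 1.1], [1.05, 1.1, 0.95, 1.0]
print(prediction_metrics(obs, pred, train_obs=[1.0, 1.1, 0.95, 1.2, 1.05]))
```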

We present the results using two approaches. First, the results are shown as percentages, indicating the improvement of our model compared to an intercept-only model; higher percentages denote better performance. Second, we provide the evaluation metric values themselves (such as RMSE) to offer a straightforward measure of model performance across taxonomic groups. Percentages alone may not provide a comprehensive picture: high percentages might not signify good outcomes if the absolute metric estimates are extremely high, and a minimal percentage improvement might be misleading if the metric estimates are close to zero, suggesting that the intercept-only model also performed well. On the other hand, percentages are more comparable between groups than raw metric values. Presenting the percentage-based analysis alongside direct metric estimates therefore ensures a more thorough assessment of model performance.
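
For concreteness, a plausible reading of the improvement percentage is the relative reduction in an error metric compared with the intercept-only benchmark; the exact formula used in the analysis is an assumption here.

```python
def improvement_pct(metric_model: float, metric_intercept: float) -> float:
    """Relative reduction in an error metric vs. the intercept-only model."""
    return 100.0 * (1.0 - metric_model / metric_intercept)

print(improvement_pct(0.05, 0.40))  # 87.5 -> "very good" under the bands below
```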

Categorising metric values presents a challenge because they range from 0 to infinity. Here, we classify values below 0.1 as indicative of highly accurate predictions, values below 0.25 as moderately accurate, values below 0.5 as accurate, and values above 0.5 as less accurate. Categorising percentages can also be subjective and may vary depending on the specific context and goals. Here, we classify improvement percentages as follows: greater than 90% is considered nearly perfect performance, above 75% very good, greater than 50% good, between 25% and 50% moderate, and below 25% limited or low improvement. Additionally, where the model performs worse than an intercept-only model and the improvement percentage is zero (or negative), performance is considered the worst (very bad).
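
These bands translate directly into code, as in the sketch below; the treatment of values falling exactly on a threshold is an assumption, as the text specifies only open bounds.

```python
def classify_metric(value: float) -> str:
    """Band an absolute metric value using the thresholds above."""
    if value < 0.1:
        return "highly accurate"
    if value < 0.25:
        return "moderately accurate"
    if value < 0.5:
        return "accurate"
    return "less accurate"

def classify_improvement(pct: float) -> str:
    """Band a percentage improvement over the intercept-only model."""
    if pct <= 0:
        return "worst (very bad)"
    if pct < 25:
        return "limited/low"
    if pct <= 50:
        return "moderate"
    if pct <= 75:
        return "good"
    if pct <= 90:
        return "very good"
    return "nearly perfect"
```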

To evaluate predictive performance for individual species groups and the combined dataset, our model was compared to an intercept-only model, in which the multispecies index is constrained to hold a constant value over time.

Part 1. Predictive accuracy

We assessed the model’s predictive capability under two different smoothing scenarios: 3 and 17 knots. The model with 3 knots is tailored to capture the general trends in the data, offering a broad overview of patterns while potentially overlooking finer details. Conversely, the model incorporating 17 knots is configured to provide a more intricate representation of the patterns inherent in the dataset. We compare our model’s percentage improvement over an intercept-only model across the different cross-validation types (based on the original, unsmoothed data).
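
As an illustration of how the knot count controls flexibility, the sketch below builds a cubic regression-spline basis with evenly spaced interior knots: more knots mean more basis columns and hence finer detail. The knot placement rule and basis type are assumptions for illustration, not the indicator model’s actual spline construction.

```python
import numpy as np

def spline_basis(x, n_knots: int):
    """Cubic truncated-power basis with n_knots evenly spaced interior knots."""
    knots = np.linspace(x.min(), x.max(), n_knots + 2)[1:-1]  # interior knots only
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

years = np.arange(1994.0, 2024.0)
X3, X17 = spline_basis(years, 3), spline_basis(years, 17)
print(X3.shape, X17.shape)  # (30, 7) vs (30, 21): 17 knots allow finer wiggles
```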

Results for this exercise can be found in the published data file. Percentages are shown for the different metrics, with each group’s performance depicted individually. These results are given for both numbers of knots, demonstrating that the percentages are very similar for 3 and 17 knots, except for moths (CV1), where our model performed worse than the intercept-only model with 17 knots but showed good performance with 3 knots. Overall, our method outperforms the intercept-only model in most cases, except for moths and butterflies. Performance is consistently good (>50%) or better (>75%) for the other groups, demonstrating substantial improvement relative to the intercept-only model, with similar results across the different metrics, except for MAPE, which could not be calculated for some combinations (not a cause for concern, given its sensitivity to small values).

In particular, the model demonstrated almost perfect predictions for mammals, very good for birds and invertebrates, good for plants and fish (with better results for CV2 and CV3), as well as for moths under certain cross-validation types (CV2 and CV3), and very bad performance for butterflies (CV1) or limited/low improvement (CV2, CV3). This suggests that, for both moths and butterflies, the model predicts better than the intercept-only model over one year or any 5 years, but not over 20% of consecutive years. The influence of the moths group on the overall combined results led to slightly worse percentages for 17 knots compared with 3 knots.

Part 2. Influence of the degree of pre-smoothing on predictive accuracy

We conducted the blocked cross-validation twice: once with the original species trends and a second time with “pre-smoothed” trends. We also compared results from pre-smoothing on the log scale and on the measurement scale with one another. Note that pre-smoothing on the measurement scale created 38 species:year combinations (out of nearly 38,000) in which the smoothed abundance estimate was negative: for these we used the unsmoothed value instead.

The pre-smoothing of species trends was applied to all species equally, using the rule of thumb of 0.3 degrees of freedom per year. Although this rule of thumb has been applied for more than two decades (Fewster et al., 2000), it is reasonable to ask whether different levels of pre-smoothing would be more appropriate.

To explore how the degree of data pre-smoothing affects the predictive accuracy of our model, we conducted an additional cross-validation analysis. This analysis focused mainly on four groups: butterflies, fish, invertebrates and moths. However, it is worth noting that other groups were also considered, albeit for only some settings (i.e. specific degrees of pre-smoothing). We included butterflies and moths because of their high degree of year-to-year variability.

We applied a thin-plate spline smoothing technique for data pre-smoothing, adjusting the degrees of freedom to various fractions (0.2, 0.25, 0.3, and 0.35) of the number of years. This allowed us to generate datasets with differing levels of smoothing, ranging from highly smoothed (0.2 fraction) to less pre-smoothed data (0.35 fraction). While we primarily pre-smoothed data on the measurement scale, we also conducted analyses using data pre-smoothed on the log scale for comparative purposes.
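
A minimal sketch of df-controlled pre-smoothing for a single species trend follows, including the fallback to unsmoothed values where the smoothed abundance goes negative, as described above. It approximates the smoother with a least-squares cubic regression spline whose basis dimension is round(fraction × number of years); the analysis itself used thin-plate spline smoothing, which is not reproduced exactly here.

```python
import numpy as np

def basis_with_df(x, df: int):
    """Cubic truncated-power basis with df columns (df - 4 interior knots)."""
    knots = np.linspace(x.min(), x.max(), df - 2)[1:-1] if df > 4 else []
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def presmooth(years, abundance, df_fraction: float = 0.3):
    """Pre-smooth one species trend with df ~= df_fraction * n_years."""
    x = np.asarray(years, dtype=float)
    y = np.asarray(abundance, dtype=float)
    df = max(4, round(df_fraction * len(x)))  # fractions 0.2 to 0.35 in the text
    X = basis_with_df(x, df)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    smoothed = X @ beta
    # Fall back to the unsmoothed value where smoothing gives a negative abundance
    return np.where(smoothed < 0, y, smoothed)

years = np.arange(1994, 2024)
rng = np.random.default_rng(0)
print(presmooth(years, rng.lognormal(0.0, 0.5, size=years.size)).round(2))
```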

We chose CV1 and CV3 for their efficiency compared with the time-consuming CV2, conducting them with only one degrees-of-freedom fraction (0.3). We explored CV4 fully, with all fractions (0.2, 0.25, 0.3 and 0.35). A 5-year cross-validation scheme could not be applied to invertebrates because the available data span only 7 years.

Given the consistent findings across multiple metrics in previous results, we focussed on two primary metrics, Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), for CV4.

In addition to investigating the model’s performance with 3 and 17 knots, we also evaluated its performance with 10 knots for CV4, aiming to strike a balance between capturing general trends and retaining finer details.

Results from this exercise are presented in the published data file. Using pre-smoothed data generally improved model performance across all considered taxonomic groups (and the combined dataset) under all considered cross-validation types, particularly benefiting variable groups such as butterflies and moths. For less variable groups the enhancement from pre-smoothed data may be less apparent (especially where the results already showed nearly perfect or very good performance), but notable improvements were still observed; for example, for the fish data the improvement shifted from “good” to “very good” performance (CV1 and CV3; DF = 0.3). It is worth noting that while the predictive performance for butterflies remained limited/low (approximately 20%) compared with the intercept-only model in CV1, there was a substantial improvement for the butterfly group in CV3, shifting from “limited” (less than 25%) to “very good” (greater than 75%) (DF = 0.3). The predictive performance for moths in CV1 (17 knots) also improved substantially, shifting from poor performance (worse than the intercept-only model) to good performance (greater than 50%) (DF = 0.3).

The metric estimates also reveal better model performance with pre-smoothed data, notably improving predictions for fish and invertebrates to “highly accurate” levels. Pre-smoothing substantially improved performance across metrics for both non-problematic and problematic groups. Specifically, CV3 with pre-smoothed data demonstrated “highly accurate” predictions for all groups, including the problematic ones. While metric values were similar for 3 and 17 knots, the 3-knot model outperformed the 17-knot model in certain circumstances; this was especially notable for butterflies and moths when CV1 was applied to the original data, but much less pronounced when it was applied to the pre-smoothed data. Additionally, while MAPE could not be estimated for most of the original data, it could be calculated from the pre-smoothed data.

The results from CV1 and CV4 were similar, leading to identical conclusions. Owing to variation in the number of years of data available for different groups, CV1 results for some groups were closer to those of CV4 with 5-year blocks, while those for other groups were closer to CV4 with 2-year blocks. For non-problematic groups, the differences between CV1 and CV4 were negligible. However, for highly variable groups such as moths and butterflies, CV4 with different block lengths provided additional insight into model performance. The results from CV4 are therefore discussed here, focusing on how different degrees of pre-smoothing influence the outcomes and noting differences from CV1 where they occur.

Notably, both RMSE and MAE for CV4 exhibited similar values and patterns across different degrees of data pre-smoothing and numbers of knots, reinforcing the robustness of our findings.

In certain cases, an intercept-only model exhibited a superior fit for data subjected to various degrees of pre-smoothing in CV4. Consequently, in describing metric values our focus is not on comparing different pre-smoothing options (addressed later through percentage comparisons), but rather on evaluating predictive performance for different numbers of knots. The 5-year cross-validation scheme for fish with 17 knots yielded highly unstable results and is therefore not included; this instability may be attributable to the small number of observations coupled with the model’s complexity.

Most models yielded nearly identical metric values across different numbers of knots and pre-smoothing levels, with negligible differences of 0.01 to 0.02 in some cases (CV4). A notable deviation occurred in the freshwater invertebrates group, favouring smaller knot numbers (3 or 10, rather than 17), although this too was non-significant according to a Wilcoxon signed-rank test (p greater than 0.05). These findings underscore the challenge of determining the optimal knot number from these results alone, suggesting the need for alternative approaches.
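
For reference, such a comparison can be run with scipy’s paired Wilcoxon signed-rank test on per-fold metric values; the numbers below are illustrative, not the published results.

```python
from scipy.stats import wilcoxon

# Paired RMSE values for the same CV folds under two knot settings (illustrative)
rmse_3_knots = [0.082, 0.105, 0.093, 0.121, 0.099]
rmse_17_knots = [0.094, 0.118, 0.092, 0.139, 0.110]

stat, p = wilcoxon(rmse_3_knots, rmse_17_knots)
print(f"W = {stat:.1f}, p = {p:.3f}")  # p > 0.05: difference not significant
```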

We observe that most results demonstrate accurate to highly accurate predictions, even for traditionally variable groups such as butterflies and moths. While predicting five consecutive years resulted in larger absolute metric values for these groups, they still remained within the range of accurate predictions. Additionally, moderately accurate predictions were achieved for data with higher levels of pre-smoothing. Overall, increased pre-smoothing led to decreased absolute values across most groups, indicating improved model performance. However, for freshwater invertebrates, highly pre-smoothed data resulted in the worst model performance, suggesting the need for individualised selection of smoothing degrees for each group.

The results show that, overall, higher levels of pre-smoothing (degrees of freedom = 0.2 times the number of years) lead to better percentage improvement in both RMSE and MAE metrics when predicting either two or five years, except for invertebrates. Invertebrates demonstrated better results (96% to 97%) with a lesser degree of pre-smoothing applied to the data, while the worst outcomes (71% to 76%) were observed with a higher degree of pre-smoothing.

Both fish and moths achieved nearly perfect predictions when forecasting two years across all levels of pre-smoothing, and at least very good predictions when predicting five years (and at least good with CV1). Conversely, invertebrates showed only good performance with highly pre-smoothed data, but nearly perfect performance across the other pre-smoothing degrees. The most pronounced variation was found in butterflies, where pre-smoothing greatly affected performance, ranging from moderate (RMSE: 34% to 37%; 5-year CV4) to good (RMSE: 71%; 5-year CV4) or nearly perfect (RMSE: 90%; 2-year CV4), depending on the degree of pre-smoothing and the prediction span.

Similar results were obtained when the model was applied to data pre-smoothed on a log-scale, leading to the same conclusion.

Conclusions

The cross-validation reveals key insights into our method’s performance, informing its strengths and limitations and guiding potential optimisations.

Our model consistently outperformed the intercept-only model in predictive ability. We identified butterflies and moths as more difficult groups to model, for which predictive performance sometimes lags behind, or shows only limited improvement over, the intercept-only model. This confirms the ability of cross-validation to diagnose problems with our models, giving confidence in the approach.

Determining the optimal number of knots to smooth the final model remains challenging due to similar performance across different knot numbers for all cross-validation types, requiring alternative methods or expert input. Our findings also reveal that models with varying degrees of smoothing performed differently across different taxonomic groups.

Pre-smoothing species data notably enhances model performance across all groups, particularly benefiting the more variable ones. While some groups performed better with greater levels of pre-smoothing, others showed better results with less. Our investigation showed that the degree of pre-smoothing is particularly crucial for highly variable groups such as butterflies, indicating sensitivity to this parameter. Conversely, for less variable groups and moths, the degree of pre-smoothing appears to exert a less pronounced effect on model performance. Consequently, selecting the appropriate degree of pre-smoothing requires careful consideration. One approach could be to select the degree of smoothing separately for each group based on its performance metrics; alternatively, we could opt for the values that yielded the best percentage improvement across all degrees of freedom for smoothing and all knot numbers.

However, given that our pre-smoothing analysis focused on only four groups, uncertainties remain regarding the optimal degrees of freedom for broader applicability, necessitating further research and validation. Additionally, our analysis reveals that the choice between pre-smoothing on the log or the original scale does not materially affect the results, suggesting that the model’s performance remains consistent regardless of the pre-smoothing scale employed.

Our two cross-validation exercises provide valuable insight into the performance of our method and shed light on nuances surrounding the degree of data pre-smoothing. The results demonstrate that the method has good statistical properties, in both absolute and relative terms. Moreover, the cross-validation was able to detect poor performance among groups that were known to be problematic, and confirmed that performance was substantially improved by pre-smoothing. Overall, this provides confidence that the method is sound and fit for the purpose of producing multispecies indicators.

References

  • Gelman, A. and Rubin, D.B. (1992). Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 7(4), 457–472. https://doi.org/10.1214/ss/1177011136
  • Snijders, T.A.B. (1988). On Cross-Validation for Predictor Evaluation in Time Series. In: On Model Uncertainty and its Statistical Implications, pp. 56–69. Springer.
  • Fewster, R.M., Buckland, S.T., Siriwardena, G.M., Baillie, S.R. and Wilson, J.D. (2000). Analysis of Population Trends for Farmland Birds Using Generalized Additive Models. Ecology 81, 1970–1984.