Predicting vector distribution in Europe: at what sample size are species distribution models reliable?

IF 2.6 · Zone 2, Agricultural and Forestry Sciences · JCR Q1, VETERINARY SCIENCES

Frontiers in Veterinary Science · Pub Date: 2025-05-29 · eCollection Date: 2025-01-01 · DOI: 10.3389/fvets.2025.1584864

Lianne Mitchel, Guy Hendrickx, Ewan T MacLeod, Cedric Marsboom

Citations: 0

Abstract

Introduction: Species distribution models can predict the spatial distribution of vector-borne diseases by forming associations between known vector distribution and environmental variables. In response to a changing climate and increasing rates of vector-borne diseases in Europe, model predictions for vector distribution can be used to improve surveillance. However, the field lacks standardisation, and there is little consensus on what sample size produces reliable models.

Objective: To determine the optimum sample size for models developed with the Random Forest machine learning algorithm under different sample ratios.

Materials and methods: To overcome limitations with real vector data, a simulated vector with a fully known distribution in 10 test sites across Europe was used to randomly generate samples of different sizes. The test sites accounted for varying habitat suitability and the vector's relative occurrence area. In total, 9,000 Random Forest models were developed with 24 different sample sizes (between 10 and 5,000) and three sample ratios of presence to absence data (50:50, 20:80, and 40:60). Model performance was evaluated using five metrics: percentage correctly classified, sensitivity, specificity, Cohen's Kappa, and Area Under the Curve. The metrics were grouped by sample size and ratio. The optimum sample size was taken as the point at which the 25th percentile of the grouped metrics met the thresholds for excellent performance, defined as 0.605-0.804 for Cohen's Kappa and 0.795-0.894 for the remaining metrics (to three decimal places).
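
The evaluation step described above can be sketched in plain Python. This is an illustrative sketch only, not the authors' code: the function names, the 0.5 classification cutoff, the quartile interpolation method, and the choice to treat the lower bound of each quoted range as the pass mark are all assumptions made for the example. It also assumes every evaluation set contains both presences and absences and that chance agreement is below 1.

```python
from statistics import quantiles

# Lower bounds of the "excellent" ranges quoted in the abstract,
# used here as pass marks (an assumption of this sketch).
THRESHOLDS = {"pcc": 0.795, "sensitivity": 0.795, "specificity": 0.795,
              "kappa": 0.605, "auc": 0.795}


def evaluate(y_true, scores, cutoff=0.5):
    """Five metrics for binary labels (1 = presence) and predicted scores.

    The cutoff of 0.5 is hypothetical; the abstract does not state one.
    """
    y_pred = [1 if s >= cutoff else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    n = tp + tn + fp + fn
    pcc = (tp + tn) / n  # percentage correctly classified, as a proportion
    # Agreement expected by chance, from the marginal totals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (pcc - p_chance) / (1 - p_chance)
    # AUC as the probability that a random presence outscores a random
    # absence, counting ties as one half.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return {"pcc": pcc,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "kappa": kappa,
            "auc": wins / (len(pos) * len(neg))}


def first_quartile(values):
    """25th percentile; the 'inclusive' interpolation is an assumption."""
    return quantiles(values, n=4, method="inclusive")[0]


def optimum_sample_size(results):
    """Smallest sample size whose 25th percentile passes on every metric.

    `results` maps sample size -> list of metric dicts, one per fitted
    model (at least two per size). Returns None if no size qualifies.
    """
    for size in sorted(results):
        runs = results[size]
        if all(first_quartile([r[m] for r in runs]) >= t
               for m, t in THRESHOLDS.items()):
            return size
    return None
```

Using the 25th percentile rather than the mean or median means that three quarters of the models fitted at a given sample size must reach the threshold before that size counts as reliable.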

Results: For balanced sample ratios, the optimum sample size for reliable models fell within the range of 750-1,000. The estimate rose to 1,100-1,300 for unbalanced samples with a 40:60 ratio of presence to absence data. By contrast, unbalanced samples with a 20:80 ratio of presence to absence data did not produce reliable models at any of the sample sizes considered.
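
To make the ratios concrete, the short sketch below converts a total sample size and a presence percentage into counts of presence and absence records. The helper name and the rounding rule are assumptions for illustration; the abstract does not say how fractional counts were handled.

```python
def split_sample(sample_size, presence_pct):
    """Return (presences, absences) for a total size and a presence percentage."""
    presences = round(sample_size * presence_pct / 100)  # rounding is an assumption
    return presences, sample_size - presences


# Sample sizes and ratios quoted in the results, plus the largest size tested.
for size, pct in [(750, 50), (1000, 50), (1100, 40), (1300, 40), (5000, 20)]:
    presences, absences = split_sample(size, pct)
    print(f"n={size}, {pct}:{100 - pct} -> {presences} presences, {absences} absences")
```

This prints 375/375 and 500/500 for the balanced range, 440/660 and 520/780 for the 40:60 range, and 1,000/4,000 for the largest 20:80 sample.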

Conclusion: To our knowledge, this is the first study to use a simulated vector to identify the optimum sample size for Random Forest models at this resolution (≤1 km²) and extent (≥10,000 km²). These results may improve the reliability of model predictions, optimise field sampling, and enhance vector surveillance in response to changing climates. Further research may seek to refine these estimates and confirm transferability to real vectors.

Source journal

Frontiers in Veterinary Science (Veterinary-General Veterinary)

CiteScore: 4.80
Self-citation rate: 9.40%
Articles published: 1870
Review time: 14 weeks

About the journal: Frontiers in Veterinary Science is a global, peer-reviewed, Open Access journal that bridges animal and human health, brings a comparative approach to medical and surgical challenges, and advances innovative biotechnology and therapy. Veterinary research today is interdisciplinary, collaborative, and socially relevant, transforming how we understand and investigate animal health and disease. Fundamental research in emerging infectious diseases, predictive genomics, stem cell therapy, and translational modelling is grounded within the integrative social context of public and environmental health, wildlife conservation, novel biomarkers, societal well-being, and cutting-edge clinical practice and specialization. Frontiers in Veterinary Science brings a 21st-century approach (networked, collaborative, and Open Access) to communicate this progress and innovation to both the specialist and the wider audience of readers in the field. Frontiers in Veterinary Science publishes articles on outstanding discoveries across a wide spectrum of translational, foundational, and clinical research. The journal's mission is to bring all relevant veterinary sciences together on a single platform with the goal of improving animal and human health.