Predicting vector distribution in Europe: at what sample size are species distribution models reliable?

IF 2.6, Zone 2 (Agricultural and Forestry Sciences), JCR Q1 (Veterinary Sciences)
Frontiers in Veterinary Science. Pub Date: 2025-05-29. eCollection Date: 2025-01-01. DOI: 10.3389/fvets.2025.1584864
Lianne Mitchel, Guy Hendrickx, Ewan T MacLeod, Cedric Marsboom

Abstract

Introduction: Species distribution models can predict the spatial distribution of vector-borne diseases by forming associations between known vector distribution and environmental variables. In response to a changing climate and increasing rates of vector-borne diseases in Europe, model predictions for vector distribution can be used to improve surveillance. However, the field lacks standardisation, and there is little consensus on what sample size produces reliable models.

Objective: To determine the optimum sample size for models developed with the Random Forest machine learning algorithm under different sample ratios.

Materials and methods: To overcome limitations with real vector data, a simulated vector with a fully known distribution in 10 test sites across Europe was used to randomly generate samples of different sizes. The test sites accounted for varying habitat suitability and the vector's relative occurrence area. In total, 9,000 Random Forest models were developed with 24 different sample sizes (between 10 and 5,000) and three sample ratios with varying proportions of presence to absence data (50:50, 20:80, and 40:60). Model performance was evaluated using five metrics: percentage correctly classified, sensitivity, specificity, Cohen's Kappa, and Area Under the Curve. The metrics were grouped by sample size and ratio. The optimum sample size was identified as the point at which the 25th percentile met the thresholds for excellent performance, defined as 0.605-0.804 for Cohen's Kappa and 0.795-0.894 for the remaining metrics (to three decimal places).
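
The evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, "met thresholds" is assumed to mean that the 25th percentile reaches the lower bound of the quoted range, and the percentile method (inclusive linear interpolation) is likewise an assumption, since the abstract does not specify one.

```python
from statistics import quantiles

# Lower bounds of the "excellent" ranges quoted in the abstract.
KAPPA_MIN = 0.605   # Cohen's Kappa
OTHER_MIN = 0.795   # PCC, sensitivity, specificity, AUC


def confusion_metrics(tp, fp, tn, fn):
    """Return PCC, sensitivity, specificity and Cohen's Kappa from 2x2 counts.

    Assumes both presences and absences occur in the evaluation data.
    """
    n = tp + fp + tn + fn
    pcc = (tp + tn) / n                  # percentage correctly classified
    sensitivity = tp / (tp + fn)         # presences predicted as presences
    specificity = tn / (tn + fp)         # absences predicted as absences
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (pcc - expected) / (1 - expected)
    return {"pcc": pcc, "sensitivity": sensitivity,
            "specificity": specificity, "kappa": kappa}


def auc(labels, scores):
    """Area Under the ROC Curve: share of presence/absence pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def first_reliable_size(values_by_size, lower_bound):
    """Smallest sample size whose 25th percentile reaches lower_bound, else None.

    values_by_size maps a sample size to the metric values of all models
    built with that size (hypothetical input layout).
    """
    for size in sorted(values_by_size):
        q1 = quantiles(values_by_size[size], n=4, method="inclusive")[0]
        if q1 >= lower_bound:
            return size
    return None
```

In this sketch, `first_reliable_size` would be called once per metric and sample ratio, with `KAPPA_MIN` for Cohen's Kappa and `OTHER_MIN` for the other four metrics. Using the 25th percentile rather than the median means that roughly three quarters of the models at a given sample size must reach the threshold.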

Results: For balanced sample ratios, the optimum sample size for reliable models fell within the range of 750-1,000. Estimates increased to 1,100-1,300 for unbalanced samples with a 40:60 ratio of presence to absence data. By contrast, unbalanced samples with a 20:80 ratio of presence to absence data did not produce reliable models at any of the sample sizes considered.
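
To make these sample sizes concrete for field planning, the short Python sketch below converts a total sample size and a presence:absence ratio into counts. The helper name `split_sample` and the round-half-up rule are assumptions for illustration; the abstract does not say how fractional counts were handled.

```python
def split_sample(total, presence_part, absence_part):
    """Split a total sample size into (presences, absences) for a given ratio."""
    denominator = presence_part + absence_part
    # Integer arithmetic with round-half-up avoids floating-point surprises.
    presences = (2 * total * presence_part + denominator) // (2 * denominator)
    return presences, total - presences


# Bounds of the reported ranges for the 50:50 and 40:60 ratios.
for total, ratio in ((750, (50, 50)), (1000, (50, 50)),
                     (1100, (40, 60)), (1300, (40, 60))):
    print(total, ratio, split_sample(total, *ratio))
```

Under these assumptions, a 40:60 sample of 1,100 corresponds to 440 presence and 660 absence records.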

Conclusion: To our knowledge, this is the first study to use a simulated vector to identify the optimum sample size for Random Forest models at this resolution (≤1 km²) and extent (≥10,000 km²). These results may improve the reliability of model predictions, optimise field sampling, and enhance vector surveillance in response to changing climates. Further research may seek to refine these estimates and confirm transferability to real vectors.

Source journal: Frontiers in Veterinary Science (Veterinary: General Veterinary)
CiteScore: 4.80
Self-citation rate: 9.40%
Articles published: 1870
Review time: 14 weeks
Journal description: Frontiers in Veterinary Science is a global, peer-reviewed, Open Access journal that bridges animal and human health, brings a comparative approach to medical and surgical challenges, and advances innovative biotechnology and therapy. Veterinary research today is interdisciplinary, collaborative, and socially relevant, transforming how we understand and investigate animal health and disease. Fundamental research in emerging infectious diseases, predictive genomics, stem cell therapy, and translational modelling is grounded within the integrative social context of public and environmental health, wildlife conservation, novel biomarkers, societal well-being, and cutting-edge clinical practice and specialization. Frontiers in Veterinary Science brings a 21st-century approach (networked, collaborative, and Open Access) to communicate this progress and innovation to both the specialist and the wider audience of readers in the field. Frontiers in Veterinary Science publishes articles on outstanding discoveries across a wide spectrum of translational, foundational, and clinical research. The journal's mission is to bring all relevant veterinary sciences together on a single platform with the goal of improving animal and human health.