Lianne Mitchel, Guy Hendrickx, Ewan T MacLeod, Cedric Marsboom
Predicting vector distribution in Europe: at what sample size are species distribution models reliable?
Frontiers in Veterinary Science, volume 12, article 1584864. Published 2025-05-29. DOI: https://doi.org/10.3389/fvets.2025.1584864
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12159067/pdf/
Abstract
Introduction: Species distribution models can predict the spatial distribution of vector-borne diseases by forming associations between known vector distribution and environmental variables. In response to a changing climate and increasing rates of vector-borne diseases in Europe, model predictions for vector distribution can be used to improve surveillance. However, the field lacks standardisation, and there is little consensus on what sample size produces reliable models.
Objective: To determine the optimum sample size for models developed with the machine learning algorithm Random Forest, using different ratios of presence to absence data.
Materials and methods: To overcome limitations with real vector data, a simulated vector with a fully known distribution in 10 test sites across Europe was used to randomly generate samples of different sizes. The test sites accounted for varying habitat suitability and the vector's relative occurrence area. In total, 9,000 Random Forest models were developed with 24 different sample sizes (between 10 and 5,000) and three sample ratios of presence to absence data (50:50, 20:80, and 40:60). Model performance was evaluated using five metrics: percentage correctly classified, sensitivity, specificity, Cohen's Kappa, and Area Under the Curve. The metrics were grouped by sample size and ratio. The optimum sample size was taken as the size at which the 25th percentile of the grouped metrics met the thresholds for excellent performance, defined as 0.605-0.804 for Cohen's Kappa and 0.795-0.894 for the remaining metrics (to three decimal places).
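The evaluation step described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the authors' code: the function names, the 0.5 classification cutoff, the "inclusive" quantile method, and the reading of "met thresholds" as "the 25th percentile reaches the lower bound of the excellent band" are all assumptions made for the example.

```python
from statistics import quantiles

# Lower bounds of the "excellent" bands quoted in the abstract.
KAPPA_MIN = 0.605   # Cohen's Kappa
OTHER_MIN = 0.795   # all other metrics


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels: 1 = presence, 0 = absence."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def auc(y_true, scores):
    """Rank-based AUC: chance a presence outscores an absence (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, scores, cutoff=0.5):
    """Five metrics for one model; assumes both classes occur in y_true.

    The cutoff of 0.5 is an assumption, not a value taken from the paper.
    """
    y_pred = [1 if s >= cutoff else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    pcc = (tp + tn) / n
    # Agreement expected by chance, computed from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (pcc - expected) / (1 - expected) if expected < 1 else 0.0
    return {
        "pcc": pcc,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": kappa,
        "auc": auc(y_true, scores),
    }


def meets_criterion(metric_rows):
    """True if the 25th percentile of every metric reaches its lower bound.

    metric_rows: list of evaluate() results (at least two) for one
    combination of sample size and sample ratio.
    """
    for name in ("pcc", "sensitivity", "specificity", "kappa", "auc"):
        values = [row[name] for row in metric_rows]
        q25 = quantiles(values, n=4, method="inclusive")[0]
        floor = KAPPA_MIN if name == "kappa" else OTHER_MIN
        if q25 < floor:
            return False
    return True
```

Under these assumptions, the optimum sample size for a given ratio would be the smallest size whose group of models passes `meets_criterion`. Using the 25th percentile rather than the median means that roughly three quarters of the models in the group must reach the threshold, which is a stricter notion of reliability.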
Results: For balanced sample ratios, the optimum sample size for reliable models fell within the range of 750-1,000. Estimates increased to 1,100-1,300 for unbalanced samples with a 40:60 ratio of presence to absence data. By contrast, unbalanced samples with a 20:80 ratio of presence to absence data did not produce reliable models at any of the sample sizes considered.
Conclusion: To our knowledge, this is the first study to use a simulated vector to identify the optimum sample size for Random Forest models at this resolution (≤1 km²) and extent (≥10,000 km²). These results may improve the reliability of model predictions, optimise field sampling, and enhance vector surveillance in response to changing climates. Further research may seek to refine these estimates and confirm transferability to real vectors.
About the journal:
Frontiers in Veterinary Science is a global, peer-reviewed, Open Access journal that bridges animal and human health, brings a comparative approach to medical and surgical challenges, and advances innovative biotechnology and therapy.
Veterinary research today is interdisciplinary, collaborative, and socially relevant, transforming how we understand and investigate animal health and disease. Fundamental research in emerging infectious diseases, predictive genomics, stem cell therapy, and translational modelling is grounded within the integrative social context of public and environmental health, wildlife conservation, novel biomarkers, societal well-being, and cutting-edge clinical practice and specialization. Frontiers in Veterinary Science brings a 21st-century approach—networked, collaborative, and Open Access—to communicate this progress and innovation to both the specialist and to the wider audience of readers in the field.
Frontiers in Veterinary Science publishes articles on outstanding discoveries across a wide spectrum of translational, foundational, and clinical research. The journal's mission is to bring all relevant veterinary sciences together on a single platform with the goal of improving animal and human health.