Yanwen Wang, Mahdi Khodadadzadeh, Raúl Zurita-Milla
Ecological Informatics, volume 90, Article 103287. Published 2025-06-28. DOI: 10.1016/j.ecoinf.2025.103287. Source: https://www.sciencedirect.com/science/article/pii/S1574954125002961
A dissimilarity-adaptive cross-validation method for evaluating geospatial machine learning predictions with clustered samples
Spatially clustered samples are prevalent in geospatial machine learning (ML) predictions, especially in ecological mapping. Because densely sampled regions in the prediction area are overrepresented, the data distribution of the samples differs from that of the predictions, which poses a noticeable challenge for evaluating geospatial ML predictions. Neither random nor spatial cross-validation (CV) methods can consistently yield accurate evaluations: random CV overestimates prediction performance when clustering is high, while spatial CV underestimates it when clustering is low. To tackle this challenge, we propose a novel “adaptive” evaluation method called dissimilarity-adaptive cross-validation (DA-CV), which is based on the data feature space. DA-CV categorizes the prediction locations into “similar” and “different” groups according to the dissimilarity between their covariates and those of the sampled locations. DA-CV applies random CV to evaluate “similar” locations and spatial CV to evaluate “different” ones. The final evaluation metric is obtained through a weighted average of the two. To test DA-CV, we conducted a series of experiments on synthetic species abundance and real aboveground biomass datasets in which the degree of clustering was gradually changed, and we compared DA-CV with current CV methods (RDM-CV, SP-CV, and kNNDM). Results showed that DA-CV provided the most accurate evaluations in 85% of scenarios. DA-CV effectively overcomes the common limitations of random and spatial CV methods, such as considering only part of the predictions in the evaluation. This means that DA-CV can provide accurate evaluations in most situations involving clustered samples. The success of DA-CV confirms that considering feature space information is an effective way to improve the evaluation of geospatial ML predictions.
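The grouping and weighting steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the dissimilarity measure, the threshold, or the weights, so everything below is an assumption made for illustration only: dissimilarity is taken as the Euclidean distance to the nearest sample in standardized covariate space, `threshold` is a hypothetical cut-off, and the weights are the shares of prediction locations in each group. The two CV metrics are passed in as already-computed numbers (for example, RMSE from a random CV run and from a spatial CV run).

```python
import numpy as np


def nearest_sample_distance(pred_cov, sample_cov):
    """Distance from each prediction location to its nearest sample,
    measured in covariate space standardized by the sample statistics.
    (Assumed dissimilarity measure; not taken from the paper.)"""
    pred_cov = np.asarray(pred_cov, dtype=float)
    sample_cov = np.asarray(sample_cov, dtype=float)
    mean = sample_cov.mean(axis=0)
    std = sample_cov.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant covariates
    pred_std = (pred_cov - mean) / std
    sample_std = (sample_cov - mean) / std
    # Pairwise Euclidean distances: shape (n_predictions, n_samples)
    diffs = pred_std[:, None, :] - sample_std[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return distances.min(axis=1)


def da_cv_metric(pred_cov, sample_cov, random_cv_metric,
                 spatial_cv_metric, threshold):
    """Weighted average of a random-CV metric and a spatial-CV metric.
    Locations within `threshold` (hypothetical) of a sample count as
    "similar"; the rest count as "different". Weights are the group
    shares of prediction locations (an assumption)."""
    dissimilarity = nearest_sample_distance(pred_cov, sample_cov)
    similar = dissimilarity <= threshold
    w_similar = float(similar.mean())
    w_different = 1.0 - w_similar
    combined = w_similar * random_cv_metric + w_different * spatial_cv_metric
    return combined, w_similar


if __name__ == "__main__":
    samples = [[0.0, 0.0], [1.0, 1.0]]
    predictions = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.1]]
    metric, share = da_cv_metric(predictions, samples,
                                 random_cv_metric=2.0,
                                 spatial_cv_metric=4.0,
                                 threshold=1.0)
    print(metric, share)
```

In this toy setup three of the four prediction locations fall within the threshold, so the random-CV metric receives a weight of 0.75 and the spatial-CV metric a weight of 0.25. When all prediction locations resemble the samples, the sketch reduces to the random-CV metric; when none do, it reduces to the spatial-CV metric.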
About the journal:
The journal Ecological Informatics is devoted to the publication of high quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data as well as the critical need for informing sustainable management in view of global environmental and climate change.
The nature of the journal is interdisciplinary at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, modelling and prediction of ecological data.