Yuanhua Liu, Jun Zhang, Michael P Ward, Wei Tu, Lili Yu, Jin Shi, Yi Hu, Fenghua Gao, Zhiguo Cao, Zhijie Zhang
Impacts of sample ratio and size on the performance of random forest model to predict the potential distribution of snail habitats
Geospatial Health, 18(2). Published 2023-07-03. DOI: 10.4081/gh.2023.1151 (https://doi.org/10.4081/gh.2023.1151)
Citations: 0
Abstract
Few studies have considered how sample size and the ratio of presence to absence points affect the results of random forest (RF) models. We applied RF to predict the spatial distribution of snail habitats using a total of 15,000 sample points (5,000 presence samples and 10,000 control points). RF models were built using seven different sample ratios (1:1, 1:2, 1:3, 1:4, 2:1, 3:1, and 4:1), and the optimal ratio was identified via the area under the curve (AUC) statistic. The impact of sample size was then compared using RF models under the optimal ratio and the optimal sample size. When the sample size was small, the ratios 1:1, 1:2 and 1:3 performed significantly better than 4:1 and 3:1 at all four sample-size levels (p<0.01), with no significant difference among 1:1, 1:2 and 1:3 (p>0.05). For relatively large sample sizes, the 1:2 ratio appeared to be optimal, with the lowest quartile deviation. In addition, increasing the sample size produced a higher AUC and a smaller slope; the most suitable sample size found in this study was 2,400 (AUC=0.96). This study offers a practical approach to selecting an appropriate sample size and sample ratio for ecological niche modelling (ENM), and provides a scientific basis for selecting samples to accurately identify and predict snail habitat distributions.
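As a rough illustration of the sampling design described in the abstract, the sketch below draws a subsample of a given total size at a given ratio and computes AUC from model scores. It is not the authors' code. The function names `draw_sample` and `auc` are hypothetical, the ratio is assumed to be read as presence:absence (the abstract does not state the order), and the RF fitting step is omitted, so any classifier that outputs scores could be plugged in.

```python
import random


def draw_sample(presence, absence, ratio, total, seed=0):
    """Draw `total` points split between presence and absence by `ratio`.

    `ratio` is a (presence_part, absence_part) tuple, e.g. (1, 2).
    Assumption: ratios are read as presence:absence.
    """
    p_part, a_part = ratio
    n_presence = round(total * p_part / (p_part + a_part))
    n_absence = total - n_presence
    if n_presence > len(presence) or n_absence > len(absence):
        raise ValueError("pool too small for the requested ratio and size")
    rng = random.Random(seed)  # fixed seed so repeated draws are reproducible
    return rng.sample(presence, n_presence), rng.sample(absence, n_absence)


def auc(pos_scores, neg_scores):
    """AUC as the fraction of presence/absence pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic
    in the number of points, which is fine for a small illustration.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Pools sized as in the abstract: 5,000 presence and 10,000 control points.
# The integers are placeholder point IDs, not real data.
presence_pool = list(range(5000))
absence_pool = list(range(5000, 15000))

# A 1:2 draw of 2,400 points yields 800 presence and 1,600 absence points.
pres, absn = draw_sample(presence_pool, absence_pool, (1, 2), 2400)
```

Repeating the draw with different seeds for each ratio and size, fitting a model on each draw, and collecting the resulting AUC values would give the kind of distribution from which quartile deviations can be compared.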
About the journal:
The focus of the journal is on all aspects of the application of geographical information systems, remote sensing, global positioning systems, spatial statistics and other geospatial tools in human and veterinary health. The journal publishes two issues per year.