Evaluating Landslide Susceptibility Using Sampling Methodology and Multiple Machine Learning Models

ISPRS Int. J. Geo Inf. | Pub Date: 2023-05-13 | DOI: 10.3390/ijgi12050197

Yingze Song, Degang Yang, Weicheng Wu, Xin Zhang, Jie Zhou, Zhaoxu Tian, Chencan Wang, Yingxu Song


Citations: 4

Abstract

Landslide susceptibility assessment (LSA) based on machine learning is widely used in landslide geological hazard management and research. However, sample imbalance, where landslide samples are far fewer than non-landslide samples, is often overlooked, even though it is one of the important factors affecting the performance of landslide susceptibility models. In this paper, we take the Wanzhou district of Chongqing city as an example; the dataset contains more than 580,000 samples, with a ratio of positive (landslide) to negative (non-landslide) samples of 1:19. We oversample or undersample the imbalanced dataset to balance the classes, and then compare the performance of machine learning models under the different sampling strategies. Three classic machine learning algorithms, logistic regression, random forest, and LightGBM, are used for LSA modeling. The results show that the model trained directly on the imbalanced dataset performs the worst: its recall is extremely low, meaning it identifies very few landslide samples and cannot be applied in practice. Compared with the original dataset, the resampled sample sets yield better predictive performance across the classifiers, reflected in higher AUC and recall. The best model was the random forest trained with oversampling (O_RF), with AUC = 0.932.
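To make the imbalance problem concrete, the following is a minimal sketch in pure Python, not the authors' pipeline. It assumes simple random oversampling and undersampling (the abstract does not say which resampling variants were used) and a toy dataset of 1,000 samples at the same 1:19 ratio. The function names `random_oversample`, `random_undersample`, `recall`, and `accuracy` are illustrative, not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate minority rows (drawn with replacement) until classes match."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    deficit = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    idx = majority_idx + minority_idx + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def random_undersample(X, y, minority=1, seed=0):
    """Keep every minority row and an equally sized random subset of the rest."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    kept = rng.sample(majority_idx, len(minority_idx))
    idx = minority_idx + kept
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives (landslides) that the model identifies."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Fraction of all samples labeled correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data at a 1:19 ratio: 50 landslide (1) and 950 non-landslide (0) samples.
y = [1] * 50 + [0] * 950
X = [[float(i)] for i in range(len(y))]  # placeholder feature vectors

# A degenerate model that always predicts "no landslide" looks accurate
# but finds no landslides at all.
all_negative = [0] * len(y)
baseline_accuracy = accuracy(y, all_negative)
baseline_recall = recall(y, all_negative)

# Both strategies produce a 1:1 class ratio, at very different dataset sizes.
X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
over_counts = Counter(y_over)
under_counts = Counter(y_under)
```

On this toy set the always-negative baseline reaches 95% accuracy with zero recall, which is why recall and AUC, rather than accuracy, are the informative metrics here. A common practice, not stated in the abstract, is to resample only the training split so that evaluation still reflects the original class ratio.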
