Impact of sampling for landslide susceptibility assessment using interpretable machine learning models
Bin Wu, Zhenming Shi, Hongchao Zheng, Ming Peng, Shaoqiang Meng
Bulletin of Engineering Geology and the Environment, vol. 83, no. 11. Published 2024-10-25. DOI: 10.1007/s10064-024-03980-8
Impact factor: 3.7. JCR: Q3 (Engineering, Environmental).
https://link.springer.com/article/10.1007/s10064-024-03980-8
Citation count: 0
Abstract
Landslide susceptibility assessment has made significant strides in meeting the urgent requirements for disaster prevention and mitigation. However, the inherent imbalance in landslide distributions poses challenges, and various sampling strategies have emerged in response. These strategies alter the original dataset distribution, so their impact on susceptibility mapping needs to be better understood. This study integrates multi-source information, including morphological, geological, hydrological, and land-use data from northwestern Oregon, to train four models (Decision Trees, Random Forest, AdaBoost, and Gradient Tree Boosting) on both balanced and imbalanced training sets. Results reveal that models trained on imbalanced datasets generally exhibit superior classification performance. Models trained on balanced datasets predict more positives (landslides) at higher susceptibility levels, while those trained on imbalanced datasets classify more negatives at lower levels. The Shapley Additive Explanations (SHAP) method was used to establish the consistency of model decision-making and to identify the top five most influential factors: distance to roads, slope roughness, geological age, roughness, and elevation. The consequences of false negatives (FN) and false positives (FP) are also discussed: FN may lead to loss of life, whereas FP may result from prediction inaccuracies, dataset incompleteness, or forthcoming landslides, so a certain number of FP can be tolerated. The study suggests that models trained on balanced datasets are preferable for minimizing FN and for effectively capturing landslides in high and very high susceptibility areas. The findings provide insight into how the ratio of positives to negatives affects landslide susceptibility assessment and offer support for optimizing dataset sampling.
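To make the balanced-versus-imbalanced distinction concrete, the sketch below shows one common way to build a balanced training set (random undersampling of negatives) and how FN and FP are counted. It is an illustration only: the abstract does not state which sampling method the authors used, and the function names `undersample_negatives` and `confusion_counts`, the `ratio` parameter, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def undersample_negatives(samples, labels, ratio=1.0, seed=0):
    """Randomly drop negatives (label 0) so that negatives ~= ratio * positives.

    Assumption: random undersampling is used here only as an example of a
    sampling strategy; it is not claimed to be the paper's method.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def confusion_counts(y_true, y_pred):
    """Return TP, FP, FN, TN counts for binary labels (1 = landslide)."""
    c = Counter(zip(y_true, y_pred))
    return {"TP": c[(1, 1)], "FP": c[(0, 1)], "FN": c[(1, 0)], "TN": c[(0, 0)]}


# Hypothetical toy data: 5 landslide cells among 100 grid cells.
labels = [1] * 5 + [0] * 95
samples = list(range(100))  # stand-ins for real feature vectors

# Balanced set: all 5 positives plus 5 randomly chosen negatives.
bal_x, bal_y = undersample_negatives(samples, labels, ratio=1.0)

# A trivial classifier that never predicts a landslide scores high accuracy
# on the imbalanced data while missing every landslide (all positives are FN).
all_negative = [0] * len(labels)
counts = confusion_counts(labels, all_negative)
accuracy = (counts["TP"] + counts["TN"]) / len(labels)
print(counts, accuracy)
```

On this toy data the never-predict-landslide classifier reaches 95% accuracy with five false negatives, which is why overall classification performance and the FN count can point in different directions when choosing a sampling ratio.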
About the journal:
Engineering geology is defined in the statutes of the IAEG as the science devoted to the investigation, study and solution of engineering and environmental problems which may arise as the result of the interaction between geology and the works or activities of man, as well as of the prediction of and development of measures for the prevention or remediation of geological hazards. Engineering geology embraces:
• the applications/implications of the geomorphology, structural geology, and hydrogeological conditions of geological formations;
• the characterisation of the mineralogical, physico-geomechanical, chemical and hydraulic properties of all earth materials involved in construction, resource recovery and environmental change;
• the assessment of the mechanical and hydrological behaviour of soil and rock masses;
• the prediction of changes to the above properties with time;
• the determination of the parameters to be considered in the stability analysis of engineering works and earth masses.