Probabilistic mapping of imbalanced data for groundwater contamination using classification algorithms: Performance and reliability

Impact factor: 4.9; JCR Q2 (Engineering, Environmental)
Yang Qiu, Aiguo Zhou, Hanxiang Xiong, Defang Zhang, Cheng Su, Shizheng Zhou, Lin Go, Chi Yang, Hao Cui, Wei Fan, Yao Yu, Fawang Zhang, Chuanming Ma
DOI: 10.1016/j.gsd.2024.101393
Journal: Groundwater for Sustainable Development, vol. 28, Article 101393
Publication date: 2025-02-01
URL: https://www.sciencedirect.com/science/article/pii/S2352801X24003163
Citation count: 0

Abstract

Probabilistic mapping of groundwater contamination is a crucial foundation for sustainable groundwater management. However, groundwater data are often imbalanced, which makes precise and reliable probability mapping difficult. This study focused on the Jianghan Plain and evaluated the performance and reliability of various sampling and ensemble techniques using a small, imbalanced dataset (n = 246, Class0/Class1 = 0.84/0.16). The probabilistic maps revealed significant spatial variability: high-probability areas were concentrated in the western (Yichang City), eastern (Wuhan), and northern (north bank of the Han River) regions, while low-probability areas lay in the central and southern regions. Over-sampling methods outperformed the others by maintaining class balance and enhancing the reliability of the mapping outcomes. For over-sampling methods, high to very high probability areas covered 15.5% to 18.9% of the study area, and very low to low probability areas were larger (60.5%–66.3%). In contrast, under-sampling and ensemble methods produced larger high to very high probability areas (34.0%–53.1%) and smaller very low to low probability areas (21.6%–46.3%). Over-sampling methods also achieved higher F1 scores (0.27–0.33) and precision (0.375–0.43) than the other methods. SHAP analysis demonstrated that over-sampling methods balance the dataset while preserving information integrity, which enhances the credibility of the mapping results. Conversely, ensemble methods faced challenges in statistical analysis, which hindered interpretability. We strongly recommend that probabilistic mapping of groundwater contamination adequately account for dataset imbalance and not rely solely on metrics such as AUC and OA. For small datasets akin to the one in this study, SMOTE and ADASYN emerge as the recommended sampling methods: they not only yield high-precision mapping results but also ensure interpretability, thereby providing a more reliable basis for sustainable groundwater management.
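The caution about not relying solely on OA can be illustrated with a small worked example. The sketch below is not the authors' code or data. It assumes that OA denotes overall accuracy, that Class1 is the contaminated (minority) class, and that the 0.16 share of 246 samples rounds to 39 positives; all names are hypothetical. It scores a trivial model that labels every sample as Class0.

```python
# Illustrative only: a synthetic label vector with the class ratio reported
# in the abstract (n = 246, about 84% Class0 and 16% Class1).
N_TOTAL = 246
N_POS = 39                 # assumption: round(246 * 0.16)
N_NEG = N_TOTAL - N_POS    # 207 samples in the majority class


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Overall accuracy, precision, recall and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    oa = (tp + tn) / len(y_true)
    # Report 0.0 when a ratio is undefined (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"OA": oa, "precision": precision, "recall": recall, "F1": f1}


y_true = [1] * N_POS + [0] * N_NEG
always_class0 = [0] * N_TOTAL          # trivial model: never predicts Class1

baseline = scores(y_true, always_class0)
for name, value in baseline.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the trivial model reaches an overall accuracy of 207/246 (about 0.84) while its precision, recall and F1 for Class1 are all zero. An accuracy-type score alone therefore says little about how well the minority class is mapped.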

Source journal
Groundwater for Sustainable Development (Social Sciences: Geography, Planning and Development)
CiteScore: 11.50
Self-citation rate: 10.20%
Articles published: 152
About the journal: Groundwater for Sustainable Development is directed to different stakeholders and professionals, including government and non-governmental organizations, international funding agencies, universities, public water institutions, public health and other public/private sector professionals, and other relevant institutions. It is aimed at professionals, academics and students in disciplines such as groundwater and its connection to surface hydrology and environment, soil sciences, engineering, ecology, microbiology, atmospheric sciences, analytical chemistry, hydro-engineering, water technology, environmental ethics, economics, public health, policy, as well as social sciences, legal disciplines, or any other area connected with water issues. The objectives of this journal are to facilitate:
• The improvement of effective and sustainable management of water resources across the globe.
• The improvement of human access to groundwater resources in adequate quantity and good quality.
• The meeting of the increasing demand for drinking and irrigation water needed for food security to contribute to a social and economically sound human development.
• The creation of a global inter- and multidisciplinary platform and forum to improve our understanding of groundwater resources and to advocate their effective and sustainable management and protection against contamination.
• Interdisciplinary information exchange and to stimulate scientific research in the fields of groundwater related sciences and social and health sciences required to achieve the United Nations Millennium Development Goals for sustainable development.