Groundwater health probability risk prediction through oral intake using advanced optimization methods
Fahad Jibrin Abdu, Sani I. Abba, Jamilu Usman, Maad Alowaifeer, Isam H. Aljundi
Journal of Contaminant Hydrology, vol. 274, Article 104670. Published 2025-07-07. DOI: 10.1016/j.jconhyd.2025.104670
https://www.sciencedirect.com/science/article/pii/S0169772225001755
Publication type: Journal Article. Open access: no. Impact factor: 4.4. JCR: Q2 (Environmental Sciences). Subject category: Environmental Science and Ecology (region 3).
Citation count: 0
Abstract
Examining the cancer risk associated with oral groundwater (GW) intake is crucial, particularly in regions heavily reliant on GW for human consumption and agriculture. The study was based on real field investigations and controlled laboratory experiments. We integrated real experimental data with generative AI-driven synthetic data to construct a comprehensive dataset, then compared the predictive efficiency of both data sources. We also evaluated the reliability of generative AI in generating scientific data, providing critical insights into its applicability for enhancing experimental analysis. The study further evaluates standalone models, including Artificial Neural Networks (ANN), Gaussian Process Regression (GPR), Support Vector Machines (SVM), and Boosted Trees (BT), with and without Bayesian Optimization (BO), for predicting the probability of cancer risk (PCR) from GW ingestion. On real data, during training, ANN achieved the lowest Mean Absolute Error (MAE = 0.1483), Mean Square Error (MSE = 0.1231), and Root Mean Square Error (RMSE = 0.3508), while GPR, SVM, and BT exhibited higher training errors. In the testing phase, ANN continued to lead with an MAE of 0.5733, MSE of 0.6356, and RMSE of 0.7972. When optimized with BO, ANN-BO achieved an MAE of 0.1686, MSE of 0.1097, and RMSE of 0.3312 during training, with GPR-BO performing almost identically (MAE = 0.1679, MSE = 0.1095, RMSE = 0.3310). During testing with BO, ANN-BO further improved (MAE = 0.0902, MSE = 0.0129, RMSE = 0.1136). However, on synthetic data, even optimized models like ANN-BO showed much higher testing error (MAE = 15.718, MSE = 374.53, RMSE = 19.353), underscoring limitations in capturing real-world complexities. High error values across models indicate that synthetic data alone is insufficient for accurate health risk assessments. Leveraging real-world data remains essential for enhancing predictive accuracy and minimizing errors, emphasizing the crucial role of data quality in achieving reliable cancer risk predictions from GW ingestion.
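The three error metrics quoted in the abstract follow standard definitions, and RMSE is simply the square root of MSE. The Python sketch below is illustrative only and is not the authors' code: the function names and the toy values are assumptions, and only the (MSE, RMSE) pairs in `reported` are taken from the abstract. It computes the metrics on a small made-up sample and checks that each quoted RMSE agrees with the square root of its quoted MSE.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |observed - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Square Error: average of (observed - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def rmse_matches_mse(mse_value, rmse_value, rel_tol=1e-2):
    """True if a reported RMSE is consistent with sqrt(reported MSE)."""
    return math.isclose(math.sqrt(mse_value), rmse_value, rel_tol=rel_tol)


# Toy values, invented for illustration (not data from the study).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]

# (MSE, RMSE) pairs as quoted in the abstract.
reported = {
    "ANN, real data, training": (0.1231, 0.3508),
    "ANN, real data, testing": (0.6356, 0.7972),
    "ANN-BO, real data, training": (0.1097, 0.3312),
    "ANN-BO, real data, testing": (0.0129, 0.1136),
    "ANN-BO, synthetic data, testing": (374.53, 19.353),
}

if __name__ == "__main__":
    print("MAE:", mae(y_true, y_pred))
    print("MSE:", mse(y_true, y_pred))
    print("RMSE:", rmse(y_true, y_pred))
    for label, (m, r) in reported.items():
        print(label, "consistent:", rmse_matches_mse(m, r))
```

Because RMSE is in the same units as the predicted quantity while MSE is in squared units, the large gap between the real-data and synthetic-data results is easiest to read from the MAE and RMSE columns.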
About the journal:
The Journal of Contaminant Hydrology is an international journal publishing scientific articles pertaining to the contamination of subsurface water resources. Emphasis is placed on investigations of the physical, chemical, and biological processes influencing the behavior and fate of organic and inorganic contaminants in the unsaturated (vadose) and saturated (groundwater) zones, as well as at groundwater-surface water interfaces. The ecological impacts of contaminants transported both from and to aquifers are of interest. Articles on contamination of surface water only, without a link to groundwater, are out of scope. Broad latitude is allowed in identifying contaminants of interest, which include legacy and emerging pollutants, nutrients, nanoparticles, pathogenic microorganisms (e.g., bacteria, viruses, protozoa), microplastics, and various constituents associated with energy production (e.g., methane, carbon dioxide, hydrogen sulfide).
The journal's scope embraces a wide range of topics including: experimental investigations of contaminant sorption, diffusion, transformation, volatilization and transport in the surface and subsurface; characterization of soil and aquifer properties only as they influence contaminant behavior; development and testing of mathematical models of contaminant behavior; innovative techniques for restoration of contaminated sites; development of new tools or techniques for monitoring the extent of soil and groundwater contamination; transformation of contaminants in the hyporheic zone; effects of contaminants traversing the hyporheic zone on surface water and groundwater ecosystems; subsurface carbon sequestration and/or turnover; and migration of fluids associated with energy production into groundwater.