Toward Generalizable Machine Learning Models for Dichloroacetonitrile Formation: Interpretable Insights and a Framework for Model Reliability
Authors: Rabbi Sikder, Guanghui Hua, Tao Ye
Journal: Water Research (Impact Factor 12.4; JCR Q1, Engineering, Environmental)
Published: 2025-10-15
DOI: 10.1016/j.watres.2025.124823 (https://doi.org/10.1016/j.watres.2025.124823)
Citation count: 0
Abstract
Haloacetonitriles (HANs) are highly toxic disinfection byproducts detected in drinking water. In this study, we applied machine learning (ML) to investigate the formation of dichloroacetonitrile (DCAN), the most common HAN, using a large literature-derived dataset. Among the four models evaluated, CatBoost demonstrated the best predictive performance. SHapley Additive exPlanation (SHAP) analysis revealed that DCAN formation is not governed solely by individual parameters but is substantially influenced by feature interactions. For instance, while dissolved organic carbon (DOC) is generally positively correlated with DCAN formation, this relationship tends to weaken at higher values of specific ultraviolet absorbance at 254 nm (SUVA254), underscoring the role of non-aromatic fractions in DCAN formation. The interaction between DOC and SUVA254 is further influenced by the disinfectant, with chloramination generally resulting in lower DCAN formation than chlorination. To assess model generalizability, we developed a Reliability Index (RI) framework, which integrates a distributional similarity score (Mahalanobis distance) and an anomaly detection score (One-Class Support Vector Machine) to quantify how representative new data are of the training set. The model showed strong performance on an external dataset when RI values exceeded 0.25. This study demonstrates the potential of ML in uncovering complex mechanisms driving DCAN formation and introduces RI as a transferable tool for evaluating the generalizability of predictive models.
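The abstract names the two ingredients of the Reliability Index (a Mahalanobis-distance similarity score and a One-Class SVM anomaly score) but does not state how they are scaled or combined. The following is a minimal sketch of one possible construction, not the authors' formula. Everything specific in it is an assumption made for illustration: the class name `ReliabilityIndex`, the percentile-based mapping of each score onto [0, 1], the equal weighting of the two scores, the One-Class SVM hyperparameters, and the synthetic three-feature training data. Because the scaling differs from the paper's, the reported threshold of 0.25 should not be expected to carry over to this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


class ReliabilityIndex:
    """Hypothetical RI: mean of two scores in [0, 1] (not the paper's formula)."""

    def fit(self, X_train):
        X_train = np.asarray(X_train, dtype=float)
        self.scaler_ = StandardScaler().fit(X_train)
        Z = self.scaler_.transform(X_train)

        # Mahalanobis ingredients: training centroid and inverse covariance.
        self.mean_ = Z.mean(axis=0)
        # Pseudo-inverse guards against a singular covariance matrix.
        self.cov_inv_ = np.linalg.pinv(np.cov(Z, rowvar=False))
        self.train_dist_ = np.sort(self._mahalanobis(Z))

        # One-Class SVM fitted on the training features only.
        # Assumed hyperparameters; the source does not report them.
        self.ocsvm_ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(Z)
        self.train_dec_ = np.sort(self.ocsvm_.decision_function(Z))
        return self

    def _mahalanobis(self, Z):
        diff = Z - self.mean_
        sq = np.einsum("ij,jk,ik->i", diff, self.cov_inv_, diff)
        return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negative round-off

    def score(self, X_new):
        Z = self.scaler_.transform(np.asarray(X_new, dtype=float))
        n = len(self.train_dist_)

        # Similarity: fraction of training points at least as far from
        # the training centroid as the new point.
        dist = self._mahalanobis(Z)
        similarity = 1.0 - np.searchsorted(self.train_dist_, dist, side="left") / n

        # Anomaly score: fraction of training points whose One-Class SVM
        # decision value is no higher than the new point's.
        dec = self.ocsvm_.decision_function(Z)
        normality = np.searchsorted(self.train_dec_, dec, side="right") / n

        # Assumed combination rule: unweighted mean of the two scores.
        return 0.5 * (similarity + normality)


# Synthetic stand-in for a training feature matrix (e.g. DOC, SUVA254, pH).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))
ri = ReliabilityIndex().fit(X_train)

inside = np.zeros((1, 3))        # near the centre of the training data
outside = np.full((1, 3), 8.0)   # far outside the training range
scores = ri.score(np.vstack([inside, outside]))
print(scores)
```

In this sketch a point near the centre of the training distribution receives a high score and a point far outside it receives a score near zero, which is the qualitative behaviour the abstract attributes to RI. Any cut-off for trusting a prediction would have to be calibrated against held-out or external data for the chosen scaling.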
About the journal:
Water Research, along with its open access companion journal Water Research X, serves as a platform for publishing original research papers covering various aspects of the science and technology related to the anthropogenic water cycle, water quality, and its management worldwide. The audience targeted by the journal comprises biologists, chemical engineers, chemists, civil engineers, environmental engineers, limnologists, and microbiologists. The scope of the journal includes:
•Treatment processes for water and wastewaters (municipal, agricultural, industrial, and on-site treatment), including resource recovery and residuals management;
•Urban hydrology including sewer systems, stormwater management, and green infrastructure;
•Drinking water treatment and distribution;
•Potable and non-potable water reuse;
•Sanitation, public health, and risk assessment;
•Anaerobic digestion, solid and hazardous waste management, including source characterization and the effects and control of leachates and gaseous emissions;
•Contaminants (chemical, microbial, anthropogenic particles such as nanoparticles or microplastics) and related water quality sensing, monitoring, fate, and assessment;
•Anthropogenic impacts on inland, tidal, coastal and urban waters, focusing on surface and ground waters, and point and non-point sources of pollution;
•Environmental restoration, linked to surface water, groundwater and groundwater remediation;
•Analysis of the interfaces between sediments and water, and between water and atmosphere, focusing specifically on anthropogenic impacts;
•Mathematical modelling, systems analysis, machine learning, and beneficial use of big data related to the anthropogenic water cycle;
•Socio-economic, policy, and regulations studies.