Toward Generalizable Machine Learning Models for Dichloroacetonitrile Formation: Interpretable Insights and a Framework for Model Reliability

IF 12.4 | Zone 1, Environmental Science and Ecology | Q1 ENGINEERING, ENVIRONMENTAL

Water Research Pub Date : 2025-10-15 DOI:10.1016/j.watres.2025.124823

Rabbi Sikder, Guanghui Hua, Tao Ye



Abstract

Haloacetonitriles (HANs) are highly toxic disinfection byproducts detected in drinking water. In this study, we applied machine learning (ML) to investigate the formation of dichloroacetonitrile (DCAN), the most common HAN, using a large literature-derived dataset. Among the four models evaluated, CatBoost demonstrated the best predictive performance. SHapley Additive exPlanation (SHAP) analysis revealed that DCAN formation is not solely governed by individual parameters but is substantially influenced by feature interactions. For instance, while dissolved organic carbon (DOC) is generally positively correlated with DCAN formation, this relationship tends to weaken at higher specific ultraviolet absorbance at 254 nm (SUVA₂₅₄) values, underscoring the role of non-aromatic fractions in DCAN formation. The interaction between DOC and SUVA₂₅₄ is further influenced by the disinfectant, with chloramination generally resulting in lower DCAN formation than chlorination. To assess model generalizability, we developed a Reliability Index (RI) framework, which integrates a distributional similarity score (Mahalanobis distance) and an anomaly detection score (One-Class Support Vector Machine) to quantify how representative new data are relative to the training set. The model showed strong performance on an external dataset when RI values exceeded 0.25. This study demonstrates the potential of ML in uncovering complex mechanisms driving DCAN formation and introduces RI as a transferable tool for evaluating the generalizability of predictive models.
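The abstract does not give the formula behind the Reliability Index, so the following is only a minimal sketch of how a Mahalanobis-based similarity score and a One-Class SVM score could be combined into one value per sample. The class name `ReliabilityIndexSketch`, the rank-based scaling to [0, 1], the equal weighting of the two scores, the SVM settings, and the synthetic data are all assumptions made for illustration. They are not the authors' method, and the 0.25 threshold reported above does not carry over to this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


class ReliabilityIndexSketch:
    """Hypothetical RI: mean of a Mahalanobis similarity and a One-Class SVM score."""

    def fit(self, X_train):
        # Standardize so that no single feature dominates either score.
        self.scaler = StandardScaler().fit(X_train)
        Z = self.scaler.transform(X_train)
        self.mean = Z.mean(axis=0)
        # Pseudo-inverse tolerates a near-singular covariance matrix
        # (assumes at least two features).
        self.cov_inv = np.linalg.pinv(np.cov(Z, rowvar=False))
        # Assumed settings; nu bounds the fraction of training outliers.
        self.svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(Z)
        # Sorted training scores serve as reference distributions.
        self.train_d = np.sort(self._mahalanobis(Z))
        self.train_s = np.sort(self.svm.decision_function(Z))
        return self

    def _mahalanobis(self, Z):
        diff = Z - self.mean
        sq = np.einsum("ij,jk,ik->i", diff, self.cov_inv, diff)
        return np.sqrt(np.maximum(sq, 0.0))  # guard against tiny negatives

    def score(self, X_new):
        Z = self.scaler.transform(X_new)
        n = len(self.train_d)
        # Similarity: fraction of training points at least as far from the centre.
        d = self._mahalanobis(Z)
        sim = 1.0 - np.searchsorted(self.train_d, d, side="left") / n
        # Inlier score: fraction of training points whose SVM score is no higher.
        s = self.svm.decision_function(Z)
        inlier = np.searchsorted(self.train_s, s, side="right") / n
        # Equal weighting is an assumption of this sketch.
        return 0.5 * (sim + inlier)


# Synthetic stand-in for a training set (not the DCAN dataset).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))
model = ReliabilityIndexSketch().fit(X_train)

# One sample near the training centre, one far outside the training range.
X_new = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
ri = model.score(X_new)
print(ri)
```

In this sketch a sample that resembles the training data scores close to 1 and a sample far outside it scores close to 0, so predictions for low-scoring samples would be treated with more caution.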


Source journal

Water Research (Environmental Sciences; Engineering, Environmental)

CiteScore: 20.80

Self-citation rate: 9.40%

Articles published: 1307

Review time: 38 days

Journal description: Water Research, along with its open access companion journal Water Research X, serves as a platform for publishing original research papers covering various aspects of the science and technology related to the anthropogenic water cycle, water quality, and its management worldwide. The audience targeted by the journal comprises biologists, chemical engineers, chemists, civil engineers, environmental engineers, limnologists, and microbiologists. The scope of the journal includes:
• Treatment processes for water and wastewaters (municipal, agricultural, industrial, and on-site treatment), including resource recovery and residuals management;
• Urban hydrology including sewer systems, stormwater management, and green infrastructure;
• Drinking water treatment and distribution;
• Potable and non-potable water reuse;
• Sanitation, public health, and risk assessment;
• Anaerobic digestion, solid and hazardous waste management, including source characterization and the effects and control of leachates and gaseous emissions;
• Contaminants (chemical, microbial, anthropogenic particles such as nanoparticles or microplastics) and related water quality sensing, monitoring, fate, and assessment;
• Anthropogenic impacts on inland, tidal, coastal and urban waters, focusing on surface and ground waters, and point and non-point sources of pollution;
• Environmental restoration, linked to surface water, groundwater and groundwater remediation;
• Analysis of the interfaces between sediments and water, and between water and atmosphere, focusing specifically on anthropogenic impacts;
• Mathematical modelling, systems analysis, machine learning, and beneficial use of big data related to the anthropogenic water cycle;
• Socio-economic, policy, and regulations studies.