Interpretable SHAP-based machine learning-assisted design for selecting ultrafiltration membranes in protein-laden phosphate wastewater

Cleaner Chemical Engineering Pub Date : 2025-06-06 DOI:10.1016/j.clce.2025.100187

Lukka Thuyavan Yogarathinam, Sani I. Abba, Jamilu Usman, Muthumareeswaran Ramamoorthy, Isam H. Aljundi

Cleaner Chemical Engineering, Volume 11, Article 100187. https://www.sciencedirect.com/science/article/pii/S2772782325000427


Abstract

Industrial wastewater contaminated with proteins and phosphates poses a significant challenge for clean water production. This study used regression-based machine learning (ML) algorithms to predict how well commercially available membranes with different pore sizes separate proteins of varying molecular weights from synthetic phosphate-laden wastewater. The ML models were a bi-layered neural network (BNN), linear regression (LR), a least squares support vector machine (LSSVM), and Gaussian process regression (GPR). Correlation analysis was used to select the most pertinent input variables for each model combination while guarding against data leakage in the small dataset. Among the models, BNN and GPR predicted protein rejection effectively. Using all input variables together gave the highest predictive accuracy (R² = 0.99) for protein rejection, with minimal error for both BNN and GPR. SHapley Additive exPlanations (SHAP) analysis indicated that the molecular weight cutoff (MWCO), protein molecular weight (PMw), and isoelectric point (IEP) were the most influential factors affecting protein separation performance, with mean SHAP values of approximately 25, 12, and 15, respectively. MWCO, PMw, and IEP had a more substantial impact than the hydrodynamic variables. The study offers insights for developing ML tools tailored to sparse datasets, particularly for accurately predicting protein separation from phosphate-laden wastewater.
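
The abstract reports a mean SHAP value per input variable but does not say how these were computed. As a rough illustration of what such a number measures, the sketch below computes exact Shapley values by enumerating feature coalitions and then averages their absolute values over a few samples. It is not the authors' implementation: `toy_rejection`, its coefficients, the feature names, and the rows in `SAMPLES` are all hypothetical placeholders, not data or models from the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one sample x.

    Features outside a coalition take their values from each background row,
    and the prediction is averaged over the background rows.
    """
    n = len(x)

    def value(coalition):
        total = 0.0
        for row in background:
            mixed = [x[i] if i in coalition else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mean_abs_shap(predict, samples, background):
    """Average |Shapley value| per feature over all samples."""
    n = len(samples[0])
    totals = [0.0] * n
    for x in samples:
        for i, p in enumerate(shapley_values(predict, x, background)):
            totals[i] += abs(p)
    return [t / len(samples) for t in totals]


# Hypothetical stand-in for a fitted regressor (NOT from the paper).
def toy_rejection(features):
    mwco, pmw, iep, pressure = features
    return 100.0 * pmw / (pmw + mwco) - 2.0 * abs(iep - 7.0) - 0.5 * pressure


# Hypothetical feature names and made-up input rows, for illustration only.
FEATURES = ["MWCO_kDa", "PMw_kDa", "IEP", "pressure_bar"]
SAMPLES = [
    [10.0, 66.0, 4.7, 1.0],
    [30.0, 14.0, 11.0, 2.0],
    [100.0, 45.0, 4.5, 3.0],
    [50.0, 66.0, 4.7, 2.0],
]

if __name__ == "__main__":
    importance = mean_abs_shap(toy_rejection, SAMPLES, SAMPLES)
    for name, score in sorted(zip(FEATURES, importance), key=lambda t: -t[1]):
        print(f"{name}: {score:.2f}")
```

Enumeration costs on the order of 2^n model evaluations per feature, so it only suits a handful of inputs; for larger models a dedicated SHAP library with approximate explainers would be the usual choice. A useful sanity check is that the Shapley values of a sample sum to its prediction minus the mean prediction over the background rows.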

