Title: Interpretable SHAP-based machine learning-assisted design for selecting ultrafiltration membranes in protein-laden phosphate wastewater
Authors: Lukka Thuyavan Yogarathinam, Sani I. Abba, Jamilu Usman, Muthumareeswaran Ramamoorthy, Isam H. Aljundi
Journal: Cleaner Chemical Engineering, volume 11, Article 100187
Published: 2025-06-06
DOI: 10.1016/j.clce.2025.100187
URL: https://www.sciencedirect.com/science/article/pii/S2772782325000427
Abstract
Industrial wastewater contaminated with proteins and phosphates poses a significant challenge for clean water production. This study used regression-based machine learning (ML) algorithms to predict how well commercially available membranes with different pore sizes separate proteins of varying molecular weights from synthetic phosphate-laden wastewater. Four ML tools were evaluated: a bi-layered neural network (BNN), linear regression (LR), a least squares support vector machine (LSSVM), and Gaussian process regression (GPR). Correlation analysis was used to select the most pertinent input variables for each model combination while guarding against data leakage in the small dataset. Among the ML tools, the BNN and GPR algorithms predicted protein rejection effectively. Using all input variables together gave the highest predictive accuracy (R² = 0.99) for protein rejection, with minimal error for both the BNN and GPR algorithms. SHapley Additive exPlanations (SHAP) analysis indicated that the molecular weight cutoff (MWCO), protein molecular weight (PMw), and isoelectric point (IEP) were the most influential factors affecting protein separation performance, with mean SHAP values of approximately 25, 12, and 15, respectively. These three variables had a greater impact on the predictions than the hydrodynamic variables. The study offers insights into developing ML tools tailored to sparse datasets, particularly for accurately predicting protein separation from phosphate-laden wastewater.
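To make the "mean SHAP value" ranking in the abstract more concrete, the following is a minimal sketch of how Shapley attributions can be computed and averaged per feature. It is not the authors' pipeline: the model `toy_rejection`, its coefficients, and the three sample rows are hypothetical placeholders invented for illustration, and the sketch uses brute-force subset enumeration against a mean baseline rather than the SHAP library. Only the feature names (MWCO, PMw, IEP) come from the abstract.

```python
from itertools import combinations
from math import factorial

FEATURES = ["MWCO", "PMw", "IEP"]

# Hypothetical inputs (invented numbers, NOT data from the study).
SAMPLES = [
    (10.0, 15.0, 5.0),
    (30.0, 45.0, 7.0),
    (50.0, 60.0, 9.0),
]


def toy_rejection(x):
    """Hypothetical stand-in for a trained regressor (not the paper's BNN or GPR)."""
    mwco, pmw, iep = x
    return 50.0 - 0.4 * mwco + 0.3 * pmw + 2.0 * iep


def column_means(rows):
    """Per-feature mean, used here as the 'feature absent' baseline."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]


def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample by enumerating all feature subsets.

    Features outside the coalition are replaced by their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi


def mean_abs_shap(model, rows):
    """Average the absolute Shapley value of each feature over all rows."""
    baseline = column_means(rows)
    all_phi = [shapley_values(model, row, baseline) for row in rows]
    return {
        name: sum(abs(phi[j]) for phi in all_phi) / len(rows)
        for j, name in enumerate(FEATURES)
    }


importance = mean_abs_shap(toy_rejection, SAMPLES)
for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: mean |Shapley value| = {value:.2f}")
```

Exact enumeration costs on the order of 2^n model evaluations per feature, so it is only practical for a handful of inputs; with more features, an approximate explainer would presumably be needed.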